
KIT Scientific Reports 7551

Proceedings of the

5th International Workshop on


Reconfigurable Communication-centric
Systems on Chip 2010 – ReCoSoC‘10

May 17-19, 2010


Karlsruhe, Germany
Michael Hübner, Loïc Lagadec, Oliver Sander, Jürgen Becker (eds.)

Report-Nr. KIT-SR 7551

Cover image:
Wikimedia Commons. Photographer: Meph666

Imprint
Karlsruher Institut für Technologie (KIT)
KIT Scientific Publishing
Straße am Forum 2
D-76131 Karlsruhe
www.uvka.de

KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association

This publication is available on the Internet under the following Creative Commons license: http://creativecommons.org/licenses/by-nc-nd/3.0/de/

KIT Scientific Publishing 2010


Print on Demand

ISSN 1869-9669
ISBN 978-3-86644-515-4
ReCoSoC'10 Reconfigurable Communication-centric Systems on Chip
Michael Hübner, Karlsruhe Institute of Technology, Karlsruhe, Germany
Loïc Lagadec, Université de Bretagne Occidentale, Lab-STICC, Brest, France

The fifth edition of the Reconfigurable Communication-centric Systems-on-Chip workshop (ReCoSoC 2010) was held in Karlsruhe, Germany, from May 17th to May 19th, 2010.

ReCoSoC is intended to be a periodic annual meeting to expose and discuss gathered expertise as well as state-of-the-art research around SoC-related topics through plenary invited papers and posters. As in 2008 and the years before, several keynotes given by internationally renowned speakers as well as special events such as tutorials underline the high quality of the program.

ReCoSoC is a three-day event which endeavors to encourage scientific exchanges and collaborations. This year again, ReCoSoC perpetuates its original principles: thanks to the generous sponsorship obtained from our partners, registration fees remain low.

Goals:
ReCoSoC aims to provide a prospective view of tomorrow's challenges in the multi-billion-transistor era, taking into account the emerging techniques and architectures exploring the synergy between flexible on-chip communication and system reconfigurability.

The topics of interest include:

- Embedded Reconfigurability in all its forms


- On-chip communication architectures
- Multi-Processor Systems-on-Chips
- System & SoC design methods
- Asynchronous design techniques
- Low-power design methods
- Middleware and OS support for reconfiguration and communication
- New paradigms of computation including bio-inspired approaches

Special thanks go to the local staff, especially to the local chair Oliver Sander, Mrs. Hirzler, Mrs. Bernhard and Mrs. Daum, who ensured a professional organization before and during the conference. Thanks to Gabriel Marchesan, the conference web page was always up to date and perfectly organized. Furthermore, Prof. Becker supported the conference through his group at the Institute for Information Processing Technology. We also thank the International Department for providing the "Hector Lecture Room" for the conference.

Michael Hübner
Loïc Lagadec
ReCoSoC 2010 Program Co-Chairs

ReCoSoC 2010 May 17-19, 2010, Karlsruhe, Germany i


Program Committee

Jürgen Becker Karlsruhe Institute of Technology Germany


Pascal Benoit LIRMM, Montpellier France
Koen Bertels TU Delft Netherlands
Christophe Bobda University of Potsdam Germany
Lars Braun Karlsruhe Institute of Technology Germany
René Cumplido INAO México
Debatosh Debnath Oakland University USA
Jean-Luc Dekeyser University of Lille France
Didier Demigny ENSSAT, Lannion France
Peeter Ellervee Tallinna Tehnikaülikool Estonia
Christian Gamrat CEA France
Georgi Gaydadjiev TU Delft Netherlands
Manfred Glesner TU Darmstadt Germany
Diana Goehringer Fraunhofer IOSB Germany
Jim Harkin University of Ulster Northern Ireland
Andreas Herkersdorf Technische Universität München Germany
Thomas Hollstein TU Darmstadt Germany
Michael Hübner Karlsruhe Institute of Technology Germany
Leandro Indrusiak University of York UK
Loïc Lagadec Université de Bretagne Occidentale France
Heiner Litz Universität Heidelberg Germany
Patrick Lysaght Xilinx Inc. USA
Fearghal Morgan NUI Galway Ireland
Johnny Öberg KTH Sweden
Ian O'Connor LEOM, Lyon France
Katarina Paulsson Ericsson Sweden
J.-L. Plosila University of Turku Finland
Bernard Pottier University of Bretagne Occidentale France
Ricardo Reis UFRGS Brazil
Michel Robert LIRMM, Montpellier France
Alfredo Rosado Universitat de Valencia Spain
Eduardo Sanchez EPFL Switzerland
Oliver Sander Karlsruhe Institute of Technology Germany
Gilles Sassatelli LIRMM, Montpellier France
Tiberiu Seceleanu University of Turku Finland
Dirk Stroobandt Universiteit Gent Belgium
Lionel Torres LIRMM, Montpellier France
Francois Verdier Université de Cergy-Pontoise France
Nikos Voros Technological Educational Institute of Mesolonghi Greece
Hans-Joachim Wunderlich Universität Stuttgart Germany
Peter Zipf TU Darmstadt Germany



Table of Contents

Session 1: Multiprocessor System on Chip

A Self-adaptive communication protocol allowing fine tuning between flexibility and performance in Homogeneous MPSoC systems ………… 1
Remi Busseuil, Gabriel Marchesan Almeida, Sameer Varyani, Pascal Benoit, Gilles
Sassatelli

Instruction Set Simulator for MPSoCs based on NoCs and MIPS Processors ………....….. 7
Leandro Möller, André Rodrigues, Fernando Moraes, Leandro Soares Indrusiak,
Manfred Glesner

Impact of Task Distribution, Processor Configurations and Dynamic Clock Frequency Scaling on the Power Consumption of FPGA-based Multiprocessors ………… 13
Diana Goehringer, Jonathan Obie, Michael Huebner, Juergen Becker

Session 2: Design-optimization of Reconfigurable Systems

Novel Approach for Modeling Very Dynamic and Flexible Real Time Applications ……. 21
Ismail Ktata, Fakhreddine Ghaffari, Bertrand Granado and Mohamed Abid

New Three-level Resource Management for Off-line Placement of Hardware Tasks on Reconfigurable Devices ………… 29
Ikbel Belaid, Fabrice Muller, Maher Benjemaa

Exploration of Heterogeneous FPGA Architectures ……………………………………… 37


Umer Farooq, Husain Parvez, Zied Marrakchi and Habib Mehrez

Session 3: Self-Adaptive Reconfigurable System

Dynamic Online Reconfiguration of Digital Clock Managers on Xilinx Virtex-II/Virtex-II Pro FPGAs: A Case Study of Distributed Power Management ………… 45
Christian Schuck, Bastian Haetzer, Jürgen Becker

Practical Resource Constraints for Online Synthesis …………………………………….. 51


Stefan Döbrich, Christian Hochberger

ISRC: a runtime system for heterogeneous reconfigurable architectures ………………… 59


Florian Thoma, Juergen Becker



Session 4: Fault Tolerant Systems

A Self-Checking HW Journal for a Fault Tolerant Processor Architecture …………….… 67


Mohsin Amin, Camille Diou, Fabrice Monteiro, Abbas Ramazani, Abbas Dandache

A Task-aware Middleware for Fault-tolerance and Adaptivity of Kahn Process Networks on Network-on-Chip ………… 73
Onur Derin, Erkan Diken

Dynamic Reconfigurable Computing: the Alternative to Homogeneous Multicores under Massive Defect Rates ………… 79
Monica Magalhães Pereira, Luigi Carro

Session 5: Analysis of FPGA Architectures

An NoC Traffic Compiler for efficient FPGA implementation of Parallel Graph Applications ………… 87
Nachiket Kapre, André DeHon

Investigation of Digital Sensors for Variability Characterization on FPGAs ………….….. 95


Florent Bruguier, Pascal Benoit, Lionel Torres

Investigating Self-Timed Circuits for the Time-Triggered Protocol …………….………… 101


Markus Ferringer

First Evaluation of FPGA Reconfiguration for 3D Ultrasound Computer Tomography ….. 109
Matthias Birk, Clemens Hagner, Matthias Balzer, Nicole Ruiter, Michael Huebner,
Juergen Becker

Session 6: Security on Reconfigurable Systems

ECDSA Signature Processing over Prime Fields for Reconfigurable Embedded Systems . 115
Benjamin Glas, Oliver Sander, Vitali Stuckert, Klaus D. Müller-Glaser, Jürgen Becker

A Secure Keyflashing Framework for Access Systems in Highly Mobile Devices ………. 121
Alexander Klimm, Benjamin Glas, Matthias Wachs, Jürgen Becker, Klaus D. Müller-
Glaser



Session 7: Reconfigurable Computing and Reconfigurable Education Special Session

Teaching Reconfigurable Processor: the Biniou Approach ………………………………. 127


Loic Lagadec, Damien Picard, Pierre-Yves Lucas

Behavioral modeling and C-VHDL co-simulation of Network on Chip on FPGA for Education ………… 135
C. Killian, C. Tanougast, M. Monteiro, C. Diou, A. Dandache, S. Jovanovic

Experimental Fault Injection based on the Prototyping of an AES Cryptosystem ……….. 141
Jean-Baptiste Rigaud, Jean-Max Dutertre, Michel Agoyan, Bruno Robisson, Assia Tria

Poster Session

Reducing FPGA Reconfiguration Time Overhead using Virtual Configurations …………. 149
Ming Liu, Zhonghai Lu, Wolfgang Kuehn, Axel Jantsch

Timing Synchronization for a Multi-Standard Receiver on a Multi-Processor System-on-Chip ………… 153
Roberto Airoldi, Fabio Garzia and Jari Nurmi

Mesh and Fat-Tree comparison for dynamically reconfigurable applications …………….. 157
Ludovic Devaux, Sebastien Pillement, Daniel Chillet, Didier Demigny

Technology Independent, Embedded Logic Cores: Utilizing synthesizable embedded FPGA-cores for ASIC design validation ………… 161
Joachim Knäblein, Claudia Tischendorf, Erik Markert, Ulrich Heinkel

A New Client Interface Architecture for the Modified Fat Tree (MFT) Network-on-Chip
(NoC) Topology …………………………………………………………………………… 169
Abdelhafid Bouhraoua and Muhammad E. S. Elrabaa

Implementation of Conditional Execution on a Coarse-Grain Reconfigurable Array …….. 173


Fabio Garzia, Roberto Airoldi, Jari Nurmi

Dynamically Reconfigurable Architectures for High Speed Vision Systems ……………... 175
Omer Kilic, Peter Lee

Virtual SoPC rad-hardening for satellite applications ……………………………………... 179


L. Barrandon, T. Capitaine, L. Lagadec, N. Julien, C. Moy, T. Monédière

A Self-adaptive communication protocol allowing
fine tuning between flexibility and performance in
Homogeneous MPSoC systems
Remi Busseuil, Gabriel Marchesan Almeida, Sameer Varyani, Pascal Benoit, Gilles Sassatelli
Laboratoire d’Informatique, de Robotique
et de Microélectronique de Montpellier (LIRMM)
Montpellier, France
Email: firstname.lastname@lirmm.fr

Abstract—MPSoCs have become a popular design style for embedded systems that permits devising tradeoffs between performance, flexibility and reusability. While most MPSoCs are heterogeneous for achieving a better power efficiency, homogeneous systems made of regular arrangements of a unique instance of a given processor open interesting perspectives in the area of on-line adaptation.

Among these techniques, task migration appears very promising as it allows performing load balancing at run-time for achieving a better resource utilization. Bringing such a technique into practice requires devising appropriate solutions in order to meet quality-of-service requirements. This paper puts the focus on a novel technique that tackles the difficult problem of inter-task communication during the transient phase of task migration. The proposed adaptive communication scheme is inspired by TCP/IP protocols and shows acceptable performance overhead while providing communication reliability at the same time.

I. INTRODUCTION

Thanks to technology shrinking, an integrated circuit can include an exponentially increasing number of transistors. This trend plays an important role at the economic level: although the price per transistor is rapidly dropping, the NRE (Non-Recurring Engineering) and fixed manufacturing costs increase significantly. This pushes the profitability threshold to higher production volumes, opening a new market for flexible circuits which can be reused for several product lines or generations, and for scalable systems which can be designed more rapidly in order to decrease the time-to-market. Moreover, from a technological point of view, current variability issues could be compensated by more flexible and scalable designs. In this context, Multiprocessor Systems-on-Chips (MPSoCs) are becoming an increasingly popular solution that combines the flexibility of software with potentially significant speedups.

These complex systems usually integrate a few mid-range microprocessors for which an application is usually statically mapped at design time. Those applications however tend to increase in complexity and often exhibit time-changing workloads, which makes mapping decisions suboptimal in a number of scenarios. These facts challenge the design techniques and methods that have been used for decades and push the community to research new approaches for achieving system adaptability and scalability. The most promising development tends to be homogeneous structures based on a Network-on-Chip (NoC), with distributed identical nodes containing both computing capabilities and memory. Such systems allow using advanced techniques that permit optimizing application mapping online. Among these techniques, this paper puts the focus on dynamic load balancing based on task migration.

Migrating processing tasks at runtime while ensuring real-time constraints are met requires devising precise deterministic protocols to guarantee application consistency. In this paper, we propose an adaptive communication protocol, based on the TCP and UDP models, which ensures determinism of critical migration mechanisms while providing enhanced useful services such as port opening.

This paper is organized as follows: Section 2 presents related work in the field of communication inside distributed structures and examples of Network-on-Chip protocols. Section 3 introduces HS-Scale, the platform developed with scalability, adaptability and reuse in mind, on which our communication protocol is tested. Section 4 describes how to ensure determinism and reliability with our protocol in the face of the problems raised by this type of platform. Finally, Section 5 shows some results concerning the performance of our protocol. Section 6 draws conclusions on the presented work and gives some perspectives on other upcoming challenges in the area.

II. STATE-OF-THE-ART

In this section we discuss the communication trends inside the newly emerging MPSoC architectures, i.e. based on a NoC, with a distributed memory architecture and a message passing communication model. In this context, we then review different task migration techniques. Finally, we present an overview of existing NoC communication protocols.

A. Communication inside distributed memory architectures

Nowadays, distributed memory structures tend to become the most attractive solution to address the scalability and reuse



challenges in newly emerging architectures. To ensure coherency between the hardware and the software in such systems, the message passing model of computation is the most commonly used - except in some marginal architectures like COMA [1] or ccNUMA [2] MPSoCs. This model of computation is based on explicit communication between tasks. Here, communications among tasks take place through messages and are implemented with functions allowing reading from and writing to communication channels. Synchronizations between tasks are explicit and made by blocking read or write primitives. CORBA, DCOM, SOAP, and MPI are examples of message passing models. The Message Passing Interface (MPI) is the most popular implementation of the message passing model and, to the best of our knowledge, embedded implementations exist only for this model [3].

Another hot issue in today's computing architectures is load balancing and resource usage optimization. Indeed, in today's MPSoCs, task migration techniques have been studied mainly to reduce hotspots and to increase the overall resource usage. The next paragraph gives a brief overview of some task migration techniques.

B. Task migration

For shared memory systems such as today's multi-core computers, task migration is facilitated by the fact that no data or code has to be moved across physical memories: since all processors are entitled to access any location in the shared memory, migrating a task comes down to electing a different processor for execution. But in the case of multiprocessor distributed memory/message passing architectures, both process code and state have to be migrated from one processor's private memory to another, and synchronizations must be performed using exchanged messages.

Task migration has also been explored for decreasing communication overhead or power consumption [4]. In [5], the authors present a migration case study for MPSoCs that relies on the μClinux operating system and a checkpointing mechanism. The system uses the MPARM framework [6], and although several memories are used, the whole system supports data coherency through a shared memory view of the system. In [7] the authors present an architecture aiming at supporting task migration for a distributed memory multiprocessor embedded system. The developed system is based on a number of 32-bit RISC processors without a memory management unit (MMU). The solution relies on the so-called task replicas technique: tasks that may undergo a migration are present on every processor of the system. Whenever a migration is triggered, the corresponding task is respectively inhibited on the initial processor and activated on the target processor. Although the efficiency of these techniques has been proven, none of these papers mention the communication issues raised by task migration, among which we can cite: localization of a task, setup of the communication, or maintenance of a communication channel during a task move.

C. Network-on-Chip communication

Network communication has been widely studied for many years, mainly in the context of clusters of PCs and High Performance Computing. However, Network-on-Chip communication, even if a lot of concepts can be taken from those studies, differs in many ways from traditional networks. A NoC, for example, rarely has to be dynamically expanded, so it does not need a live connection service. However, to provide reuse and scalability, a NoC protocol should be designed for an arbitrary number of nodes. [8] illustrates some techniques concerning on-chip communication, like energy-efficient protocols or lightweight encapsulation. As in standard networks, reordering and adaptive routing are often provided to avoid the saturation of a node, as in [9]. At the architectural level, regular structures like the 2D mesh [10] are the most frequently used, but other structures exist, like [11] which uses an octagon for example. This last article proposes both packet switching, where each packet is routed individually, and circuit switching, where a unique channel is opened during the whole communication process.

As for task migration, task placement can increase communication throughput. [12] proposes a circuit-switched NoC statically programmed at compile time to optimize the overall network bandwidth. A lightweight NoC architecture called HERMES, based on X-then-Y routing, is proposed in [10]. It provides a simple packet-switching network with a unique, predictable route for each packet from the same sender and receiver. Hence, neither reordering nor acknowledgment is necessary.

One particularity of these protocols is that they are hardware dependent: the structure of the NoC significantly influences the software communication policy. The next section presents the platform used to develop our new protocol.

III. HS-SCALE (HARDWARE AND SOFTWARE SCALABLE PLATFORM)

The key motivations of our approach being scalability and self-adaptability, the system presented in this paper is built around a distributed memory/message passing system that provides efficient support for task migration. The decision-making policy that controls task processes is also fully distributed, for scalability reasons. This system therefore aims at achieving continuous, transparent and decentralized runtime task placement on an array of processors, in order to optimize application mapping according to various, potentially time-changing criteria.

A. System overview

The architecture is made of a homogeneous array of PEs (Processing Elements) communicating through a packet-switching network. For this reason, the PE is called an NPU, for Network Processing Unit. Each NPU, as detailed later, has multitasking capabilities which enable time-sliced execution of multiple tasks. This is implemented thanks to a tiny preemptive multitasking operating system which runs on each NPU. The structural view of the NPU is depicted in Figure 1.
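The blocking message-passing primitives described in Section II-A can be illustrated with a small sketch. This is a minimal single-threaded model, not the actual HS-Scale or MPI API: the names `chan_send`/`chan_recv` and the fixed-depth channel are assumptions, and a real microkernel would suspend the calling task instead of returning an error.

```c
/* Minimal sketch of a blocking-style message-passing channel.
 * Illustrative only: names and the fixed FIFO depth are assumptions,
 * and a real kernel would block the task instead of returning -1. */
#include <stdint.h>

#define CHAN_DEPTH 4

typedef struct {
    uint32_t slots[CHAN_DEPTH];
    int head, tail, count;
} channel_t;

/* Enqueue one word; a real kernel would block the sender while full. */
static int chan_send(channel_t *ch, uint32_t word)
{
    if (ch->count == CHAN_DEPTH)
        return -1;                      /* stand-in for "block until space" */
    ch->slots[ch->tail] = word;
    ch->tail = (ch->tail + 1) % CHAN_DEPTH;
    ch->count++;
    return 0;
}

/* Dequeue one word; a real kernel would block the receiver while empty. */
static int chan_recv(channel_t *ch, uint32_t *word)
{
    if (ch->count == 0)
        return -1;                      /* stand-in for "block until data" */
    *word = ch->slots[ch->head];
    ch->head = (ch->head + 1) % CHAN_DEPTH;
    ch->count--;
    return 0;
}
```

Because both primitives synchronize the tasks implicitly, no shared memory or explicit locks are visible at the application level, which is the essence of the message passing model discussed above.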



Fig. 1. HS-Scale structural description

Fig. 2. HS-Scale available communication protocols

The NPU is built of two main layers, the network layer and the processing layer. The network layer is essentially a compact routing engine (XY routing). The processing layer is based on a simple and compact RISC microprocessor, its static memory and a few peripherals (a timer, an interrupt controller, a UART and a frequency scaler), as shown in Figure 1. A multitasking microkernel implements the support for time-multiplexed execution of multiple tasks [13].

The communication framework of HS-Scale is derived from the Hermes Network-on-Chip [10]. The lightweight operating system we use was designed for our specific needs, inspired by the RTOS of Steve Rhoads [14]. Despite being small (35 KB), this kernel does preemptive switching between tasks and also provides them with a set of communication primitives that are presented later. The OS is capable of dynamic task loading and dynamic task migration.

B. Self-adaptive mechanisms

The platform is entitled to take decisions that relate to application implementation through task placement. These decisions are taken in a fully decentralized fashion, as each NPU is endowed with equivalent decisional capabilities. Each NPU monitors a number of metrics that drive an application-specific mapping policy; based on this information, an NPU may decide to push or attract tasks, which results in respectively parallelizing or serializing the corresponding tasks' execution, as several tasks running on the same NPU are executed in a time-sliced manner.

Mapping decisions are specified on an application-specific basis in a dedicated operating system service. Although the policy may be focused on a single metric, composite policies are possible. Three metrics are available to the remapping policy for taking mapping decisions:
• NPU workload
• FIFO queue filling level
• Task distance

NPU workload is measured as the amount of time used to process the user tasks, i.e. excluding the idle task and the communication tasks. Task distance corresponds to the number of hops a packet needs to go through during a communication between two tasks. As the network structure is a 2D mesh, this measure can be computed as the Manhattan distance between the two nodes hosting the tasks.

Several migration policies have been developed, like the use of a static threshold to start a migration. In this case, the task is migrated to the neighbor with the best value of the selected metric - for example the neighbor with the lowest CPU workload. For more information about the task migration policies, we refer the reader to our previous paper [15].

C. Communication system

The network of HS-Scale is based on the HERMES NoC [10]. It provides a low-area-overhead packet-switching network thanks to a simple X-then-Y routing algorithm. Only two fields are needed to encapsulate a HERMES packet: one for the sender and receiver addresses, and a second for the number of 32-bit words inside the packet.

However, such a protocol is too simple to provide the high-level services usually found in real-time multi-task operating systems. In the standard Internet encapsulation model, four layers of functionality are provided: the link layer, the Internet layer, the transport layer and the application layer [16]. The HERMES protocol supplies the link-layer and Internet-layer functionality. The transport and application layers are therefore implemented in software - as OS services - using the concepts of TCP. To keep compatibility between HERMES and IP, the XY addresses of HERMES have been statically mapped to IP addresses. The transport layer was adapted to provide the same services as TCP and UDP: the notion of ports has been introduced, and redirection and retransmission have been included in TCP. An optional checksum provides reliability over non-reliable networks. As this network can be considered reliable, this feature has not been used, but it makes the protocol more generic in terms of possible hardware platforms.

Figure 2 shows the different protocols implemented in HS-Scale. A RAW protocol, simply using the Hermes routing layer with packets directly handed to the OS, has been made to provide a baseline for the bandwidth. When a task needs a communication channel, it uses an unused port, which becomes the identity of the communication channel. When the communication is closed, the port is marked as unused again.
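The two-field HERMES encapsulation and the port-based software transport layer described above might look as follows. This is a sketch under assumptions: the field widths, the packed XY address format and the protocol codes are illustrative, not the actual HS-Scale layout. The Manhattan-distance helper mirrors the "task distance" metric of the remapping policy.

```c
/* Sketch of the two-field HERMES header plus a software transport header.
 * All field widths and codes below are assumptions for illustration. */
#include <stdint.h>

/* Hardware-level HERMES header: addresses and payload size. */
typedef struct {
    uint8_t  dst_xy;    /* receiver address, assumed packed as X<<4 | Y */
    uint8_t  src_xy;    /* sender address, same packing                 */
    uint16_t n_words;   /* number of 32-bit words in the payload        */
} hermes_hdr_t;

/* Software transport header added by the OS, giving TCP/UDP-like ports. */
enum { PROTO_RAW = 0, PROTO_UDP = 1, PROTO_TCP = 2 };

typedef struct {
    uint8_t  proto;     /* RAW, UDP-like or TCP-like                    */
    uint16_t src_port;  /* identity of the sending channel              */
    uint16_t dst_port;  /* identity of the receiving channel            */
    uint32_t seq;       /* sequence number, used by the TCP-like mode   */
} transport_hdr_t;

/* Pack a 2D-mesh coordinate into the assumed one-byte XY address. */
static uint8_t xy_addr(unsigned x, unsigned y)
{
    return (uint8_t)((x << 4) | (y & 0xFu));
}

/* Manhattan distance between two packed addresses: the "task distance"
 * metric used by the remapping policy on a 2D mesh. */
static unsigned task_distance(uint8_t a, uint8_t b)
{
    int ax = a >> 4, ay = a & 0xF, bx = b >> 4, by = b & 0xF;
    return (unsigned)((ax > bx ? ax - bx : bx - ax) +
                      (ay > by ? ay - by : by - ay));
}
```

With an X-then-Y routed mesh, the hop count of a packet equals exactly this Manhattan distance, which is why the metric is cheap to compute at each node.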



Fig. 3. Inter-node task communication protocol

Fig. 4. Communication protocol during task migration
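The migration-time protocol switch that Figure 4 depicts - sender tasks are warned and move from the UDP-like mode to the TCP-like mode, then fall back to UDP once the migrated task announces its new position - can be sketched as a tiny state machine. The state and handler names below are hypothetical, not the actual HS-Scale OS interface.

```c
/* Hypothetical sketch of the sender-side state changes during a task
 * migration: UDP in steady state, TCP while the peer task is migrating. */
typedef enum { MODE_UDP, MODE_TCP } proto_mode_t;

typedef struct {
    proto_mode_t mode;   /* protocol currently used toward the peer task */
    int peer_node;       /* NPU currently hosting the peer task          */
} sender_t;

/* Step 1: the migrating task warns its senders; they switch to TCP so
 * that in-flight packets can be redirected and reordered safely. */
static void on_migration_warning(sender_t *s)
{
    s->mode = MODE_TCP;
}

/* Final step: the migrated task announces its new position; senders
 * update the destination and fall back to the fast UDP-like path. */
static void on_position_update(sender_t *s, int new_node)
{
    s->peer_node = new_node;
    s->mode = MODE_UDP;
}
```

Keeping the switch on the sender side means the migrating task only has to wait for one TCP packet per sender before moving, which is what bounds the transient phase.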

IV. COMMUNICATION ISSUES IN SELF-ADAPTIVE HOMOGENEOUS PLATFORMS

A. Different types of communication

A homogeneous and regular platform like HS-Scale has a more generic application domain than a heterogeneous, application-specific platform. Thus, the communication has to face numerous issues to provide both genericity and performance. We can distinguish different types of communication in HS-Scale:
• Communication between tasks intra-node, i.e. inside the same node
• Communication between tasks inter-node, i.e. between two different nodes
• Service messages, provided by the operating system of one node to another one
• Exception messages, provided by a task to an operating system to request a service

For intra-node communication between tasks, we use a simple FIFO queue handled by the operating system of the node. The operating system provides some basic primitives of MPI (Message Passing Interface), so tasks can easily communicate.

For inter-node communication between tasks, the protocol is more complicated. Figure 3 represents the concept of the protocol. First, we need to open a connection between the two tasks: a service message is sent to the master node to find the receiver task. As the RAW protocol does not provide any opening or closing of connections, the TCP three-step handshake protocol is used. After opening, we need to transfer the information with the best possible performance and with no need for Quality of Service, so the UDP protocol is used. At the end of the transmission, the TCP four-step closing protocol is used. This protocol is transparent for the task itself: the operating system handles it, and only the MPI services are exposed. The operating system has to check whether the receiving task is on the same node or on a remote one.

Service messages and exception messages need Quality of Service to ensure the reliability of the network. For this purpose, the TCP protocol is used to send these messages.

B. Task migration

The main communication issue during task migration can be expressed as follows: how to keep maximum performance during the transfer of a task from one NPU to another. Without any particular protocol, the simplest way to ensure that no packets are dropped is to close the communication during the task migration. Although this method can be considered reliable, the opening and closing protocol plus the loss of the connection bring a big overhead in terms of performance.

To ensure no loss of packets during migration, we have to focus our attention on two points: first, we need to ensure the reception and the ordering of the packets, so that no packets are dropped. Second, the packets have to be redirected in case they go to the wrong node. Those features are again part of the TCP protocol, so the idea here is also to use TCP to ensure the reliability of the system.

Figure 4 shows the different steps of the communication during a task migration - here, on the receiver side. Before the migration, the task which wants to migrate sends a message to the tasks communicating with it, warning them that it wants to migrate and thus switching the communication to TCP (action 1 in the figure). The migration can begin when the task has received at least one TCP packet from every sender task (2). If packets arrive during the migration, they are redirected to the new NPU which receives the task (3): if the task is ready, the packet is consumed; if not, the packet is stored in a FIFO and dropped when the FIFO is full. When the migration is complete, the task sends a message to the tasks communicating with it to update its position and to switch back to UDP. As TCP provides reordering, if TCP packets arrive late because of the redirection, the task can reorder the packets.

V. PERFORMANCE OF THE AUTO-ADAPTIVE PROTOCOL

A. Maximum achievable bandwidth

The first point we need to focus on in our communication protocol is the bandwidth. Indeed, the bandwidth has to be tuned to fit the purpose of the chip. Figure 5 shows the bandwidth for the two protocols available, comparing them with a raw transfer as described before. The comparison is made with different packet sizes and with different amounts of data. A timer in each node measures the transfer time, in order to



Fig. 5. TCP, UDP and RAW bandwidth versus packet size and amount of data transferred

Fig. 6. TCP and UDP bandwidth when the CPU is idle or under load

compute the bandwidth. The RAW communication can achieve
an average bandwidth of more than 600 kB/s, while UDP
achieves 550 kB/s at best, and TCP 300 kB/s. The performance
delta between RAW and UDP, less than 10%, proves the
efficiency of our UDP-like protocol. The TCP protocol, however,
has a bandwidth 50% smaller than the RAW communication, but
this protocol has been developed for precise, infrequent
communication processes like OS messages, the opening or
closing of a connection, or task migration.

The second point raised by Figure 5 is the bandwidth
variation with respect to the segmentation of the data and the
amount of data transmitted. While in RAW the variations are
quite low, they can reach 45% for the UDP and TCP protocols.
The low bandwidth obtained with really small packets (100
or 200 bytes) can be explained by the fact that the encapsulation
overhead in UDP and TCP is huge, around 50 bytes, which makes
the payload of each packet really small. On the contrary, for really
large packets, the hardware and software capability of an NPU is
too low to register the whole packet in one tick, which makes
the operating system run a rescheduling, and so lowers the
bandwidth.

However, the optimum size, around 750 bytes per packet,
is really platform-dependent. In hardware first, the
computation capability of the CPU influences the overall
processing time of a packet: the time to process a packet varies
with the frequency, the CPU architecture or the Processing Unit
design - with the addition of a dedicated network
encoder/decoder, for example. But in software too, the
optimum can vary greatly. Indeed, this optimum depends on
the average time between two reschedulings of the communication
procedure. Depending on the CPU load, the communication
procedure will be rescheduled less often, and so the
optimum will change. Hence, the next paragraph shows results
about bandwidth variations with a loaded CPU.

B. Performance under load

As the CPU is used for both communication decoding and
computation, it is interesting to see the influence of the CPU
load on the bandwidth. Figure 6 shows the bandwidth of
UDP/TCP communication when - for (a) in TCP and (c) in
UDP - the CPU is in a communication-only mode, i.e. there
is no other task running on the CPU, and when - for (b) in
TCP and (d) in UDP - the CPU is in heavy-load mode,
i.e. when the CPU runs an MJPEG decoder based on 3 inter-dependent
tasks. The packet size is fixed to 750 bytes, and the
measures are made on a 16 MB transfer. We can see that the
CPU load can lower the communication bandwidth down to
50% of its nominal value. This reduction can be explained by
the heavy computation needed to process TCP or UDP packets
with a general-purpose CPU like those used in HS-Scale.
Even so, the software implementation of TCP encapsulation
and decapsulation, responsible for these results, is not a bad
choice here: as this protocol is intended for
special-purpose communication, which can be considered
negligible in terms of data transferred compared to stream
inter-process communication, the throughput is not a critical
issue. In this case, this implementation seems more appropriate
than a hardware one, which would consume area.

The second point stressed by Figure 6 is the time between
two reschedulings of the communication protocol. The Operating
System in HS-Scale uses simple round-robin rescheduling
with fixed-size time slots called ticks: the rescheduling procedure
runs after each tick. As the communication procedure
is considered as a task, the time to reschedule varies with
the load of the CPU and with the sleeping-time
parameter. This parameter is the number of ticks between the
last run of the communication task and its re-insertion
in the round-robin queue. It can be tuned from 1 tick, which
corresponds to each task being rescheduled only once between
two communication runs, to any positive number.

Figure 7 illustrates this principle of rescheduling. In (a)
and (b), the CPU is in communication-only mode: in this
case, the rescheduling should happen as often as possible, to
avoid empty slots like in (b). In (c) and (d), the CPU is
in heavy-load mode: in this case, the rescheduling time
can influence the computation time, but also the bandwidth.
In terms of computation time, the ratio of time spent on the



communication task is, on average:

P = 1 / ((N - 1) + T)

where N is the number of ticks and T the number of tasks. In
terms of bandwidth too, the variations are no longer monotonous
versus the number of ticks: if the time spent to receive - in gray
in Fig. 7 (c) and (d) - is smaller than a timeslot, we can have a
situation, as illustrated in Fig. 7, where the average time spent
to receive is larger with few ticks than with many. This situation
can be observed in Figure 6 (d), where the bandwidth is higher
with 10 ticks than with 1 tick. We can conclude that the
rescheduling time can be an issue in this kind of architecture,
and that it has to be tuned accurately depending on the
application. If the CPU has a variable load, a dynamically
variable rescheduling time for the communication task,
depending on the receiving speed and the number of tasks, is
certainly the best solution.

[Fig. 7. Communication protocol, rescheduling issues: timelines (a)-(d) showing the slots of the communication task (Com), empty slots, application tasks T1-T4, and the rescheduling points of Com.]

VI. CONCLUSION

Future MPSoC architectures will have to ensure flexibility
and adaptability to face the new challenges raised by
technology shrinking and computing requirements. For this
purpose, an array of identical, software-independent nodes linked
by a NoC seems to be the most promising solution: it can
ensure generic computation with high performance thanks to
a good load-balancing strategy.

This article describes an adaptive communication protocol
purposely made to face dynamic task migration issues in such
homogeneous structures. As every node is independent, the
protocol has to deal with the asynchronicity of each node while
providing sequential behavior for critical code. For this purpose,
the protocol has been built taking inspiration from TCP and UDP
features. Thanks to their historically proven reliability, scenarios
of secure communication channel creation and conservation
during task migration have been demonstrated. Finally, performance
measurements show the interest of such a protocol, with a really small
overhead for UDP-like transactions.

REFERENCES

[1] F. Dahlgren and J. Torrellas, "Cache-only memory architectures," Computer, vol. 32, no. 6, pp. 72-79, 1999.
[2] P. Stenström, T. Joe, and A. Gupta, "Comparative performance evaluation of cache-coherent NUMA and COMA architectures," ISCA conference, 1992.
[3] M. Saldana and P. Chow, "TMD-MPI: an MPI implementation for multiple processors across multiple FPGAs," Proceedings of the International Conference on Field Programmable Logic and Applications (FPL), 2006.
[4] S. Carta, M. Pittau, A. Acquaviva, et al., "Multi-processor operating system emulation framework with thermal feedback for systems-on-chip," Proceedings of the Great Lakes Symposium on VLSI (GLSVLSI), 2007.
[5] S. Bertozzi, A. Acquaviva, D. Bertozzi, and A. Poggiali, "Supporting task migration in multi-processor systems-on-chip: a feasibility study," Proceedings of the Conference on Design, Automation and Test in Europe (DATE), 2006.
[6] L. Benini, D. Bertozzi, A. Bogliolo, F. Menichelli, and M. Olivieri, "MPARM: exploring the multi-processor SoC design space with SystemC," Journal of VLSI Signal Processing Systems for Signal, Image, and Video Technology, 2005.
[7] M. Pittau, A. Alimonda, S. Carta, and A. Acquaviva, "Impact of task migration on streaming multimedia for embedded multiprocessors: a quantitative evaluation," Proceedings of the IEEE/ACM/IFIP Workshop on Embedded Systems for Real-Time Multimedia (ESTIMedia), 2007.
[8] V. Raghunathan, M. B. Srivastava and R. K. Gupta, "A survey of techniques for energy efficient on-chip communication," DAC Conference, 2003.
[9] P. Guerrier and A. Greiner, "A generic architecture for on-chip packet-switched interconnections," DATE Conference, 2000.
[10] F. Moraes, N. Calazans, A. Mello, L. Müller and L. Ost, "HERMES: an infrastructure for low area overhead packet-switching networks on chip," IEEE VLSI Journal, 2004.
[11] F. Karim, A. Nguyen and S. Dey, "An interconnect architecture for networking systems on chips," IEEE Micro, 2002.
[12] J. Liang, A. Laffely, S. Srinivasan, and R. Tessier, "An architecture and compiler for scalable on-chip communication," VLSI-SoC, 2004.
[13] G. Marchesan Almeida, G. Sassatelli, P. Benoit, N. Saint-Jean, S. Varyani, L. Torres, and M. Robert, "An adaptive message passing MPSoC framework," International Journal of Reconfigurable Computing (IJRC), 2009.
[14] S. Rhoads, "Plasma - most MIPS I(TM)," http://www.opencores.org/project,plasma
[15] G. Marchesan Almeida, N. Saint-Jean, S. Varyani, G. Sassatelli, P. Benoit and L. Torres, "Exploration of task migration policies on the HS-Scale system," Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC'08), 2008.
[16] V. Cerf, Y. Dalal and C. Sunshine, "Specification of Internet Transmission Control Program," RFC 675, 1974.
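As a purely illustrative check of the scheduling ratio P = 1/((N-1)+T) from Section V, the following editor's sketch evaluates it for sample values; the method name `share` and the chosen values of N and T are hypothetical, not taken from the HS-Scale code base.

```java
// Editor's sketch: average share of CPU time given to the communication task
// under round-robin scheduling with a sleeping-time parameter of N ticks and
// T tasks, using the formula P = 1 / ((N - 1) + T) from Section V.
public class ReschedulingRatio {

    public static double share(int nTicks, int nTasks) {
        return 1.0 / ((nTicks - 1) + nTasks);
    }

    public static void main(String[] args) {
        // With 1 tick and 4 tasks, every round contains one Com slot: P = 1/4.
        System.out.println(share(1, 4));            // 0.25
        // Raising the sleeping time to 10 ticks starves Com: P = 1/13.
        System.out.printf("%.3f%n", share(10, 4));  // 0.077
    }
}
```

The two calls show why the sleeping-time parameter has to be tuned: the communication task's share drops quickly as N grows, matching the bandwidth loss discussed around Figure 6.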



Instruction Set Simulator for MPSoCs based on NoCs and MIPS Processors

Leandro Möller1, André Rodrigues1, Fernando Moraes2, Leandro Soares Indrusiak3, Manfred Glesner1

1 Darmstadt University of Technology - Institute of Microelectronic Systems - Darmstadt, Germany
2 Faculty of Informatics - Catholic University of Rio Grande do Sul - Porto Alegre, Brazil
3 Department of Computer Science - University of York - York, United Kingdom
Email: moller@mes.tu-darmstadt.de

Abstract

Even though Multiprocessor System-on-Chip (MPSoC) has been a
hot topic for a decade, Instruction Set Simulators (ISSs) for
it are still scarce. Data exchange among processors and
synchronization directives are some of the most required
characteristics that ISSs for MPSoCs should supply to
really make use of the processing power provided by the
parallel execution of processors. In this work a framework
for instantiating ISSs compatible with the MIPS processor
is presented. Communication among different ISS instances
is implemented by message passing, which is actually
performed by packets being exchanged over a NoC. The
NoC, the ISS and the framework that controls the co-simulation
between them are all implemented in Java. Both the
ISS and the framework are free open-source tools
implemented by third parties and available on the internet.

1. Introduction

Multiprocessor systems have become a standard in the
computer industry since the release of the Intel Pentium D
in 2005 [1]. Since then, processor manufacturers have
focused on multi-core architectures to raise the processing
power, favoring a larger number of cores instead of trying
to achieve higher clock speeds, thereby also avoiding the
complexity of superscalar pipelines. While executing
several small applications in parallel brings a significant
performance improvement on current multiprocessor
systems, a single complex application needs careful
development to use this processing power wisely. It is not
simply a matter of writing the application code with multiple
threads: each thread has to actually execute at the same time
as the other threads, instead of being paused in a wait directive.

While communication infrastructures based on buses have
been sufficient for multiprocessor systems so far, the
increasing number of cores and the associated data transfers
will demand a more complex on-chip interconnection. For
this purpose Networks-on-Chip (NoCs) have arisen as a
scalable solution to the future increase of the number of cores.
The use of a NoC requires no direct changes from the developer
of the complex application, but it counts when the
execution time of the complex application is being
analyzed.

The design space exploration of the scenario presented
in the previous paragraphs and the tools to aid the
development of complex applications are the goal of this
work. A MIPS-like processor was connected to the
HERMES NoC and presented in [2]. In [2] the debugging of
complex applications is implemented based on print
directives. The work presented here improves the
debuggability by providing an Instruction Set Simulator
(ISS) for the MIPS processor while considering the
communication time and traffic under simulation in the
NoC.

The ISS used in this work is the MARS ISS, developed
at the Missouri State University [3]. This ISS was
connected to the RENATO NoC model [4], which is an actor-oriented
model based on the HERMES NoC. The
simulation environment used to control both the simulation
of the NoC and the ISS is Ptolemy II [5], developed by
the EECS department at UC Berkeley.

The rest of this work is divided as follows. Section 2
presents other ISSs targeting MPSoC architectures. A
background about the tools and basic information required
to understand this work is presented in Section 3. Section 4
presents how the communication among ISSs takes place.
Section 5 presents timing delays of the system and Section
6 concludes this work.

2. Related Works

In this section different MPSoCs that have tools for
debugging their embedded software are presented. Table 1
summarizes the most important information of these works
and adds the work proposed in this paper. As presented in
Table 1, all works use SystemC as simulation engine and
memory-mapped techniques to communicate with other
processors, except the work proposed in this paper, which
uses the Ptolemy II simulation engine and the message
passing technique to communicate with other processors.

MPARM [6] uses ARM processors connected through an
AMBA bus to compose the MPSoC. Multiprocessor
applications are debugged with the SWARM ISS, which is
developed in C++ and was wrapped to communicate with
the MPSoC simulated in SystemC. The platform allows
booting multiple parallel uClinux kernels on independent
processors.

STARSoC [7] uses OpenRisc1200 processors connected
through a Wishbone bus. Debugging is implemented with the



OR1Ksim ISS, which is implemented in the C language. The
OR1Ksim can also be operated remotely using GDB. An
Operating System is not yet supported.

HVP [8] supports several processors and therefore
several ISSs. The work presented MPSoCs that contain
ARM9 processors using ARM's ISS, and in-house VLIW
and RISC processors debugged by the LisaTek ISS. The
ARM processors execute a lightweight operating system
(the name was not disclosed). The communication among
processors was reported to be AMBA among the ARM
processors and SimpleBus among the in-house processors
used.

SoClib [9] is a project developed jointly by 11
laboratories and 6 industrial companies. It contains
simulation models for processor cores, interconnect and
bus controllers, embedded and external memory
controllers, and peripheral and I/O controllers. The MPSoC
accepts the following processor cores: MIPS-32, PowerPC-405,
Sparc-V8, Microblaze, Nios-II, ST-231, ARM-7tdmi
and ARM-966. The GDB client/server protocol has been
implemented to interface with these processors. The
following operating systems are supported: DNA/OS,
MutekH, NetBSD, eCos and RTEMS. Several buses and
NoCs with different topologies, wrapped with the VCI
communication standard, were ported and presented at
www.soclib.fr.

The proposed work is based on a MIPS-like processor,
implemented in hardware by the Plasma processor
available for free at Opencores [10] and implemented by
MARS [3] when simulating the processor as an ISS. All
previous works used SystemC as simulation environment;
this work uses Ptolemy II [5]. This work also differs from
the others because it exchanges data between processors by
using the native protocol of the NoC, therefore no extra
translation is needed before sending and receiving packets.

Table 1 – MPSoCs that have tools for debugging embedded software.
Work ID   | Simulation engine | Processor      | Communication       | Data exchange | ISS     | OS
MPARM     | SystemC           | ARM            | Bus (AMBA)          | Memory        | SWARM   | uClinux
STARSoC   | SystemC           | OpenRisc 1200  | Bus (Wishbone)      | Memory        | OR1Ksim | No
HVP       | SystemC           | Several        | Bus (several)       | Memory        | Several | Yes
SoClib    | SystemC           | Several        | Bus / NoC (several) | Memory        | GDB     | Several
Proposed  | Ptolemy II        | Plasma (MIPS)  | NoC (HERMES)        | Message       | MARS    | No
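The message-passing data exchange that sets the proposed work apart in Table 1 can be sketched in Java as follows. This is an editor's illustration using a blocking queue as a stand-in for the receiver's NI buffer, not code from the MARS or RENATO tools; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Editor's sketch of message-passing data exchange: the sending "processor"
// pushes header and payload flits into the receiver's queue; no memory is
// shared between the two threads.
public class MessagePassingSketch {

    // Sends the packet on one thread and collects the payload flits on the caller's thread.
    public static List<Integer> transfer(int[] packet) throws InterruptedException {
        BlockingQueue<Integer> niInputBuffer = new ArrayBlockingQueue<>(16);

        Thread sender = new Thread(() -> {
            for (int flit : packet) {
                try {
                    niInputBuffer.put(flit);   // blocks when the buffer is full (flow control)
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });
        sender.start();

        int target = niInputBuffer.take();     // header flit: target router address
        int size = niInputBuffer.take();       // header flit: number of payload flits
        List<Integer> payload = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            payload.add(niInputBuffer.take());
        }
        sender.join();
        return payload;
    }

    public static void main(String[] args) throws InterruptedException {
        // Packet: target router 21 (0x21), 2 payload flits.
        System.out.println(transfer(new int[]{0x21, 2, 9, 4}));   // prints [9, 4]
    }
}
```

The bounded queue also mimics the back-pressure of a NoC input buffer: a sender that outruns the receiver simply blocks instead of overwriting shared state.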

3. Background

This section reviews the required infrastructure to build
our MPSoC simulation environment.

3.1. Ptolemy II

Ptolemy II [5] is a framework developed by the
Department of Electrical Engineering and Computer
Sciences of the University of California at Berkeley and it
is implemented in Java. The key concept behind Ptolemy II
is the use of well-defined models of computation to
manage the interactions between various actors and
components. In this work only the Discrete Event (DE)
model of computation was used, but others are available in
Ptolemy II.

In DE, the communication between actors is modeled as
tokens being sent across connections. The sent token and
its timestamp constitute an event. When an actor receives
an event, it is activated and a reaction might occur, which
may change the internal state of the actor and/or generate
new events, which might in turn generate other
reactions. The events are processed chronologically [5].

3.2. MARS ISS

MARS [3] is a MIPS Instruction Set Simulator (ISS).
This means that MARS simulates the execution of
programs written in the MIPS assembly language. MARS
can be executed via the command line or a Graphical User
Interface. MARS was developed by Peter Sanderson and
Kenneth Vollmar, from the Missouri State University, and
is written entirely in Java and distributed as an executable
Jar file. MARS can simulate 155 basic instructions from
the MIPS-32 instruction set, as well as about 370 pseudo-instructions
or instruction variations, 17 syscall functions
for console and file I/O and 21 syscalls for other uses.

3.3. RENATO NoC

The RENATO NoC [4] was developed using the Ptolemy II
framework and its behavior and timing constraints are
based on the HERMES NoC. The basic element of the NoC
is a five-port bi-directional router, which is connected to 4
neighbor routers and to a local IP core, following a
MESH topology. The router employs an XY routing
algorithm, a round-robin arbitration algorithm and input
buffers at each input port.

The RENATO NoC model can be connected to a
debugging tool called NoCScope [11]. NoCScope provides
improved observability of RENATO routers and overall
resources in use. Seven scopes are currently available,
allowing the user to see information about hot spots, power
consumption, buffer occupation, input traffic, output
traffic, and end-to-end and point-to-point communications.
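The DE semantics just described, timestamped events processed chronologically with reactions possibly posting new events, can be sketched as follows. This is an editor's illustration of the principle only, not the Ptolemy II API; the class and token names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Editor's sketch of the Discrete Event principle: events are (timestamp, token)
// pairs kept in a time-ordered queue; reacting to an event may post new events,
// which are then processed in chronological order.
public class DiscreteEventSketch {

    static final class Event {
        final long time;
        final String token;
        Event(long time, String token) { this.time = time; this.token = token; }
    }

    // Drains the event queue in timestamp order; here a "target flit" reaction
    // schedules a follow-up "size flit" event two time units later.
    public static List<String> run(List<Event> initial) {
        PriorityQueue<Event> calendar =
                new PriorityQueue<>(Comparator.comparingLong((Event e) -> e.time));
        calendar.addAll(initial);

        List<String> trace = new ArrayList<>();
        while (!calendar.isEmpty()) {
            Event e = calendar.poll();               // always the earliest pending event
            trace.add(e.time + " " + e.token);
            if (e.token.equals("target flit")) {     // a reaction may generate new events
                calendar.add(new Event(e.time + 2, "size flit"));
            }
        }
        return trace;
    }

    public static void main(String[] args) {
        List<Event> events = List.of(
                new Event(3087, "forward to NoC"),
                new Event(3002, "target flit"),
                new Event(3003, "payload flit"));
        run(events).forEach(System.out::println);
        // 3002 target flit / 3003 payload flit / 3004 size flit / 3087 forward to NoC
    }
}
```

Note that the insertion order of the initial events is irrelevant: the time-ordered queue is what enforces the chronological processing that DE requires.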



[Figure 1 – Block diagram of the proposed multiprocessor ISS: a 3x3 mesh of RENATO routers (00 to 22) inside Ptolemy II, each local port connecting a Processing Element (PE) that couples a MIPS processor to a Network Interface (NI) with separate output and input flit buffers. The PE excerpt shows the SENDFLIT loop (lb / mtc0 / addi / subi / bgtz) used to push flits to the NI and the RCVDFLIT handler (bgtz / mfc0 / move / li / eret) used to read them back.]

4. Communication among processors

This section presents how the MARS ISS was connected
to the RENATO NoC to allow the creation of a
multiprocessor ISS. Figure 1 shows a block diagram of the
system, which will be used in the next subsections to guide the
explanation of each component.

4.1. Processor to NI

In the current version of this work, each processor
executes the MIPS assembly code of one task of the
application. Communication between tasks happens by
exchanging packets. In order to send a packet to another
task, the header of the packet and the packet data first need to
be stored in the data memory of the processor. The
header of the packet is composed of the address of the
target router where the processor is connected and the
number of data flits this packet contains. After that, the
send packet subroutine is called.

The send packet subroutine first reads the size flit of the
packet stored in the memory to a register, and reads to
another register the output buffer space available in the NI. If
there is enough space available in the NI to store the
packet, the subroutine proceeds to send the packet flit by
flit to the NI. The process of "reading" a flit from the NI
uses the instruction "move from coprocessor 0" (mfc0),
while the process of "sending" a flit to the NI uses the
instruction "move to coprocessor 0" (mtc0). Thus, from the
point of view of the processor, coprocessor 0 is now the NI.

4.2. NI to NoC

With the packet stored in the NI output buffer, the NI
sends the packet flit by flit to the input local port of the
router to which this NI is connected. This happens following
the flow control protocol in use by the NoC and using the
timing delays set on the NoC model being executed by
Ptolemy.

4.3. NoC to NI

When packets are being received from the NoC into the
NI, a different buffer (the input buffer) is used, thus allowing
parallel sending and receiving of packets. The receiving of
packets also occurs following the flow control in use by the
NoC and using the timing delays set on the NoC model.

4.4. NI to processor

As soon as the flits of the packet arrive in the input
buffer of the NI, the NI launches a specific interrupt to
the processor, meaning that a new packet has arrived. The
MARS ISS, which was executing its task, saves its context
and receives the interrupt in the form of a Java
exception. The standard routine for handling exceptions is
called. By the ID of the specific exception, the exact
exception is found out to be the "new message from
network" exception. The specific subroutine of this
exception is launched. This subroutine mainly reads the



complete packet from the NI, using the "move from
coprocessor 0" (mfc0) instruction to read each flit of the
packet. After the complete packet has been read from the NI and
stored in the processor's memory, the processor's context
is restored and it can now continue its execution,
possibly using the data that was received.

5. Synchronization

The straightforward solution in Java to connect more
than one MARS ISS to the NoC is to create a new MARS
instance object for every new MARS instantiated in the
NoC. However, this alternative failed due to the fact that
MARS has been programmed using several static classes,
attributes and methods. All of its main resources, such as
the memory and the register bank, are declared as static.
Therefore, if one tries to run more than one instance of
MARS concurrently inside a single Java Virtual Machine
(JVM), all the running instances will share the same
resources, which will lead to unexpected behavior.

One possible workaround for this problem is to run each
MARS instance in a different JVM. Java does not directly
share memory between multiple VMs, so by running each
MARS in a different JVM, one safely isolates each
instance of MARS. One problem with this approach is that
the exchange of messages between different JVMs is only
possible by using APIs such as Java Remote Method
Invocation (RMI) and sockets, which would greatly
increase the complexity of the system.

Another solution would be to reprogram MARS to
remove the problematic static attributes and make them
unique for each instance. However, this solution was also
not optimal, considering the large number of static
members declared in MARS and that every future
version of MARS would also require these modifications.

A better solution is to instantiate isolated ClassLoaders,
one for each instance of MARS to be loaded. This works
because a static element in Java is unique only in the
context of a ClassLoader, therefore the static elements will
not interfere with the other instances of MARS loaded by
other ClassLoaders. By using this approach, the task of
exchanging messages between the MARS instance and its
corresponding NI also becomes trivial, and can be done
simply by injecting a NI object when instantiating MARS.

A side effect of this solution is that each MARS instance
and the NoC are considered as different threads by Java,
and this would require extra algorithms based on wait and
notify directives to maintain the time constraints followed
by the NoC. As the main goal of this work is not to provide
good latency figures for the multiprocessor system
application under simulation, we proceeded without the
extra algorithms, aiming at a faster simulation. Figure 2
presents a printout of the most important events that occurred
during the transfer of a packet composed of 2 header flits
and 10 payload flits from MARS #1 to MARS #2. MARS
#1 is connected to router 00, as illustrated in Figure 1, and
MARS #2 is connected to router 21. No extra traffic is
currently occupying the NoC.

3002 MARS #1 sending target flit (21) to NI #1
3002 MARS #1 sending size flit (10) to NI #1
3002 MARS #1 sending payload flit #0 (9) to NI #1
3003 MARS #1 sending payload flit #1 (9) to NI #1
3003 MARS #1 sending payload flit #2 (4) to NI #1
3003 MARS #1 sending payload flit #3 (7) to NI #1
3003 MARS #1 sending payload flit #4 (1) to NI #1
3003 MARS #1 sending payload flit #5 (3) to NI #1
3003 MARS #1 sending payload flit #6 (8) to NI #1
3003 MARS #1 sending payload flit #7 (2) to NI #1
3003 MARS #1 sending payload flit #8 (6) to NI #1
3086 MARS #1 sending payload flit #9 (5) to NI #1
3087 NI #1 sending target flit (21) to NoC
3089 NI #1 sending size flit (10) to NoC
3091 NI #1 sending payload flit #0 (9) to NoC
3093 NI #1 sending payload flit #1 (9) to NoC
3095 NI #1 sending payload flit #2 (4) to NoC
3097 NI #1 sending payload flit #3 (7) to NoC
3099 NI #1 sending payload flit #4 (1) to NoC
3101 NI #1 sending payload flit #5 (3) to NoC
3103 NI #1 sending payload flit #6 (8) to NoC
3105 NI #1 sending payload flit #7 (2) to NoC
3107 NI #1 sending payload flit #8 (6) to NoC
3109 NI #1 sending payload flit #9 (5) to NoC
3112 NoC sending target flit (21) to NI #2
3116 NoC sending size flit (10) to NI #2
3120 NoC sending payload flit #0 (9) to NI #2
3120 NI #2 sending payload flit #0 (9) to MARS #2
3124 NoC sending payload flit #1 (9) to NI #2
3128 NoC sending payload flit #2 (4) to NI #2
3132 NoC sending payload flit #3 (7) to NI #2
3136 NoC sending payload flit #4 (1) to NI #2
3140 NoC sending payload flit #5 (3) to NI #2
3144 NoC sending payload flit #6 (8) to NI #2
3148 NoC sending payload flit #7 (2) to NI #2
3152 NoC sending payload flit #8 (6) to NI #2
3156 NoC sending payload flit #9 (5) to NI #2
3166 NI #2 sending payload flit #1 (9) to MARS #2
3170 NI #2 sending payload flit #2 (4) to MARS #2
3172 NI #2 sending payload flit #3 (7) to MARS #2
3174 NI #2 sending payload flit #4 (1) to MARS #2
3175 NI #2 sending payload flit #5 (3) to MARS #2
3177 NI #2 sending payload flit #6 (8) to MARS #2
3178 NI #2 sending payload flit #7 (2) to MARS #2
3180 NI #2 sending payload flit #8 (6) to MARS #2
3181 NI #2 sending payload flit #9 (5) to MARS #2

Figure 2 – Timing delays of the most important events during the transfer of a packet between two processors.

All the following comments presented in this paragraph
refer to Figure 2. Between times 3002 and 3086 MARS #1
sends the packet to the NI connected to it (NI #1), exactly
as explained in Section 4.1. Eleven of the twelve flits of the
packet were sent in the first 2 simulation cycles, and the
last flit of the packet at time 3086. This strange behavior
implies the following results: (1) the MARS #1 thread was
executed two times concurrently to the Ptolemy thread,
between times 3002-3003 and at 3086; (2) the MARS thread can
be fast enough to execute at least 11 mtc0 instructions in a
row during 2 simulation cycles of Ptolemy; (3) the MARS
thread was not called again during 83 simulation cycles
(3086-3003). Between times 3087 and 3109 each flit of the
packet was sent constantly every 2 simulation cycles from
NI #1 to the NoC, exactly as explained in Section 4.2. This
behavior is equal to the real HERMES NoC, which needs 2
clock cycles to transfer a flit using handshake flow control.
Between times 3112 and 3156 all the flits from the packet
were delivered from the NoC to NI #2, as explained in
Section 4.3. However, due to some technical difficulties in
the current version, it was not possible to deliver each flit
every 2 simulation cycles, but 4 simulation cycles in this



case. At time 3120 it is possible to see that NI #2 delivered
the first payload flit immediately to MARS #2. Between
times 3166 and 3181 the rest of the payload flits were
delivered to MARS #2, as described in Section 4.4. Here
again it is possible to see that the data transfer did not
follow a constant pattern, similar to the one that occurred
between times 3002 and 3086. This unpredictable behavior
is a side effect of running multiple threads with no proper
synchronization.

6. Conclusion and Future Work

This work presented an ISS for multiprocessor systems
based on the MIPS processor. In this work the RENATO
NoC model was connected to two instances of the MARS
ISS, and as a result applications based on more than one
processor can be easily debugged with the presented
approach. The most important contribution of this work is
the NI, which allows both systems to communicate, thus
creating a more realistic multiprocessing system model
composed of computation and communication.

Initial figures regarding the latency of the communication
between processors through the NoC were measured, and we
report them to be insufficient in the current version. In order to
have good latency figures we must: (1) back-annotate the
timing delays of each assembly instruction from a real
MIPS processor to MARS; (2) add extra synchronization
logic to mimic the timing delays between processor and NI.
In the current version of this work we guarantee only the
NoC timing delays as presented in [4]. Future work will
be related to steps 1 and 2.

References

[1] Intel Corporation. Intel Pentium D (Smithfield) Processor. Available at: http://ark.intel.com/ProductCollection.aspx?codeName=5788.
[2] Carara, E.; Oliveira, R.; Calazans, N.; Moraes, F. "HeMPS - A Framework for NoC-Based MPSoC Generation". In: ISCAS'09, 2009, pp. 1345-1348.
[3] Vollmar, K. and Sanderson, P. "A MIPS assembly language simulator designed for education". Journal of Computing Sciences in Colleges, vol. 21(1), Oct. 2005, pp. 95-101.
[4] Indrusiak, L.S.; Ost, L.; Möller, L.; Moraes, F.; Glesner, M. "Applying UML Interactions and Actor-Oriented Simulation to the Design Space Exploration of Network-on-Chip Interconnects". In: ISVLSI'08, 2008, pp. 491-494.
[5] Eker, J.; Janneck, J.; Lee, E.; Liu, J.; Liu, X.; Ludvig, J.; Neuendorffer, S.; Sachs, S.; Xiong, Y. "Taming Heterogeneity - The Ptolemy Approach". Proceedings of the IEEE, vol. 91 (2), Jan. 2003, pp. 127-144.
[6] Benini, L.; Bertozzi, D.; Bogliolo, A.; Menichelli, F.; Olivieri, M. "MPARM: Exploring the Multi-Processor SoC Design Space with SystemC". The Journal of VLSI Signal Processing, vol. 41 (2), Sep. 2005, pp. 169-182.
[7] Boukhechem, S.; Bourennane, E. "SystemC Transaction-Level Modeling of an MPSoC Platform Based on an Open Source ISS by Using Interprocess Communication". International Journal of Reconfigurable Computing, vol. 2008, Article ID 902653, 2008, 10 p.
[8] Ceng, J.; Sheng, W.; Castrillon, J.; Stulova, A.; Leupers, R.; Ascheid, G.; Meyr, H. "A high-level virtual platform for early MPSoC software development". In: CODES+ISSS'09, 2009, pp. 11-20.
[9] Pouillon, N.; Becoulet, A.; Mello, A.; Pecheux, F.; Greiner, A. "A Generic Instruction Set Simulator API for Timed and Untimed Simulation and Debug of MP2-SoCs". In: RSP'09, 2009, pp. 116-122.
[10] Opencores. Available at: http://www.opencores.org.
[11] Möller, L.; Indrusiak, L.S.; Glesner, M. "NoCScope: A Graphical Interface to Improve Networks-on-Chip Monitoring and Design Space Exploration". In: IDT'09, 2009.
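The ClassLoader-based isolation adopted in Section 5 can be demonstrated with the following self-contained sketch. This is an editor's illustration, not code from MARS; the `Counter` class stands in for MARS's static state, and the sketch assumes the compiled `Counter.class` is reachable as a classpath resource.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

// A static field lives once per defining ClassLoader, not once per JVM: reloading
// the same class through two isolated loaders yields two independent "value" fields.
class Counter {
    static int value = 0;
    static void inc() { value++; }
}

public class ClassLoaderIsolation {

    // Redefines Counter from its .class resource instead of delegating to the parent.
    static class IsolatingLoader extends ClassLoader {
        @Override
        protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
            if (!name.equals("Counter")) {
                return super.loadClass(name, resolve);   // JDK classes still come from the parent
            }
            try (InputStream in = ClassLoader.getSystemResourceAsStream("Counter.class")) {
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                byte[] buf = new byte[4096];
                for (int n; (n = in.read(buf)) != -1; ) {
                    out.write(buf, 0, n);
                }
                byte[] bytes = out.toByteArray();
                return defineClass(name, bytes, 0, bytes.length);
            } catch (IOException e) {
                throw new ClassNotFoundException(name, e);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Class<?> a = new IsolatingLoader().loadClass("Counter");
        Class<?> b = new IsolatingLoader().loadClass("Counter");

        java.lang.reflect.Method inc = a.getDeclaredMethod("inc");
        inc.setAccessible(true);
        inc.invoke(null);                     // bumps the static of loader A's copy only

        java.lang.reflect.Field fa = a.getDeclaredField("value");
        java.lang.reflect.Field fb = b.getDeclaredField("value");
        fa.setAccessible(true);
        fb.setAccessible(true);
        System.out.println(fa.getInt(null) + " " + fb.getInt(null));
    }
}
```

Because the two loaded classes are distinct runtime types, incrementing one copy's static leaves the other untouched, which is exactly why each MARS instance keeps its own memory and register bank.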



Impact of Task Distribution, Processor
Configurations and Dynamic Clock Frequency
Scaling on the Power Consumption of FPGA-based
Multiprocessors

Diana Goehringer, Jonathan Obie
Fraunhofer IOSB, Ettlingen, Germany
{diana.goehringer, jonathan.obie}@iosb.fraunhofer.de

Michael Huebner, Juergen Becker
ITIV, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
{michael.huebner, becker}@kit.edu

Abstract— As only the currently required functionality on a dynamically reconfigurable FPGA-based system is active, a good performance per power ratio can be achieved. To find such a good performance per power ratio for a given application is a difficult task, as it requires not only knowledge of the behavior of the application, but also knowledge of the underlying hardware architecture and its influence on the performance and on the static and dynamic power consumption. Is it, for example, better to use two processors running at half the clock frequency than a single processor? The main contributions of this paper are the description of a tool flow to measure the power consumption of multiprocessor systems in Xilinx FPGAs and a novel runtime-adaptive architecture for analyzing the performance per power tradeoff and for dynamic clock frequency scaling based on the inter-processor communication. Furthermore, we use three different application scenarios to show the influence of the clock frequency, of different processor configurations and of different application partitions on the static and dynamic power consumption as well as on the overall system performance.

Keywords- Power Consumption, Multiprocessor System-on-Chip (MPSoC), Dynamic Frequency Scaling, Task Distribution, Application Partitioning, Dynamic and Partial Reconfiguration, FPGA.

I. INTRODUCTION
Parameterizable function blocks used in FPGA-based system development open a huge design space, which can hardly be managed by the user. Examples are arithmetic blocks like dividers, adders and soft-IP multipliers, which are adjustable in terms of bit width and parallelism. In addition to arithmetic blocks, soft-IP processor cores also provide a variety of parameters, which can be adapted to the requirements of the application to be realized with the system. In particular, Xilinx offers with the MicroBlaze soft-IP RISC processor [1] a variety of options for characterizing the core individually. These options include, among others, the use of cache memory and its size, the use of an arithmetic unit, a memory management unit and the number of pipeline stages. Furthermore, the tools offer to deploy up to two processor cores as a multiprocessor on one FPGA. Every option can be adjusted to find an optimal parameterization of the processor core in relation to the target application. For example, a specific cache size can speed up the application tremendously, but the partitioning of functions onto the two cores also has a strong impact on the speed and power consumption of the system. These examples show the huge design space even if only a single parameter is varied. Obviously, the use of multiple parameters for system adjustment leads to a multidimensional optimization problem, which is hardly manageable by the designer. In order to gain experience regarding the impact of processor parameterization for a specific application scenario, it is beneficial to evaluate, e.g., the performance and power consumption of an FPGA-based system and to normalize the results to a standard design with a default set of parameters. The result of such an investigation is a first step towards standard guidelines for designers and towards an abstraction of the design space in FPGA-based system design. This paper presents first results for a parameterizable multiprocessor system on a Xilinx Virtex-4 FPGA, where the parameterization of the processor is evaluated in terms of power consumption and performance. Moreover, varying partitionings of the different application scenarios are evaluated in terms of power consumption for a fixed performance. For this purpose, a tool flow for analyzing the power consumption by generating the value change dump (VCD) file from the post place and route simulation is introduced. The presented flow generates the most accurate power consumption estimation available at this level of abstraction. A further output of the presented work is an overview of the impact of parameterization on the power consumption. The results can be used as a basic guideline for designers who want to optimize their system performance and power consumption.
The paper is organized in the following manner: In Section II related work is presented. Section III describes the power estimation tool flow used in this approach. The novel system architecture used for analyzing the performance and the power consumption of the different applications is presented in Section IV. The application scenarios are described in Section V. In Section VI the application integration and the results for performance and power consumption are given. Finally, the



paper is closed by presenting the conclusions and future work in Section VII.

II. RELATED WORK
Optimization of the dynamic and static power consumption is very important, especially for embedded systems, because they often use batteries as a power source.
Therefore, many researchers, like for example Meintanis et al. [2], explored the power consumption of Xilinx Virtex-II Pro, Xilinx Spartan-3 and Altera Cyclone-II FPGAs. They estimated the power consumption at design-time using the commercial tools provided by Xilinx and Altera, and further explored the differences between the measured and estimated power consumption for these FPGAs. Becker et al. [3] explored the difference between measured and estimated power consumption for the Xilinx Virtex-2000E FPGA. Furthermore, they explored the behavior of the power consumption when using dynamic reconfiguration to exchange the FPGA system at runtime.
Other works focus on the development of their own tools and models for efficient power estimation at design-time for FPGA-based systems. Poon et al. [4] present a power model to estimate the dynamic, short circuit and leakage power of island-style FPGA architectures. This power model has been integrated into the VPR CAD flow. It uses the transition density signal model [5] to determine signal activities within the FPGA. Weiss et al. [6] present an approach for design-time power estimation for the Xilinx Virtex FPGA. This estimation method works well for control flow oriented applications, but not so well for combinatorial logic. Degalahal et al. [7] present a methodology to estimate the dynamic power consumption of FPGA-based systems. They applied this methodology to explore the power consumption of the Xilinx Spartan-3 device and to compare the estimated results with the measured power consumption.
All these approaches either propose a new estimation model or tool for estimating the power consumption at design-time, or they compare their own or commercial estimation models and tools with the real measured power consumption. The focus of the investigations presented in this paper is, in contrast, to show the impact of the parameterization of IP-cores, here specifically the MicroBlaze soft processor, and not the development of power estimation tools.
The novelty of our approach is to focus on the requirements of the target application and to propose a design guideline for system developers of processor-based FPGA systems. This means providing guidance on how to design a system that achieves a good tradeoff between performance and power consumption for a target application. To develop such a guideline, the impact of the frequency, of different processor configurations and of the task distribution in a processor-based design is investigated in this paper for different application scenarios. To the best of our knowledge, similar work has not been done before.

III. TOOL FLOW FOR POWER MEASUREMENT
Xilinx provides two kinds of tools for power consumption estimation: Xilinx Power Estimator (XPE) [8] and Xilinx Power Analyzer (XPower) [9].
The XPE tool is based on an Excel spreadsheet. It receives information about the number and types of used resources via the report generated by the mapping process (MAP) of the Xilinx tool flow. Alternatively, the user can manually set the values for the number and type of used resources. The frequencies used within the design have to be set manually by the user. The advantage of this method is that results are obtained very fast. The disadvantage is that the results are not very accurate, especially for the dynamic power consumption. This is because the different toggling rates of the signals are not taken into account. The results are also less accurate because they are based on the MAP report and not on the post place and route (PAR) report, which corresponds to the system used for generating the bitstream.
The XPower tool estimates the dynamic and static power consumption for submodules, different subcategories and the whole system based on the results of a post place and route (PAR) simulation. This makes the estimation results much more accurate compared to the XPE tool, because the final placed and routed system is considered for the power estimation. Even more important, due to the simulation of the PAR system with real input data, the toggling rates of the signals can be extracted and used within the power estimation. For estimating the power consumption with the XPower tool the following input files are required:
- Native Circuit Description (NCD) file, which specifies the design resources
- Physical Constraint File (PCF), which specifies the design constraints
- Value Change Dump (VCD) file, which specifies the simulated activity rates of the signals
The NCD and the PCF file are obtained after the PAR phase of the Xilinx implementation tool flow. The VCD file is generated by simulating the PAR design with the ModelSim simulator.
Due to the higher accuracy, the XPower tool was used here. As we wanted to estimate the power consumption for systems with one or two MicroBlaze processors, the hardware and the software executables of the different systems were designed within the Xilinx Platform Studio (XPS) [10]. Figure 1 shows the flow diagram for doing power estimation with XPower for an XPS system.

[Figure 1: flow diagram of the EDK XPower flow: system design in Xilinx Platform Studio (XPS) -> synthesis (using XST) and implementation (Translate, Map, PAR) in the EDK XPS GUI environment -> post PAR timing simulation model generation (Simgen) -> timing simulation and generation of the VCD file (ModelSim) -> power estimation in XPower.]
Figure 1. Flow Diagram of the EDK XPower Flow



After the system has been designed and implemented within the XPS environment, the Simgen [10] tool is used to generate the post PAR timing simulation model of the system. This simulation model is then used to simulate the behavior of the system with the ModelSim simulator and to generate the VCD file. In the last step XPower is used to read in the VCD, the NCD and the PCF files of the design and to estimate the dynamic and static power consumption. Care has to be taken, because in a normal Xilinx implementation flow the software executables are integrated into the memories of the processors after the bitstream has been generated. When using XPower and the post PAR simulation, the memories of the processors have to be initialized in an earlier step, namely in the post PAR simulation model; otherwise the simulated system behavior and the VCD file would not be accurate.

[Figure 3: a single MicroBlaze (µB0) connected via FSLs to the Virtual-IO and a timer, and via the PLB to a UART; the Virtual-IO reaches the host PC over the PCI bus. Key: FSL: Fast Simplex Link; PLB: Processor Local Bus; UART: Universal Asynchronous Receiver Transmitter; µB: MicroBlaze; PCI: Peripheral Component Interconnect.]
Figure 3. Uni-processor system
IV. NOVEL SYSTEM ARCHITECTURE
The system structure of the dual-processor system is shown in Figure 2. Three new components have been designed and implemented: the Virtual-IO, the Bridge and the Reconfigurable Clock Unit. All three components have been integrated into a library for the XPS tool. Therefore, they can be inserted and parameterized using the graphical user interface (GUI) of the XPS tool, which makes them easily reusable within other XPS designs.

[Figure 2: the Virtual-IO connects to the host PC via the PCI bus and via FSLs to µB0 and µB1; µB0 additionally has a timer and, over the PLB, a UART. The two MicroBlazes communicate via FSLs over the Bridge, which issues reconfiguration signals to the Reconfigurable Clock Unit providing Clock0 and Clock1. Key: FSL: Fast Simplex Link; PLB: Processor Local Bus; UART: Universal Asynchronous Receiver Transmitter; µB: MicroBlaze; PCI: Peripheral Component Interconnect.]
Figure 2. Dual processor design with three new components: Virtual-I/O, Bridge and Reconfigurable Clock Unit

The Virtual-IO receives data from the host PC and sends results back to the host PC via the PCI-bus. The Virtual-IO communicates via Fast Simplex Links (FSLs) [11] with the two MicroBlaze processors (μB0 and μB1). μB0 communicates with the user via the UART interface. It also has a timer, which is used to measure the performance of the overall system. The two processors communicate with each other via FSLs over the Bridge component. Depending on the fill level of the FIFOs within the Bridge, reconfiguration signals are sent to the Reconfigurable Clock Unit, which reconfigures the clocks of the two processors based on these signals. For the uni-processor system, which is used for comparison, the Bridge, the Reconfigurable Clock Unit, μB1 and their connections were removed, as shown in Figure 3.
The following subsections explain the new components and their features in more detail.

A. Virtual-IO
The Virtual-IO component is used to communicate with the host PC via the PCI-bus. It provides an input and an output port to the PCI-bus and one input and one output port for each MicroBlaze processor. It consists of two FIFOs, one for the incoming and one for the outgoing data of the PCI-bus. Each FIFO is controlled by a Finite State Machine (FSM), as shown in Figure 4.

[Figure 4: two FSM-controlled FIFOs, one from the local bus to the μBs and one from the μBs back to the local bus.]
Figure 4. Virtual-IO component

The Virtual-IO is a wrapper around 6 different modules. The first module is Virtual-IO 1, which sends data first to μB0 and then to μB1. It then receives the calculated results in the same order. The second module is Virtual-IO 2, which sends data only to μB0. Results are only received over μB1. Therefore, μB0 sends its results to μB1, which then sends the results of μB0 together with its own results back to the Virtual-IO 2. The third module is Virtual-IO 3, which first sends data to μB0. Afterwards, it sends the same data in parallel to both processors μB0 and μB1. Finally, it sends some data only to μB1. After the execution of the processors, first μB0 and then μB1 send their results back to the Virtual-IO 3. The fourth module is Virtual-IO 4, which is only connected to one of the processors, e.g. μB0. Due to this, this module is used in all uni-processor designs. For a dual-processor design it sends data to μB0, which then forwards parts of the data to μB1. After execution, μB1 sends its results back to μB0, which forwards the results of both processors to the Virtual-IO 4. The fifth module is Virtual-IO 5, which sends the same data to both processors in parallel, but receives the results only via μB0. The sixth module is Virtual-IO 6. It is very similar to



Virtual-IO 5. The only difference is that it receives the calculation results from μB1 instead of μB0.
The modules can be selected in the XPS GUI via the parameters of the Virtual-IO component. Other parameters that can be set by the user are: the number of input and output words for each processor separately, the number of common input words and the size of the image (only for image processing applications).

B. Bridge
The Bridge module is used for the inter-processor communication. It consists of two asynchronous FIFOs controlled by FSMs, to support communication across the two different clock domains of the processors, as shown in Figure 5. The Bridge component monitors the fill level of the two FIFOs. If one FIFO is nearly full, it is assumed that the processor which reads from this FIFO is too slow. As a result, a reconfiguration signal to increase the clock rate of this processor is sent to the Reconfigurable Clock Unit.

[Figure 5: two FSM-controlled FIFOs crossing clock domain 0 (μB0) and clock domain 1 (μB1); each direction can raise a reconfiguration signal for the clock of the other domain.]
Figure 5. Internal structure of the Bridge

C. Reconfigurable Clock Unit
The internal structure of the Reconfigurable Clock Unit is shown in Figure 6. It consists of two Digital Clock Managers (DCMs) [12], two Clock Buffer Multiplexer primitives (BUFGMUXes) [13] and the Logic component, which controls the reconfiguration of the DCMs.

[Figure 6: the Logic component receives the reconfiguration signals and controls two DCMs fed by CLK IN; Clock 0 drives μB0 and Clock 1 drives μB1.]
Figure 6. Internal structure of the Reconfigurable Clock Unit

The Logic component receives the reconfiguration signals of the Bridge component. It then starts the reconfiguration of the DCM primitive for the slower processor. For the reconfiguration, the specific ports provided by Xilinx for dynamic reconfiguration of the Virtex-4 DCM primitive are used. During the reconfiguration process the DCM has to be kept in a reset state for a minimum of 200 ms. During this time interval the outputs of this DCM are not stable and cannot be used. Instead of stalling the corresponding processor, the BUFGMUX primitive is used to provide CLK_IN, the original input clock of the two DCMs, to the processor whose DCM is under reconfiguration. The BUFGMUX is a special clock multiplexer primitive, which assures that no glitches occur when switching to a different clock. After the configuration of the DCM is finished, the BUFGMUX is used to switch back to the DCM clock. An alternative would be to stall the processor while its clock is being reconfigured. Because 200 ms is quite a long time, especially for image processing applications, where every 40 ms a new input frame is received from a camera, this would result in a loss of input data.
To prevent an oscillation, the controller logic stops increasing the clock frequency if 125 MHz has been reached for this MicroBlaze, which is the maximum frequency supported by the MicroBlaze and its peripherals, or if its clock frequency has been increased three consecutive times. If the reconfiguration signal is then still asserted, meaning the processor is still too slow, the DCM of the faster processor is reconfigured to provide a slower clock to the faster processor.
Alternatively, instead of dynamically reconfiguring the DCM, different output ports of a DCM could be used to generate different clocks, and the desired clock could be selected using several BUFGMUXes. The advantage would be a faster switch between different clocks; the drawback is that not as many different clocks are possible as with dynamic reconfiguration. This will be investigated in future work.

V. APPLICATION SCENARIOS
Three different application scenarios were used to explore the impact of the processor configurations, the task distribution and the dynamic clock frequency scaling on the power consumption of FPGA-based processor systems. The three algorithms are described in detail in the next subsections. The first algorithm is the well known sorting algorithm Quicksort [14]. It consists of a lot of branches and comparisons. The second algorithm is an image processing algorithm called Normalized Squared Correlation (NCC), which consists of many arithmetic operations, e.g. multiply and divide. The third algorithm is a variation of a bioinformatics algorithm called DIALIGN [15], which consists of many comparisons, additions and subtractions. These algorithms, with their different requirements, e.g. branches, comparators, multiply & divide, add & subtract, were used to derive a user guideline for designing a system with a good performance per power tradeoff for a specific application. By comparing the algorithm requirements of a new application with the three example algorithms, the system configuration of the most similar example algorithm is chosen as a starting system. Such a guideline to limit the design space is very important to save time and achieve a shorter time-to-market, because the simulation and the power estimation with XPower are very time-consuming. Also, the bitstream generation needed to measure the performance of the application on the target hardware architecture is time-consuming. These long design times can be shortened by starting with an appropriate design, e.g. the right processor configurations, a good task distribution and a well selected execution frequency.



A. Sorting Algorithm: Quicksort
Quicksort [14] is a well known sorting algorithm with a divide and conquer strategy. It sorts a list by recursively partitioning the list around a pivot and sorting the resulting sublists. It has an average complexity of O(n log n).

B. Image Processing Algorithm: Normalized Squared Correlation
2D Normalized Squared Correlation (NCC) is often used to identify an object within an image. The evaluated expression is shown in equation (1).

          [ Sum_{i=0..n} Sum_{j=0..m} (A_p(i,j) - mean(A_p)) * (T(i,j) - mean(T)) ]^2
  C(p) = -------------------------------------------------------------------------------------------------   (1)
          Sum_{i=0..n} Sum_{j=0..m} (A_p(i,j) - mean(A_p))^2 * Sum_{i=0..n} Sum_{j=0..m} (T(i,j) - mean(T))^2

  T: template image with n rows and m columns
  A_p: sub-window of the search region with n rows and m columns
  mean(T): mean of T
  mean(A_p): mean of A_p

This algorithm uses a template T of the object to be searched for and moves this template over the search region A of the image. A_p, the sub-window of the search region at point p with the same size as T, is then correlated with T. The result of this expression is stored at point p in the result image C. The more similar A_p and T are, the higher is the result of the correlation. If they are equal, the result is 1. The object is then detected at the location with the highest value.

C. Bioinformatic Algorithm: DIALIGN
DIALIGN [15] is a bioinformatics algorithm, which is used for comparing the alignment of two genomic sequences. It produces the alignment with the highest number of similar elements and therefore the highest score, as shown in Figure 7.

  Sequence a:   A T G A G C A G      --DIALIGN-->     A T G A G - C A G
  Sequence b: C A T G A G T C A G                   C A T G A G T C A G
Figure 7. Alignment of two sequences a and b with DIALIGN.

VI. INTEGRATION AND RESULTS
For the power consumption estimation and the performance measurement a Xilinx Virtex-4 FX 100 FPGA was used. The performance was measured on the corresponding FPGA board from Alpha Data [16]. As measuring the exact power consumption of the FPGA on this board is not possible, it was estimated at design-time using the XPower tool flow as described in Section III. The impact of the clock frequency, the configuration of the processor and the task distribution on the power consumption and the performance of the system has been explored, and the results are presented in the following subsections. For each exploration some parameters had to be kept fixed to assure a fair comparison. For the exploration of the impact of the clock frequency, the algorithm and the processor configuration were kept fixed. For the exploration of the impact of the configuration of the processor, the clock frequency was kept fixed. Finally, for the exploration of the task distribution, the processor configuration and the performance were kept fixed, the goal being to lower the overall system power consumption while maintaining a performance similar to that of a reference uni-processor design running at 100 MHz, which is a standard frequency for Virtex-4 based MicroBlaze systems.

A. Impact of the clock frequency
First of all, the impact of the variation of the clock frequency on the power consumption was explored for a uni-processor system, which executes the NCC algorithm on one MicroBlaze. The MicroBlaze was configured to use a 5-stage pipeline and no arithmetic unit. The results for the dynamic and quiescent power consumption of the core and the other components, as well as the total power consumption of the system, are given in TABLE I. The quiescent power consumption is also called static power consumption in the following, because it represents the power consumption of the user-configured FPGA without any switching activity.
The impact of the clock frequency on the static and the dynamic power consumption is presented in Figure 8 and Figure 9 respectively. As can be seen, the static power consumption increases by around 0,24 mW/MHz, while the dynamic power consumption increases by around 3,26 mW/MHz. From these results follows the impact on the total power consumption, which is around 3,5 mW/MHz. The impact on the total power consumption as well as on the performance is shown in Figure 10.

[Figure 8: core quiescent power vs. clock frequency (40-110 MHz) for the uni-processor NCC design; roughly linear with a slope of about 0,24 mW/MHz.]
Figure 8. Impact of the clock frequency onto the static power consumption of a uni-processor design.

[Figure 9: core dynamic power vs. clock frequency for the same design; roughly linear with a slope of about 3,26 mW/MHz.]
Figure 9. Impact of the clock frequency onto the dynamic power consumption of a uni-processor design.

[Figure 10: total power (rising at about 3,5 mW/MHz) and execution time vs. clock frequency, 50-100 MHz.]
Figure 10. Impact of the clock frequency onto the total power consumption and onto the execution time of a uni-processor design executing the NCC algorithm.



TABLE I. IMPACT OF THE VARIATION OF THE CLOCK FREQUENCY ONTO THE POWER CONSUMPTION

Clk Freq. (MHz) PCoreDynamic(mW) POthersDynamic(mW) PCoreQuiescent(mW) POthersQuiescent(mW) PTotal(mW) PTotal(%)

50 180 26 453 641 1298 - 11,9


60 232 26 457 641 1355 - 8,0
70 243 26 458 641 1367 - 7,2
80 273 26 460 641 1398 - 5,1
90 301 26 462 641 1428 - 3,1
100 343 26 465 641 1473 NA
B. Impact of the processor configurations
For this exploration, a uni-processor design consisting of a single MicroBlaze running at 100 MHz was used. The results were compared against a reference configuration, which was a MicroBlaze with a 5-stage pipeline and no arithmetic unit (integer divider and barrel shifter). The following configurations were explored:
i. adding an arithmetic unit (AU)
ii. reduction of the pipeline to 3 stages (RP)
iii. combination of i and ii (AU+RP)
The impact on the power consumption and the performance was explored for all three algorithms. The impact is very different for the different applications, due to the different algorithm requirements, as mentioned in Section V and its subsections.
Figure 11 and TABLE II show the impact of the different configurations for the Quicksort algorithm. Due to the multiple branches in the algorithm, a reduction of the pipeline stages is very beneficial in terms of execution time and power consumption. The addition of the arithmetic unit only provides a minimal improvement in terms of performance, but with a stronger degradation of the power consumption. Depending on the performance and power consumption constraints, either the system with AU+RP or the RP system would be chosen.

[Figure 11: relative change of total power, core dynamic power, core static power and execution time for Quicksort at 100 MHz under the AU, RP and AU+RP configurations, referenced to the system with 5-stage pipeline and no AU.]
Figure 11. Impact of the MicroBlaze configurations for the Quicksort algorithm.

Figure 12 and TABLE III show the impact of the different configurations for the NCC algorithm. As this algorithm requires many arithmetic operations, the addition of an AU improves the overall execution time, while the reduction of the pipeline stages results in a strong degradation (over 50%). This degradation is due to the fact that the execution of arithmetic operations takes more clock cycles if the pipeline is reduced. Therefore, for this and similar algorithms a system with an AU and a 5-stage pipeline would be optimal from a performance perspective. If the power consumption needs to be reduced and some performance degradation is acceptable, then the reference system or the AU+RP system would be a good choice.

[Figure 12: relative change of total power, core dynamic power, core static power and execution time for NCC at 100 MHz under the AU, RP and AU+RP configurations, referenced to the system with 5-stage pipeline and no AU.]
Figure 12. Impact of the MicroBlaze configurations for the NCC algorithm.

In Figure 13 and TABLE IV the impact of the three different processor configurations on the performance and power consumption, compared to the reference system, is presented for the DIALIGN algorithm. Adding an AU improves the execution time only a little, while increasing the overall power consumption compared to the reference design. The reduction of the pipeline to 3 stages improves the total power consumption by 6,8%, but worsens the execution time by 25%. The combination AU+RP shows nearly the same impact as the RP system. Therefore, the reference system is the best choice if performance is the most important factor. If, on the other hand, the power consumption is more important, then the RP system would be a good choice for these kinds of algorithms.

[Figure 13: relative change of total power, core dynamic power, core static power and execution time for DIALIGN at 100 MHz under the AU, RP and AU+RP configurations, referenced to the system with 5-stage pipeline and no AU.]
Figure 13. Impact of the MicroBlaze configurations for the DIALIGN algorithm.



TABLE II. IMPACT OF THE MICROBLAZE CONFIGURATIONS FOR THE QUICKSORT ALGORITHM AT 100 MHZ

μB Parameter PCoreDynamic(mW) POthersDynamic(mW) PCoreStatic(mW) POthersStatic(mW) PTotal(mW) PTotal(%) Time (ms)


Default 438,33 26,13 472,91 639,56 1576,93 NA 18,42
Arithmetic Unit 493,26 26,14 477,20 639,58 1636,18 + 3,76 18,10
3-stage Pipeline 354,23 26,13 466,41 639,56 1486,33 - 5,75 17,21
Both 372,84 26,13 467,84 639,58 1506,39 - 4,47 16,89

TABLE III. IMPACT OF THE MICROBLAZE CONFIGURATIONS FOR THE NCC ALGORITHM AT 100 MHZ

μB Parameter PCoreDynamic(mW) POthersDynamic(mW) PCoreStatic(mW) POthersStatic(mW) PTotal(mW) PTotal(%) Time (ms)


Default 341,68 26,09 465,44 639,57 1472,78 NA 67,74
Arithmetic Unit 366,28 26,13 467,33 639,57 1499,31 + 1,80 53,62
3-stage Pipeline 269,63 26,10 459,97 639,57 1395,27 - 5,26 103,64
Both 269,40 26,12 459,95 639,58 1395,05 - 5,28 88,84

TABLE IV. IMPACT OF THE MICROBLAZE CONFIGURATIONS FOR THE DIALIGN ALGORITHM AT 100 MHZ

μB Parameter PCoreDynamic(mW) POthersDynamic(mW) PCoreStatic(mW) POthersStatic(mW) PTotal(mW) PTotal(%) Time (μs)


Default 431,67 26,06 472,38 639,58 1569,69 NA 786,48
Arithmetic Unit 464,39 26,06 474,93 639,59 1604,97 + 2,25 777,64
3-stage Pipeline 343,01 26,05 465,54 639,59 1474,19 - 6,08 982,85
Both 355,88 26,06 466,53 639,58 1488,05 - 5,20 988,05

TABLE V. QUICKSORT POWER CONSUMPTION

Uni-Processor (100MHz) Dual_2 (80/50 MHz) Dual_5 (95 MHz)


Execution Time - ms 18,42 18,80 19,27
Core (dyn/stat)_Power - mW 438,33 / 472,91 295,89 / 461,95 384,34 / 468,72
Total Power - mW 1576,93 1475,56 1570,79
Total Power - % NA - 6,43 - 0,39

TABLE VI. NCC POWER CONSUMPTION

Uni-Processor (100MHz) Dual_3 (54 MHz) Dual_2 (87,5/50 MHz)


Execution Time - ms 67,74 67,28 67,62
Core (dyn/stat)_Power - mW 341,68 / 465,44 297,39 / 462,07 322,32 / 463,96
Total Power - mW 1472,78 1477,20 1504,02
Total Power - % NA + 0,30 + 2,12

TABLE VII. DIALIGN POWER CONSUMPTION

                           | Uni-Processor (100 MHz) | Dual_5 (50 MHz) | Dual_6 (50 MHz)
Execution Time (ms)        | 30.21                   | 30.16           | 30.16
Core (dyn/stat) Power (mW) | 431.67 / 472.38         | 440.80 / 473.09 | 352.45 / 466.27
Total Power (mW)           | 1569.69                 | 1631.62         | 1536.44
ΔTotal Power (%)           | NA                      | +3.95           | -2.12
C. Impact of the task distribution and the frequency scaling

To measure the impact on the power consumption, the algorithms were partitioned onto two MicroBlaze processors. The frequencies of the two processors were chosen such that the execution time of the dual-processor design was as similar as possible to the reference system consisting of a single MicroBlaze running at 100 MHz. For all systems the configurations of the processors were fixed to a 5-stage pipeline and no arithmetic unit.

TABLE V. shows the results for distributing the Quicksort algorithm on two processors instead of one. Two partitions were evaluated. The first one is called Dual_2 (80/50 MHz), which means that Virtual-IO 2 was used and μB0 was running at 80 MHz while μB1 was running at 50 MHz. The algorithm was partitioned such that μB0 receives the whole data to be sorted. It then divides the data into two parts and sends the second part to μB1. Both then sort their partition. μB0 forwards its sorted part of the list to μB1, which sends the final combined sorted list via Virtual-IO 2 to the host PC. With this partition the overall power consumption could be reduced by 6.43% compared to the single-processor reference system.

The second partition, called Dual_5 (95 MHz), uses Virtual-IO 5 to send the incoming data to both processors running at 95 MHz. μB0 searches the list for elements smaller and μB1 searches the list for elements bigger than the pivot. When one has found an element, the position of this element is sent to the other processor. Both processors then update their lists by swapping their own found element with the one the other processor has found. At the end both processors have as a

result a searched list. μB0 then sends its resulting list back to the host PC via Virtual-IO 5. The power consumption of this version is nearly the same as the reference system, while the total execution time increases.

TABLE VI. shows the result for the partitioning of the NCC algorithm onto two processors. The first partitioning uses Virtual-IO 3 to partition the incoming image into two overlapping tiles, one for each processor. The overlapping part is sent to both processors simultaneously. As the NCC is a window-based image processing algorithm, the border pixels between the two tiles are needed by both processors. Each of the processors runs at 54 MHz, which results in a similar execution time, and also in a similar total power consumption as the reference design.

The second partition, called Dual_2 (87.5/50 MHz), uses Virtual-IO 2 to send the whole image to μB0. μB0 runs at 87.5 MHz and calculates the complete numerator and the denominator. Then it forwards both to μB1, which does the division and sends the results back to Virtual-IO 2. μB1 runs at 50 MHz. While the execution time is nearly the same, the overall power consumption is increased slightly by 2.12%.

TABLE VII. shows the result for executing the DIALIGN algorithm with two processors. Two partitions were evaluated. The first one is called Dual_5 (50 MHz) and uses Virtual-IO 5 to send the incoming sequences to both processors running at 50 MHz. Each processor calculates half of the resulting score matrix. μB0 calculates in a row-based fashion all values above the main diagonal. μB1 calculates in a column-based fashion all values below the main diagonal. The scores on the main diagonal are calculated by both processors. After μB0 has finished calculating one row and μB1 one column respectively, they exchange the first score nearest to the main diagonal, as this score is needed by both processors for calculating the next row or column, respectively. While the execution time is nearly the same, the overall power consumption is increased by 3.95%.

The second partition is called Dual_6 (50 MHz). It uses Virtual-IO 6 to send the sequences to the processors, which both run at 50 MHz. Here a systolic array approach is used for executing the DIALIGN algorithm. μB1 then sends the final alignment and the score back to the host PC. With this partition the overall power consumption could be reduced by 2.12% compared to the single-processor reference system.

VII. CONCLUSIONS AND OUTLOOK

This paper reports the research and evaluation of different microprocessor parameterizations and of application and data partitioning on a dual-processor system. The results of the experiments show the impact of the different parameterizations on the power consumption and performance for a set of selected applications. Depending on the application type it can be seen that different parameter configurations, e.g. the configuration of the processors and their frequencies, but also a good application partitioning, are essential for achieving an efficient tradeoff between performance and power constraints. The results can be used to guide developers as to which parameter set suits a certain application scenario. The vision is that more application scenarios will be analyzed in order to provide a broad overview of the parameter impact. It is envisioned to extend existing hardware benchmarks from different application domains in terms of a parameterization guideline, also for further FPGA series from Xilinx.

Furthermore, the paper provides a tutorial for the estimation of the power consumption on a high level of abstraction, but with a high accuracy through post place and route simulation. Therefore, other research in this area can be done and exchanged in the community.

ACKNOWLEDGMENT

The authors would like to thank Prof. Alba Cristina M. A. de Melo and Jan Mendonca Correa for providing us with their C code implementation of the DIALIGN algorithm.

REFERENCES

[1] “Xilinx MicroBlaze Reference Guide”, UG081 (v7.0), September 15, 2006, available at: http://www.xilinx.com.
[2] D. Meintanis, I. Papaefstathiou: “Power Consumption Estimations vs Measurements for FPGA-based Security Cores”; International Conference on Reconfigurable Computing and FPGAs 2008 (ReConFig 2008), Cancun, Mexico, December 2008.
[3] J. Becker, M. Huebner, M. Ullmann: “Power Estimation and Power Measurement of Xilinx Virtex FPGAs: Trade-offs and Limitations”; In Proc. of the 16th Symposium on Integrated Circuits and Systems Design (SBCCI’03), Sao Paulo, Brazil, September 2003.
[4] K. Poon, A. Yan, S.J.E. Wilton: “A Flexible Power Model for FPGAs”; In Proc. of the 12th International Conference on Field-Programmable Logic and Applications (FPL 2002), September 2002.
[5] F. Najm: “Transition density: a new measure of activity in digital circuits”; IEEE Transactions on Computer-Aided Design, vol. 12, no. 2, pp. 310-323, February 1993.
[6] K. Weiss, C. Oetker, I. Katchan, T. Steckstor, W. Rosenstiel: “Power estimation approach for SRAM-based FPGAs”; In Proc. of the International Symposium on Field Programmable Gate Arrays (FPGA’00), pp. 195-202, Monterey, CA, USA, 2000.
[7] V. Degalahal, T. Tuan: “Methodology for high level estimation of FPGA power consumption”; In Proc. of the ASP-DAC 2005 Conference, Shanghai, January 2005.
[8] “Xilinx Power Estimator User Guide”, UG440 (v3.0), June 24, 2009, available at: http://www.xilinx.com.
[9] “Development System Reference Guide”, v9.2i, Chapter 10, XPower, available at: http://www.xilinx.com.
[10] “Embedded System Tools Reference Manual”, Embedded Development Kit, EDK 9.2i, UG111 (v9.2i), September 05, 2007, Chapter 3, available at: http://www.xilinx.com.
[11] “Fast Simplex Link (FSL) Bus (v2.00a)”; DS449, December 1, 2005, available at: http://www.xilinx.com.
[12] “Virtex-4 FPGA Configuration User Guide”, UG071 (v1.11), June 9, 2009, available at: http://www.xilinx.com.
[13] “Virtex-4 FPGA User Guide”, UG070 (v2.6), December 1, 2008, available at: http://www.xilinx.com.
[14] C. A. R. Hoare: “Quicksort”; Computer Journal, vol. 5, no. 1, pp. 10-15, 1962.
[15] A. Boukerche, J. M. Correa, A. C. M. A. de Melo, R. P. Jacobi: “A Hardware Accelerator for Fast Retrieval of DIALIGN Biological Sequence Alignments in Linear Space”; IEEE Transactions on Computers, vol. 59, no. 6, pp. 808-821, 2010.
[16] Alpha-Data: http://www.alpha-data.com



Novel Approach for Modeling Very Dynamic and Flexible Real Time Applications

Ismail Ktata (1,2), Fakhreddine Ghaffari (1), Bertrand Granado (1) and Mohamed Abid (2)

(1) ETIS Laboratory, CNRS UMR8051, University of Cergy-Pontoise, ENSEA, 6 avenue du Ponceau, F95000 Cergy-Pontoise, France. Email: {firstname.name}@ensea.fr
(2) Computer & Embedded Systems Laboratory (CES), National School of Engineers of Sfax (ENIS), B.P.W. 3038 Sfax, Tunisia. Email: {firstname.name}@enis.rnu.tn

Abstract

Modeling techniques are used to solve a variety of practical problems related to processing and scheduling in several domains like manufacturing and embedded systems. In such flexible and dynamic environments, there are complex constraints and a variety of unexpected disruptions. Hence, scheduling remains a complex and time-consuming process, and the need for accurate models for optimizing this process is increasing. This paper deals with dynamically reconfigurable architectures, which are designed to support complex and flexible applications. It focuses on defining a solution approach for modeling such applications where there is significant uncertainty in the duration, resource requirements and number of tasks to be executed.

Keywords: Dynamically Reconfigurable Architecture, uncertainty, scheduling, modeling methodologies, DFG.

I. Introduction

Today, integrated silicon applications are more and more complex. Moreover, in spite of its performance, ASIC development is still long and very expensive, and provides inefficient solutions for many applications which are composed of several heterogeneous tasks with different characteristics. In addition, the growing complexity of real-time applications presents important challenges, in great part due to their dynamic behavior and the uncertainties which can arise at runtime [1]. To overcome these problems, designers tend to use dynamically reconfigurable architectures (DRA). The development of the latter opens new horizons in the field of architecture design. Indeed, DRAs are well suited to deal with the dynamism of new applications and allow a better compromise between cost, flexibility and performance [2]. In particular, fine-grained dynamically reconfigurable architectures (FGDRA), as a kind of DRA, can be adapted to any application more optimally than coarse-grained DRAs. This feature makes them an interesting solution today when it comes to handling computational tasks in a highly constrained context. However, this type of architecture makes application design very complex [3], especially with the lack of suitable and efficient tools. This complexity could be abstracted at some level in two ways: at design time by providing design tools, and at run time by providing an operating system that abstracts the lower level of the system [4]. Moreover, such an architecture requires the presence of an appropriate operating system that can manage new tasks at run time and under different constraints. To effectively manage dynamic applications, this operating system has to be able to respond rapidly to events. This can be achieved by providing a suitable scheduling approach and dedicated services like hardware preemption that decrease configuration and context transfer times. To realize an efficient schedule of an application, this operating system needs to know the behavior of the application, in particular the parts where the dynamicity can be exploited on a DRA.

In this paper, we are interested in the modeling of applications that could be executed on a dynamically reconfigurable architecture. This kind of application is characterized, in addition to its real-time constraints, by several types of flexibility. The purpose is to improve the performance of the modeling techniques, which facilitates the design of an efficient scheduling approach.

The remainder of this paper is structured as follows: in Section 2, a brief review is given of the context and the related work on modeling techniques used in different domains. Section 3 describes our new technique for modeling applications targeted to DRAs. Section 4 reports a description of the proposed modeling method and comparisons with other models, while the last Section draws conclusions.

II. Context and problem definition

Today, embedded systems are more and more used in several domains: automobiles, robots, planes, satellites, boats, industrial control systems, etc. An important feature of these systems is to be reactive. A reactive system is a system that continuously reacts to its environment at a rate imposed by this environment itself. It receives inputs from the environment, responds to these stimuli by performing a number of operations, and produces the outputs used by the environment, known as reactions. Dynamically reconfigurable architectures are an interesting solution for this type of application. Due to



that emerging range of applications with dynamic behavior, dynamic scheduling for reconfigurable system-on-chip (RSoC) platforms has become an important field of research [2].

A. Problematics

This paper deals with constraint-based scheduling for real-time applications executed on FGDRAs. In particular, we focus on two major problems:
• The modeling of the application, which should exhibit its dynamic aspects and must allow the expression of its constraints, in particular real-time constraints.
• The run-time performance of the scheduling algorithm, whose overhead must be reasonable for a typical application.
The different components of a scheduling problem are the tasks, the potential constraints, the resources and the objective function. Task execution must be programmed to optimize a specific objective under the consideration of several criteria. Many resolution strategies have been proposed in the literature [5]. Usually these methods assume that processing times can be modeled with deterministic values. They use a predictive schedule that gives an explicit idea of what should be done. Unfortunately, in real environments, the probability of a pre-computed schedule being executed exactly as planned is low [6]. This is not only because of variations, but also because a lot of the data are only previsions or estimations. It is then necessary to deal with uncertainty or flexibility in the process data. Hence, for an effective resolution, we need to make a significant reformulation of the problem and the solving methods in order to facilitate the incorporation of this uncertainty and imprecision in scheduling [7].

Uncertainty in scheduling may arise from many sources [8]:
• The release time of tasks can be variable, even unexpected.
• New unexpected tasks may occur.
• Cancellation or modification of existing tasks.
• The execution order of tasks on resources can be changed.
• Resources may become unavailable.
• Task assignments: if a task could be done on different resources (identical or not), the choice of this resource can be changed. This flexibility is necessary if such a resource becomes unusable or less usable than others.
• The ability to change execution mode: this includes allowing or disallowing preemption, whether a task can be resumed or not, the overlap between tasks, changing the range of a job, taking a preparation time into account or not, changing the number of resources needed for a task, etc.
We consider real cases where some variations can occur and some data may change with respect to the forecast. The model has to be insensitive as far as possible to data uncertainties and variations, and flexible enough to be adaptable to the possible disturbances.

B. Scheduling under uncertainty

In general, there are two main approaches to dealing with uncertainty in a scheduling environment, according to the phases in which uncertainties are taken into account [8]:
• A proactive scheduling approach aims to build a robust baseline schedule that is protected as much as possible against disruptions during schedule execution. It takes uncertainties into account only in the design phase (off-line). Hence, it constructs a predictive schedule based on statistical and estimated values for all parameters, thus implicitly assuming that this schedule will be executed exactly as planned. However, this could become infeasible during execution due to the dynamic environment, where unexpected events continually occur. Therefore, in this case, a reactive approach may be more appropriate.
• Instead of anticipating future uncertainties, reactive scheduling takes decisions in real time when unexpected events occur. A reference deterministic schedule, determined off-line, is sometimes used and re-optimized. In general, reactive methods may be more appropriate for high degrees of uncertainty, or when information about the uncertainty is not available.
A combination of the advantages of both preceding approaches is called proactive-reactive scheduling. This hybrid method combines a proactive strategy for generating a protected baseline schedule with a reactive strategy to resolve the schedule infeasibilities caused by the disturbances that occur during schedule execution. Hence, this scheduling/rescheduling method makes it possible to take uncertainties into account throughout the execution process and ensures better performance [9] [10]. For rescheduling, the literature provides two main strategies: schedule repair and complete rescheduling. The first strategy is the most used, as it takes less time and preserves the system stability [11].

Scheduling techniques are quite different depending on the nature of the problem and the type of disturbance considered: resource failures, duration variations, the fact that new tasks can occur, etc. The mainly used methods are dispatching rules, heuristics, metaheuristics and artificial intelligence techniques [12]. In [13] the authors considered a scheduling problem where some tasks (called "uncertain tasks") may need to be repeated several times to satisfy the design criteria. They used an optimization methodology based on stochastic dynamic programming. In [14] and [15] a scheduling problem with uncertain resource availabilities was encountered. The authors used proactive-reactive strategies and heuristic techniques. Another uncertainty case, uncertain task durations, has been studied in [16] and [17]. The authors discuss properties of robust schedules, and develop exact and heuristic solution approaches.

C. Scheduling model

To study a system we need models to describe it, including significant system characteristics of geometry, information and dynamism. The latter is a crucial system characteristic, as it permits representing how a system behaves and changes states over time. Moreover, dynamic modeling can cover different domains from the very general to the very specific [18]. Model types have different presentations, as shown in figure 1: some are text-based using symbols, while others have associated diagrams.
• Graphical models use a diagram technique with named symbols that represent processes, lines that connect the symbols and represent relationships, and various other graphical notations to represent constraints (figure 1(a), (b), (d)).
• Textual models typically use standardized keywords accompanied by parameters (figure 1(c)).
In addition, some models have a static form, whereas others have natural dynamics during model execution, as in figure 1(a). The solid circle (a token) moves through the network and represents the executional behavior of the application.



Figure 1. Four types of dynamic system models. (a) Petri net. (b) Finite state machine. (c) Ordinary differential equation. (d) Functional block model.

In the domain of embedded systems, a large number of modeling languages have been proposed [19], [20], [21], including extensions to finite state machines, data flow graphs, communicating processes, and Petri nets, among others. In this section we present the main models of computation for real-time applications reported in the literature.

• Finite State Machines
The classical Finite State Machine (FSM) representation is probably the most well-known model used for describing control systems. However, one of the disadvantages of FSMs is the exponential growth of the number of states that have to be explicitly captured in the model as the system complexity increases, making the model increasingly difficult to visualize and analyze [22]. For dynamic systems, the FSM representation is not appropriate, because the only way to model such systems is to create all the states that represent the dynamic behavior of the application. It is then unthinkable to use it, as the number of states could be prohibitive.

• Data-Flow Graph
A data-flow graph (DFG) is a set of compute nodes connected by directed links representing the flow of data. It is very popular for modeling data-dominated systems. It is represented by a directed graph whose nodes describe the processing and whose arcs represent the partial order followed by the data. However, the conventional model is inadequate for representing the control unit of systems [23]. It provides no information about the ordering of processes. It is therefore inappropriate for modeling dynamic applications.

• Petri Net
A Petri net (PN) is a modeling formalism which combines a well-defined mathematical theory with a graphical representation of the dynamic behavior of systems [18]. A Petri net is a 5-tuple PN = (P, T, F, W, M0) where: P is a finite set of places, which represent the status of the system before or after the execution of a transition; T is a finite set of transitions, which represent tasks; F is a set of arcs (flow relation); W: F → {1, 2, 3, ...} is a weight function; M0 is the initial marking. However, though the Petri net is well established for the design of static systems, it lacks support for dynamically modifiable systems [24]. In fact, the PN structure presents only the static properties of a system, while the dynamic ones result from PN execution, which requires the use of tokens or markings (denoted by dots) associated with places [25]. The conventional model falls short for the specification of complex systems: it lacks the notion of time, which is an essential factor in embedded applications, and it lacks hierarchical composition [26]. Therefore, several formalisms have independently been proposed in different contexts in order to overcome the problems cited above, such as introducing the concepts of hierarchy, time, and valued tokens. Timed PNs are those with places or transitions that have time durations in their activities. Stochastic PNs include the ability to model randomness in a situation, and also allow for time as an element in the PN. Colored PNs allow the user and designer to witness the changes in places and transitions through the application of color-specific tokens, and movement through the system can be represented through the changes in colors [18].

Most of the methodologies mentioned provide insufficient support for systems which include variable dynamic features. Dynamic creation of tasks, for instance, is not supported by most of the systems mentioned above [26]. In [26], the authors proposed an extension of the high-level Petri net model [27] in order to capture dynamically modifiable embedded systems. They coupled that model with graph transformation techniques and used a double pushout approach, which consists of the replacement of a Petri net by another Petri net after the firing of transitions. This approach allows modeling dynamic task creation, but not variable execution times nor a variable number of needed resources.

III. Proposed method

Before beginning to describe our modeling method, we define the constraints that typically appear in dynamic systems. In our case, we consider a firm real-time context. In fact, for currently developed applications, especially multimedia and control applications, tardiness or deadline violations result only in a degradation of Quality of Service (QoS) without affecting correct processing. In this context, hardware tasks are characterized by the following parameters:
- Execution time (Ci),
- Deadline (Di),
- Periodicity (Pi),
- Precedence constraints among tasks,
- Tasks could be preempted,
- Used resources of the DRA.
In the real world, all characteristics may change: tasks (the release time, deadline and execution time), as well as the availability of resources. There are several types of changes, like uncertainty, unexpected changes and value variations. For our study, we will consider three cases of dynamic scheduling problems for dynamic applications:
(a) The number of tasks is not fixed. It may change from one iteration to another.
(b) The task execution times may change too.
(c) The number of resources needed for task execution is variable. In addition, the number of available resources may decrease after a failure occurs.
For those cases, the goal is to develop a robust scheduling method that is not very sensitive to data uncertainties and variations between theory and practice. In addition, the



schedule has to be flexible in order to be adaptable to the possible disturbances. Therefore, we consider a proactive-reactive approach. The proactive technique (a reference schedule computed offline for the static part) is used to facilitate the execution of the reactive strategy online, so that scheduling decisions have a better quality and are produced in a shorter time. To represent all those constraints in the same model and to be adequate for the adopted scheduling approach, our graph is composed of two forms of nodes. The first type of node refers to static tasks, which are known in advance and whose execution is permanent during the whole application lifecycle. The second type is for dynamic tasks, which may be executed in one cycle and not in others, and with a variable number of needed resources. Therefore, the first step is to separate the two parts of the application (static and dynamic). Then, a priority-based schedule is established offline for the static part. Such a schedule serves very important functions. The first is to allocate cells of the hardware DRA to the different hardware tasks. The second is to serve as a basis for planning tasks that may occur during execution. The basic idea is to sort the static tasks from the graph and schedule them according to a priority scheme, while respecting their precedence constraints (topological order). Tasks are executed at the earliest possible time. If there is equality between some tasks, the task with the maximum execution time has priority to be launched before the others. At runtime, the objective is to generate a new schedule that deviates from the original schedule as little as possible, so that the repair operations will be mostly simple. Therefore, the online scheduler must take into account the next tasks to be performed with their dynamic aspects (different durations, more or fewer instances to execute, more or fewer cells needed for execution). It has to prefetch the configuration contexts of new tasks into the columns that are available for execution and to find a possible way to integrate them into the current schedule without affecting performance. This schedule repair must rely on rapid algorithms based on a simple scheduling technique, so that it can be performed online with no overhead. It consists in finding a suitable partitioning of the N tasks forming the application to be executed on the M target resources of the hardware devices. In addition, the task execution order has to meet the real-time constraints. This scheduling problem is known to be NP-hard [28] [29]. In this context, heuristics are schedule repair methods which offer fast and reasonably good solutions but do not guarantee to find an optimal schedule.

We make use of the example shown in figure 3 in order to illustrate the different definitions corresponding to our model. For this example, the set {T1, T2, T3, T4, T5, T6, T7, T8} represents the static tasks, which are always executed in each period, and {T9, T10, T11} are dynamic tasks, which may be executed in some periods but not in others, and whose number of resource requirements is variable. Each task is represented with its time characteristics: Ci for the execution time and Di for the deadline.

The first dynamic feature considered in this model is tasks with variable execution time. To represent that case, we are inspired by the Program Evaluation and Review Technique (PERT) [30], which is a network model that allows randomness in activity execution times. For each task, PERT indicates a start date and an end time at the earliest and latest. The chart identifies the critical path, which determines the minimum duration of the project. Tasks are represented by arcs with an associated number presenting the task duration. Between arcs, we find circles marking events of the beginning or end of tasks (figure 2). In this model, we replace the release time feature by the execution time, as it would change over the execution, and we keep the deadline, as it will be useful for minimizing the makespan (or the length of the schedule). In figure 3, the tasks with variable execution time are {T1, T7, T8}. This is indicated by the use of an asterisk.

Figure 2. PERT graph

To be executed, hardware tasks need a minimum number of resources. The percentage of this minimal number is indicated in labels over the task nodes. This case is frequently found in some computer vision applications where a first task detects objects and then a particular processing is applied to each detected object. To execute multiple instances of this process, the minimum needed number of resources will be multiplied by the number of instances. In figure 3, task T9 will be executed n times. T9 is represented with a circled node, and the minimum needed number of resources indicated in the label will be multiplied by n. If the integer number n = 0, then the task will not occur. The instance number is unknown in the static phase and the decision will be taken online. The number n will depend on the previous executions and the actual input data to be processed (such as keypoints in the robotic vision application). Thus, n will be recalculated, after each period, based on its previous values. For more details, we take an example of execution. In each iteration the scheduler will have two values of n: np, which is a predicted value of n to be used by the scheduler for the next period, and nr, which is the real value of n for the executed task. At t = 0, let n = np = 0 (no prediction to execute T9). In the first execution, the application needs n = nr = 10 instances of T9. So, for the second iteration, the scheduler will predict to execute n = np = 10 instances of T9, while, in the real execution, n = nr = 6. For the next execution, np could be: the maximum of the preceding values, the highest values of n, the average of the last real values (n = np = 8), or a Gaussian-like probability which is a typical realistic distribution, etc. The choice will depend on the application. For the example of a moving robot which needs to predict the direction and the presence or absence of obstacles, it would be preferable to take the last real values with different weights. In our model, the number n is represented above the arcs. The arcs represent the dependencies between tasks. For static (permanent) tasks, we represent arcs with solid lines, while unpredictable dependencies are represented by dashed lines.



Figure 3. New model for dynamic application.
Figure 4. (a) A Petri net representation of the example of figure 3. (b) A DFG representation of the example of figure 3.

On the reconfigurable device, resource failures may occur, which affect resource availability. Thus, execution may or may not take place depending on the available resources compared with the needed ones. A variable, initialized to the total number of resources, indicates the amount of remaining resources. From the ready list, the algorithm determines the tasks that can be executed on the reconfigurable device. Tasks with the highest priorities are placed first until the area of the device is fully occupied.

IV. Comparison of models

When we compare with other models (section 2-C), the proposed technique presents several advantages. For an unpredicted number of occurring tasks, the data flow graph does not contain information about the number of instances. During execution of the application, every task represented by a node of the DFG is executed once in each iteration [31]. Only when all nodes have finished their execution can a new iteration start. To model this, we would need to represent n nodes of the same task, which increases the size of the model (see figure 4(b)). In our model, the number of instances of a task to be executed is denoted by the circular form of the task and the number indicated above its arc. In a PN model (see figure 4(a)), arcs could be labeled with weights, where a k-weighted arc can be interpreted as a set of k parallel arcs [32]. But, by definition (section 2-C), weights are positive integers, so a PN cannot represent a fictive arc with a non-firing transition, i.e. a task that may not be executed in some iterations. In figure 4(a), if T9 and T10 are not executed, then n should be null, which is impossible by the PN definition. In addition, to fire T8, all input places must hold at least one token, which is not possible if T10 or T11 was not executed (fired).
Timing information is useful for determining the minimum application completion time, the latest starting time of an activity that will not delay the system, and so on. We took inspiration from the PERT chart to explicitly represent the useful time features, which in our case are execution time and deadline. A Petri net, however, does not provide this type of information: the only aspect of time it captures is the partial ordering of transitions. For example, it represents variable task durations with a set of consecutive transitions for each task, which complicates the model (see figure 5). Adding timing information might provide a powerful new feature for Petri nets, but may not be possible in a manner consistent with the basic philosophy of Petri net research [33].
For resource representation, a PN models this feature by an added place with a fixed number of tokens. To begin execution, a task removes a token from the resource place and gives it back at the end of its execution. However, this model is inadequate in our case, since the number of available resources may change during execution.
Therefore, the use of conventional modeling methods is not effective in our case. In fact, with the PN and DFG models (figure 4), there is neither a distinction between static tasks and dynamic ones (which may not be executed), nor an explicit notion of time (such as the variable execution time of some tasks).
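The firing-rule argument above can be made concrete with a minimal sketch: a transition is enabled only when every input place holds a token, so if an optional task never fires, a joining transition such as T8 stays blocked. The place names and the assumption that T8 joins the outputs of T7, T10 and T11 are illustrative readings of figure 4(a), not taken verbatim from the paper.

```python
# Minimal Petri-net enabling rule (sketch). A transition is enabled iff
# all of its input places hold at least one token.

def enabled(transition, marking, inputs):
    """marking: {place: token count}; inputs: {transition: [input places]}."""
    return all(marking[p] >= 1 for p in inputs[transition])

# Assumed structure: T8 joins outputs of T7, T10 and T11 (illustrative).
inputs = {"T8": ["p_from_T7", "p_from_T10", "p_from_T11"]}

# T10 did not execute in this iteration, so its output place stays empty:
marking = {"p_from_T7": 1, "p_from_T10": 0, "p_from_T11": 1}
print(enabled("T8", marking, inputs))   # False: T8 can never fire
```

This is exactly the failure mode described in the text: a plain PN has no way to mark an arc as optional, so an unexecuted dynamic task permanently blocks its successors.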



Figure 5. PN model for variable tasks duration.

The main advantage of the proposed method is the possibility to represent several dynamic features of real-time applications with a minimum of nodes and thus in a simple formalism. Indeed, at first sight of the model, we can bring out three main characteristics:
- The distinction between the static execution (presented by the squared nodes) and the dynamic execution, i.e. the tasks whose execution and number of instances are uncertain (presented by the circled nodes),
- The tasks whose execution time is variable (presented by the asterisk),
- The percentage, for each task, of the resources needed for its execution. In each iteration, and depending on the available resources, the scheduler is able to decide which ready tasks can be executed on the device.
This information will be useful further on for the scheduler. We consider a 1D area model, as it is a commonly used reconfigurable resource model, but even a 2D area model could be considered. In the considered 1D model, tasks can be allocated anywhere along the horizontal device dimension; the vertical dimension is fixed and spans the total height of the hardware task area (see figure 6).

Figure 6. 1D area model of a reconfigurable device.

Based on the proposed task model, statically defined tasks, represented by squared nodes, will be scheduled and placed on the reconfigurable device during the proactive phase, whereas dynamically defined tasks, which are represented by circled nodes, will be scheduled online in a reactive phase. This online decision will take into account the variable parameters and try to fit the new dynamic tasks into the set of already guaranteed tasks, to delay them or to reject them.

V. Conclusion

This paper presented a particular modeling problem dealing with the implementation of dynamic and flexible applications on dynamically reconfigurable architectures. The purpose is to consider most of the dynamic features supported by the architecture and to present them in an easy and efficient way. To that end we have proposed a model, based on some features of existing modeling techniques, which is dedicated to dynamic real-time applications. The main advantage of our specification model is the possibility to obtain more exact scheduling characteristics from the representation. Those characteristics include distinguishing between static and dynamically occurring tasks, bringing out tasks whose execution time may change over time, and determining the number of resources needed for executing hardware tasks. So, we will be able either to take decisions or not. As a reconfigurable architecture, we target OLLAF [4] (Operating system enabled Low LAtency Fgdra), an original FGDRA specifically designed to enhance the efficiency of the RTOS services necessary to manage such an architecture. Future work will consist in integrating our scheduling approach among the services of an RTOS, taking into account the new possibilities offered by OLLAF.

VI. References

[1] C. Steiger, H. Walder, M. Platzner, "Operating systems for reconfigurable embedded platforms: online scheduling of real-time tasks", IEEE Transactions on Computers, Vol. 53, Issue 11, Nov. 2004, pp. 1393-1407.
[2] J. Noguera, R. M. Badia, "Multitasking on reconfigurable architectures: Microarchitecture support and dynamic scheduling", ACM Transactions on Embedded Computing Systems, Vol. 3, Issue 2, May 2004, pp. 385-406.
[3] A. Mtibaa, B. Ouni and M. Abid, "An efficient list scheduling algorithm for time placement problem", Computers and Electrical Engineering 33 (2007), pp. 285-298.
[4] S. Garcia, B. Granado, "OLLAF: a Fine Grained Dynamically Reconfigurable Architecture for OS Support", EURASIP Journal on Embedded Systems, October 2009.
[5] P. Brucker and S. Knust, "Complex Scheduling", Springer, Berlin Heidelberg, 2006.
[6] V. T'kindt and J.-C. Billaut, "Multicriteria Scheduling: Theory, Models and Algorithms", Springer-Verlag, Heidelberg, second edition, 2006.
[7] N. González, R. Vela Camino, I. González Rodríguez, "Comparative Study of Meta-heuristics for Solving Flow Shop Scheduling Problem Under Fuzziness", Second International Work-Conference on the Interplay Between Natural and Artificial Computation, IWINAC 2007, Spain, June 18-21, 2007, Proceedings Part I, pp. 548-557.
[8] A. J. Davenport and J. C. Beck, "A Survey of Techniques for Scheduling with Uncertainty", available online at http://tidel.mie.utoronto.ca/publications.php, 2000.
[9] H. Aytug, M. A. Lawley, K. McKay, S. Mohan, R. Uzsoy, "Executing production schedules in the face of uncertainties: A review and some future directions", European Journal of Operational Research 161, 2005, pp. 86-110.
[10] W. Herroelen and R. Leus, "Project scheduling under uncertainty: Survey and research potentials", European Journal of Operational Research, Vol. 165(2), 2005, pp. 289-306.



[11] G. E. Vieira, J. W. Hermann, and E. Lin, "Rescheduling manufacturing systems: a framework of strategies, policies and methods", Journal of Scheduling, 6(1), 2003, pp. 36-92.
[12] D. Ouelhadj and S. Petrovic, "A survey of dynamic scheduling in manufacturing systems", Journal of Scheduling, 2008.
[13] P. B. Luh, F. Liu and B. Moser, "Scheduling of design projects with uncertain number of iterations", European Journal of Operational Research, Vol. 113, Issue 3, 1999, pp. 575-592.
[14] O. Lambrechts, E. Demeulemeester, W. Herroelen, "Proactive and reactive strategies for resource-constrained project scheduling with uncertain resource availabilities", Journal of Scheduling, Vol. 11, No. 2, 2008, pp. 121-136.
[15] S. Liu, K. L. Yung, W. H. Ip, "Genetic Local Search for Resource-Constrained Project Scheduling under Uncertainty", International Journal of Information and Management Sciences, Vol. 18, No. 4, 2007, pp. 347-364.
[16] M. Turnquist and L. Nozick, "Allocating time and resources in project management under uncertainty", Proceedings of the 36th Annual Hawaii International Conference on System Sciences, Island of Hawaii, January 2003.
[17] J. C. Beck, N. Wilson, "Proactive Algorithms for Job Shop Scheduling with Probabilistic Durations", Journal of Artificial Intelligence Research 28 (2007), pp. 183-232.
[18] P. A. Fishwick, "Handbook of dynamic system modeling", Chapman & Hall/CRC Computer and Information Science Series, 2007.
[19] S. Edwards, L. Lavagno, E. A. Lee, and A. Sangiovanni-Vincentelli, "Design of Embedded Systems: Formal Models, Validation, and Synthesis", Proc. IEEE, 85(3), March 1997, pp. 366-390.
[20] L. Lavagno, A. Sangiovanni-Vincentelli, and E. Sentovich, "Models of computation for embedded system design", in A. A. Jerraya and J. Mermet (eds.), System-Level Synthesis, Kluwer, Dordrecht, 1999, pp. 45-102.
[21] A. Jantsch, "Modeling Embedded Systems and SoC's: Concurrency and Time in Models of Computation", Morgan Kaufmann, San Francisco, CA, 2003.
[22] L. A. Cortés, "Verification and Scheduling Techniques for Real-Time Embedded Systems", Ph.D. Thesis No. 920, Dept. of Computer and Information Science, Linköping University, March 2005.
[23] L. A. Cortés, P. Eles and Z. Peng, "A Survey on Hardware/Software Codesign Representation Models", SAVE Project, Dept. of Computer and Information Science, Linköping University, Linköping, June 1999.
[24] C. Rust, F. J. Rammig, "A Petri Net Based Approach for the Design of Dynamically Modifiable Embedded Systems", DIPES 2004, pp. 257-266.
[25] M. Tavana, "Dynamic process modelling using Petri nets with applications to nuclear power plant emergency management", Int. J. Simulation and Process Modelling, Vol. 4, No. 2, 2008, pp. 130-138.
[26] F.-J. Rammig, C. Rust, "Modeling of Dynamically Modifiable Embedded Real-Time Systems", WORDS Fall 2003, pp. 28-34.
[27] E. Badouel and J. Oliver, "Reconfigurable Nets, a Class of High Level Petri Nets Supporting Dynamic Changes", in Proc. of a workshop within the 19th Int'l Conf. on Applications and Theory of Petri Nets, 1998.
[28] M. Garey and D. Johnson, "Computers and Intractability: A Guide to the Theory of NP-completeness", Freeman, 1979.
[29] Z. A. Mann, A. Orbán, "Optimization problems in system-level synthesis", Proceedings of the 3rd Hungarian-Japanese Symposium on Discrete Mathematics and Its Applications, Tokyo, Japan, 2003.
[30] F. Chauvet, J.-M. Proth, "The PERT Problem with Alternatives: Modelisation and Optimisation", Report No. RR-3651, SAGEP (INRIA Lorraine), France, 1999.
[31] O. Sinnen, "Task Scheduling for Parallel Systems", Wiley Series on Parallel and Distributed Computing, Wiley-Interscience, 2007.
[32] T. Murata, "Petri nets: Properties, Analysis and Applications", Proceedings of the IEEE, 77(4), April 1989, pp. 541-574.
[33] J. L. Peterson, "Petri net theory and the modeling of systems", Prentice Hall PTR, 1981.



New Three-level Resource Management for Off-line
Placement of Hardware Tasks on Reconfigurable Devices
Ikbel Belaid, Fabrice Muller
University of Nice Sophia-Antipolis, LEAT-CNRS, France
e-mail: belaid@unice.fr, fmuller@unice.fr

Maher Benjemaa
Research Unit ReDCAD, National Engineering School of Sfax, Tunisia
e-mail: Maher.Benjemaa@enis.rnu.tn
Abstract—FPGA devices are widely used in reconfigurable computing systems; these devices can achieve high flexibility by adapting themselves to various applications through dynamically reconfiguring portions of their dedicated resources. This adaptation to the application requirements reaches better performance and efficient resource utilization. However, run-time partial reconfiguration brings a more complex partitioning of the FPGA reconfigurable area. This issue implies that an efficient task placement algorithm is required. Many on-line and off-line algorithms designed for such partially reconfigurable devices have been proposed to provide efficient hardware task placement. In these previous proposals, the quality of hardware task placement is measured by the resource wastage and task rejection, and most of these research works disregard the configuration overhead. Moreover, these algorithms optimize these criteria separately and do not satisfy all goals. These considerations cannot reflect the overall placement quality. In this paper, we are interested in the off-line placement of hardware tasks in partially reconfigurable devices, and we propose a novel three-level resource management for hardware task placement. The proposed off-line resource management is based on a mixed integer programming formulation and enhances the placement quality, which is measured by the rate of task rejection, resource utilization and configuration overhead.

Keywords-hardware task placement; mixed integer programming; run-time reconfiguration

I. INTRODUCTION
FPGA devices get faster and larger due to the high density of their heterogeneous resources. Consequently, the number and the complexity of the modules loaded on them increase; hence better performance can be achieved by exploiting FPGAs in reconfigurable computing systems. Furthermore, the ability to reconfigure the FPGA partially while it is running speeds up the applications in reconfigurable systems. The technique of run-time partial reconfiguration also improves the performance of hardware task scheduling. For hardware tasks, placement and scheduling are strongly linked; the scheduler decision should be taken in accordance with the ability of the placer to allocate free resources in the reconfigurable hardware device to the elected task. While some proposed techniques increase the performance of scheduling as well as of the application, these techniques suffer from placement problems: resource wastage, task rejection and configuration overheads. In this paper, within the FOSFOR project (1), we focus mainly on an efficient management of hardware tasks on reconfigurable hardware devices by taking advantage of run-time reconfiguration. Nowadays, heterogeneous SRAM-based FPGAs are the most prominent reconfigurable hardware devices. In this work, we target the recent Xilinx column-based FPGAs to optimize the quality of hardware task placement, basing our approach on a mixed integer programming formulation and using powerful solvers which rely on the complete non-exhaustive resolution method called Branch and Bound. Experiments are conducted on an application of heterogeneous tasks; an improvement in placement quality was shown, with an average resource utilization rate of 30 %, which achieves up to 27 % of resource gain compared to a static design. In the worst case, the resulting configuration overhead is 10 % of the total running time, and we discarded the issue of task rejection.
The rest of the paper is organized as follows: the next section reviews some related work on hardware task placement. Section 3 details our three-level off-line strategy of resource management on FPGA. Section 4 depicts the formulation of the placement problem as mixed integer programming. Section 5 describes the obtained results and the evaluation of placement quality. Concluding remarks and future works are presented in section 6.

(1) FOSFOR (Flexible Operating System FOr Reconfigurable platforms) is a French national program (ANR) targeting the most evolved technologies. Its main objective is to design a real-time operating system distributed on hardware and software execution units which offers the required flexibility to application tasks through the mechanisms of dynamic reconfiguration and homogeneous Hw/Sw OS services.

II. RELATED WORK
The placement problem consists of two main functions: i) the partitioning, which handles the free resource space to identify the potential sites for hardware task execution, and ii) the fitting, which selects the feasible placement solution. Many research groups have investigated the placement of hardware tasks on FPGAs. Current strategies dealing with task placement are divided into two categories: off-line placement and on-line placement.



A. On-line Methods for Hardware Task Placement
The main reference is [1], where Bazargan et al. propose an on-line scenario and introduce two partitioning techniques: Keeping All Maximal Empty Rectangles and Keeping Non-overlapping Empty Rectangles. Both techniques manage the free resource space to search for empty holes. Nevertheless, the fitting is conducted by bin-packing rules. In [2], Walder et al. deal with 2D on-line placement by relying on efficient partitioning algorithms that enhance Bazargan's partitioning, such as On-The-Fly partitioning. Walder's partitioner delays the split decision instead of using Bazargan's split heuristics, and uses a hash matrix data structure that finds a feasible placement in constant time. Ahmadinia et al. present in [3] a new method of on-line placement that manages the occupied space on the device and fits the tasks on the sites by means of the Nearest Possible Position method, which reduces the communication cost. Some metaheuristics are adopted to resolve hardware task placement, such as [4], which employs an on-line task rearrangement using a genetic algorithm approach. In [4], when a newly arriving task cannot be placed immediately by the first-fit strategy, the proposed approach, combining two genetic algorithms, allows task rotation and tries to rearrange a subset of the tasks executing on the FPGA to allow the processing of the pending task sooner.

B. Off-line Methods for Hardware Task Placement
In the off-line scenario for hardware task placement, [1] defines 3D templates in time and space dimensions and uses simulated annealing and greedy search heuristics. By considering the placement of hardware tasks as rectangular items on the hardware device as a rectangular unit, several approaches for resolving the two-dimensional packing problem have been proposed. For example, in [5], the off-line approximate heuristics Next-Fit Decreasing Height, First-Fit Decreasing Height and Best-Fit Decreasing Height are presented as strip packing approaches based on packing items by levels. Lodi et al. also propose in [6] and [7] different off-line approaches to resolve hardware task placement as a 2D bin-packing problem, for instance the Floor-Ceiling algorithm and the Knapsack packing algorithm. The Knapsack packing algorithm proposed in [7] initializes each level with the tallest unpacked item and completes it by packing tasks as the associated Knapsack problem that maximizes the total area within the level. In [8], as a bin packing problem, an off-line approach is proposed by Fekete et al. through a graph-theoretical characterization of the packing of a set of items into a single bin. Tasks are presented as three-dimensional boxes and the feasible packing is decided by the orthogonal packing problem within a given container. Their approach considers packing classes, precedence constraints and the edge orientation to solve the packing problem. Similarly, in [9], Teich et al. define task placement as a more-dimensional packing problem. From a set of tasks modeled as 3D polytopes with two spatial dimensions and the time of computation, and basing on packing classes as well as on a fixed scheduling, they search for a feasible placement on a fixed-size chip to accommodate a set of tasks. The resolution is performed by the Branch and Bound technique to reach optimality of the dynamic hardware reconfiguration. Optimizing the total execution time and the resource utilization, the method of placement in [10], proposed by Danne and Stuehmeier, consists of two phases. The first phase is a recursive bi-partitioning by means of a slicing tree, which defines the relative position of each hardware task towards the other hardware tasks and finds the appropriate room in the reconfigurable device for each hardware task according to the task's resources and inter-task communication. The second phase uses the obtained room topology to achieve the sizing, which computes the possible sizes for each room.
Most of the existing strategies provide a non-guaranteed system, as they suffer from task rejection and resource wastage. Most of the proposed placement methods are applicable only to homogeneous devices and address near-identical and non-preemptive hardware tasks. As we have full knowledge of the set of hardware tasks and of the features of the reconfigurable device, in this work we present a realistic three-level resource management solution as a new strategy to perform off-line placement of hardware tasks in a heterogeneous FPGA.

III. THREE-LEVEL OFF-LINE RESOURCE MANAGEMENT
In our three-level resource management, we rely on the features of the hardware tasks and of the reconfigurable hardware device. We use Xilinx's Virtex FPGA as the reference hardware reconfigurable device to lead our hardware resource management study.

A. Terminology
We define a few terms which are used to describe the three-level resource management. Throughout the paper, the number of tasks is denoted by NT, NZ is the number of Reconfigurable Zones (RZ), NR is the number of Reconfigurable Physical Blocs (RPB) specific to a given RZ and NP is the number of Reconfigurable Bloc (RB) types in the chosen technology. We consider two levels of abstraction.
1) Application Level: Each hardware task (Ti) is featured by its worst case execution time (Ci), its period (Pi) and a set of predefined preemption points (Preempi,l) specified by the designer according to the known states in the task behavior and to the data dependency between these states. The number of preemption points of Ti is denoted by NbrPreempi. In addition, tasks are presented by a set of reconfigurable resources called Reconfigurable Blocs (RBk). The RBs are the physical resources required in the reconfigurable hardware device to achieve task execution, and they define the RB-model of the hardware task as expressed by (1). The determination of the RB-model of hardware tasks is detailed in our work in [11]. The RBs are the smallest reconfigurable units in the hardware device. They are determined according to the available reconfigurable resources in the device and closely match its reconfiguration granularity. Each type of RB is characterized by a specified cost RBCostk, which is defined according to its frequency in the device, its power consumption and its functionality.

(1)

2) Physical Level
a) Reconfigurable Zones (RZ): RZs are virtual blocs customized to model the classes of hardware tasks. RZs separate hardware tasks from their execution units on the reconfigurable device. They are determined through the RB-models of the hardware tasks during step 1 of the first level of resource management. Hence, each RZ (RZj) is depicted by its RB-model as described by (2).

(2)

b) Reconfigurable Physical Blocs (RPB): During the placement, the RB-models of the RZs are fitted on RPBs partitioned on the reconfigurable hardware device. RPBs are 2D physical blocs representing the physical locations of RZs within the reconfigurable area. Each RPB is characterized by four fixed coordinates and is depicted by its RB-model as presented by (3). As RZs are abstractions of hardware task classes, the RPBs are the execution units where the tasks could be placed.

(3)

Fig. 1 illustrates an example of potential RPBs partitioned on the reconfigurable area of the hardware device for fitting an RZ requiring two RB1 and one RB3.
The management of hardware resources in the reconfigurable hardware device to perform off-line placement of hardware tasks consists of three levels, described by the three following sections.

B. Level 1: Off-line Flow of Hardware Task Classification
Level 1 takes the application tasks as input and provides the types and the instances of RZs. It consists of three steps.
1) Step 1: RZ Types Search: It gathers tasks sharing the same types of RBs under the same type of RZ by taking the maximum number of each RB type between tasks. Step 1 is achieved by Algorithm 1.

Algorithm 1. RZ types search.
RZ-reference = 0 // number of RZ types
List-RZ // list of RZ types
For all tasks Ti Do // Ti_RB = {Xi,k RBk}
  RZ = Create new RZ (Xi,k) // RZ = {Xi,k RBk}
  If ((RZ-reference ≠ 0) and (∃n, 1 ≤ n ≤ RZ-reference / ∀k ((Xi,k ≠ 0 and Zn,k ≠ 0) or (Xi,k = 0 and Zn,k = 0)))) then
    // this test checks whether the newly created RZ type already exists in List-RZ
    For all k Do
      Zn,k = max(Xi,k, Zn,k) // update the RB numbers of RZn
  Else
    Increment RZ-reference
    RZ_RZ-reference = RZ // RZ_RZ-reference = {Xi,k RBk}
    Insert(List-RZ, RZ_RZ-reference)
  End If
End For
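Algorithm 1's grouping can be sketched in Python as follows. This is a minimal sketch under assumed data layouts (an RB-model as a dict mapping RB type to count); the function name and the example task set are illustrative, not from the paper.

```python
# Sketch of Algorithm 1 (RZ types search): tasks whose RB-models use the
# same *set* of RB types are grouped under one RZ type, and that RZ type
# keeps the per-type maximum number of RBs over its tasks.

def rz_types_search(tasks):
    """tasks: list of RB-models, each a dict {rb_type: count, ...}."""
    rz_list = []   # List-RZ: one RB-model per RZ type
    for ti in tasks:
        used = {k for k, x in ti.items() if x > 0}
        for rz in rz_list:
            if used == {k for k, z in rz.items() if z > 0}:
                # same RZ type already exists: update per-RB maxima
                for k in used:
                    rz[k] = max(rz[k], ti[k])
                break
        else:
            rz_list.append(dict(ti))   # new RZ type
    return rz_list

# Illustrative task set: two tasks sharing {RB1, RB3}, one using {RB2}.
tasks = [{"RB1": 2, "RB3": 1}, {"RB1": 1, "RB3": 4}, {"RB2": 3}]
print(rz_types_search(tasks))
# [{'RB1': 2, 'RB3': 4}, {'RB2': 3}]
```

As stated next in the text, the number of RZ types is bounded by the number of tasks: each loop iteration either merges into an existing type or creates exactly one new one.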
The maximum number of RZ types is the number of hardware tasks. At the end of step 1, we obtain the task classes (RZj).

Figure 1. Example of RPBs for an RZ (RPB1 = {(0,0), (2,0)}, RPB2 = {(1,0), (2,4)}, RPB3 = {(1,4), (2,5)}).

2) Step 2: Classification of Hardware Tasks: Step 2 starts by computing a cost D between the tasks and each RZ type resulting from step 1. Costs D represent the differences in RBs between tasks and RZs; consequently, they express the resource wastage when a task is mapped to the RZ. Based on the RB-models of task Ti and RZ RZj, cost D is computed as follows, according to two cases. We define by (4)

(4)

Case 1: ∀k, di,j,k ≥ 0: RZj contains a sufficient number of each type of RB (RBk) required by Ti; cost D is equal to the sum of the differences in the number of each RB type between Ti and RZj, weighted by RBCostk, as expressed in (5).



Algorithm 2. Decision of increasing the number of RZs.

(5) Loadm : the load (%) of overloaded RZm


Loadn : the load (%) of non-overloaded RZn
Loadn,i: the load (%) of non-overloaded RZn after adding a section of execution of Ti
Sectioni: the list of combinations of execution sections of task Ti
Case 2: k , di, j, k 0 , the number of RBs required by Ti Exe i: the execution section of T i
p,q,r,j,i,l: naturals
exceeds the number of RBs in the RZj or Ti needs RBk L1 = {loads of overloaded RZj}
which is not included in RZj. In this case cost D is infinite L2 = {loads of non-overloaded RZj}
L3: list of tasks
(see (6)).
Sort L1 in descending order
Sort L2 in ascending order, in case of equality ,
(6) Sort L2 in ascending order according to configuration overhead
For p = 1 to size of L1 Do // Browsing overloaded RZs RZm
RZm = L1(p)
Loadm = load(RZm)
Step 2 assigns each task to the RZ giving the lowest cost D q =1
and by using (7), computes the workload of each RZ While (q ≤ size of L2 and Loadm>100) Do // Browsing non-overloaded RZs RZn
RZn = L2(q)
according to this assignment. Loadn = load(RZn)
// Search {Ti } from RZm to migrate to RZn
{ i} assigned to RZm/D(Ti , RZn)≠∞) then
If ( {T
Sort {Ti} in ascending order according to D(Ti , RZn) in L3
(7) r=1
While ((r ≤ size of L3 ) and (Loadm > 100)) Do // Browsing {Ti }
Ti = L3(r)
l=1
While (l <= size(Sectioni) and Loadm >100) Do
Overheadj denotes the configuration overhead of RZj on // Checking the possibility of relocation of the sections of
the target technology. This overhead is computed by Ti by respecting the load of RZn
select the first execution section Exe i and discard it from Sectioni
conducting the whole Xilinx partial reconfiguration flow Loadn,i = Loadn + Exe i /Pi + Overheadn/P i
from the floorplanning of RZj on the device up to partial If (Loadn,i <= 100) then // Migration of Exe i from RZm to RZn is accepted
Loadm = Loadm – Exe i/P i - Overheadm/Pi // Removing Exe i from RZm
bitstream creation and by taking into account the Loadn = Loadn,i // Migration of Exe i to RZn
configuration frequency (frequency) and the width of the End If
l++
selected configuration port (port width) as expressed by End While
(8). r++
End While
End If
(8) q++
End While
If (Loadm > 100) then
3) Step 3: Decision of Increasing the Number of RZs New RZm * (┌Loadm/100┐- 1) // Adding new RZm
Reinitialize the load of {RZn}when it does not affect the number of added RZm
This step is performed when an overload (>100%) is End If
detected within some RZs after step 2. Step 3 lightens the End For

overload in RZs by migrating some execution sections D. Level 3: Two-level Fitting


(Exei) defined by the predefined preemption points of its
Level 3 consists of two independent sub-levels. The
assigned tasks to the non-overloaded RZs giving finite D
first one ensures the fitting of RZs on the most suitable
with them. When the overload persists, step 3 increments
non-overlapped RPBs in terms of resource efficiency. The
the number of overloaded RZ till covering its overload.
second sub-level performs the mapping of tasks to RZs
Step 3 is conducted by means of Algorithm 2.
according to their preemption points by avoiding the RZ
As generic placement, our off-line placement includes the
overload and by guaranteeing the achievement of task
main functions of placement: partitioning and fitting
execution. Task mapping is based on run-time partial
fulfilled by the two following levels of resource
reconfiguration and promotes solutions of lowest cost D
management.
and reducing overheads.
C. Level 2: Partitioning of RPBs on the Target Device As proved in our work [12], the placement problem is
In this level, for each RZ resulting from level 1, level 2 NP-complete problem. Its search space grows
searches all its potential physical sites partitioned on the exponentially with the number of tasks and RZs. Level 2
device which are RPBs. During RPB partitioning, we must and level 3 depict the principle functions of off-line
take into account the heterogeneity of the target device. In placement of hardware tasks and level 1 is a pre-placement
fact, the RPBs must contain all the types of RBs required analysis. Placement problem is a combinatory optimization
by the RZ and the number of RBs in RPBs must be greater problem, it uses discrete solution set, chooses the best
than or equal to the number of RBs in RZs. solution out of all possible combinations and aims the
optimization of multi-criteria function. Consequently, level
2 and level 3 are formulated as mixed integer programming

32 May 17-19, 2010, Karlsruhe, Germany ReCoSoC 2010


in the following section as it uses some binary and natural
(12)
variables.
IV. MIXED INTEGER PROGRAMMING FOR HARDWARE TASK PLACEMENT

Level 2 and level 3 are modeled by the quadruplet (V,X,C,F).

Constants (V)
NT, NZ, NP
Task features, RZ features, RB features
Device features: width, height, Device_RB

Variables (X)
RPBj features for each RZj:
(Xj,Yj): the coordinates of the upper left vertex of RPBj
WRPBj: the abscissa of the upper right vertex of RPBj
HRPBj: the ordinate of the bottom left vertex of RPBj
Task preemption points:
PreempUnicityj,i,l: Boolean variable that controls whether the mapping of Preempi,l of Ti is performed on RZj
SumPreempj,i: the sum of the preemption points of Ti performed within RZj, expressed by (9).

(9)

Occupationj,i: the mapping of the preemption points of tasks to RZs produces the occupation rates of tasks Ti in RZj, which are computed as in (10).

(10)

AverageLoad: after preemption point mapping, the average of the RZ workloads is calculated by (11).

(11)

Constraints (C)

RPB coordinates domain (CP1): the values of the RPB coordinates are limited by the width and the height of the device.

(12)

Heterogeneity constraint (CP2): during RPB partitioning and RZ fitting, this constraint claims that the number of RBs in the RPBs is greater than or equal to those in the RZs, as formulated by (13).

(13)

Non-overlapping between RPBs (CP3): as expressed by (14), this constraint restricts the fitting of RZs to non-overlapped RPBs.

(14)

Non-overload in RZs (CM1): during the fitting of tasks Ti on RZj, each RZj must not be overloaded (see (15)).

(15)

Infeasibility of mapping for preemption points (CM2): this constraint repeals the mapping of preemption points of tasks to RZj giving infinite cost D (see (16)).

(16)

Uniqueness of preemption points (CM3): as explained by (17), each preemption point of task Ti must be mapped to a unique RZj. This constraint also guarantees the achievement of task execution and discards the problem of task rejection.

(17)

Minimization Objective Function

During our resolution, we considered two sub-problems: the partitioning of RPBs on the device, resolved simultaneously with the fitting of RZs on the selected RPBs, and the mapping of tasks to their appropriate RZs. The selection of the best solutions for both sub-problems is guided by the following objective function:

F = PlaceFunction + MappingFunction

PlaceFunction focuses on the sub-problem of fitting RZs on the most suitable RPBs partitioned on the device. By respecting the heterogeneity and non-overlapping constraints, PlaceFunction promotes the fitting of RZs on RPBs that strictly contain the number and type of RBs required by the RZs. As expressed by (18), PlaceFunction evaluates the resource efficiency of the RZ fitting on the selected RPBs on the heterogeneous device.
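Constraints CP1 and CP3 are plain rectangle geometry over the RPB corner variables (Xj, WRPBj, Yj, HRPBj). The sketch below is our own encoding, not the paper's model; the device dimensions in the example are arbitrary placeholders, while the RPB corners follow TABLE III later in the paper.

```python
def cp1_in_device(x, w, y, h, dev_w, dev_h):
    """CP1: the RPB corner coordinates (here 1-based, as in
    TABLE III) must lie within the device boundaries."""
    return 1 <= x <= w <= dev_w and 1 <= y <= h <= dev_h

def cp3_disjoint(a, b):
    """CP3: two RPBs (x, w, y, h) must not overlap.
    They are disjoint iff separated on at least one axis."""
    ax, aw, ay, ah = a
    bx, bw, by, bh = b
    return aw < bx or bw < ax or ah < by or bh < ay

# RPB2 = (34, 45, 1, 2) and RPB4 = (34, 45, 3, 5) share the same
# x-range but are separated vertically, so CP3 holds.
print(cp3_disjoint((34, 45, 1, 2), (34, 45, 3, 5)))  # True
print(cp1_in_device(25, 28, 1, 6, 100, 50))          # True
```

In the actual mixed integer program these predicates become linear inequalities over the decision variables rather than runtime checks.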



(18)

MappingFunction deals with the fitting of the preemption points of hardware tasks on the RZs by respecting the three last constraints and by optimizing the three following criteria measuring mapping quality:

MappingFunction = Map1 + Map2 + Map3

By means of (19), Map1 targets the full exploitation of the RZs by bringing their workloads close to 100%. Map1 also aims at the load balancing of the RZs by minimizing the variance of their workloads with respect to the AverageLoad.

(19)

In (20), Map2 computes the overhead resulting from task mapping. Map2 takes into account all the possible preemption points, even the successive ones within the same task, in order to obtain the worst-case overhead. In fact, the scheduler could preempt a task at these successive preemption points in the same RZ in favor of a higher-priority task. Minimizing Map2 promotes the solutions that map tasks to the RZs providing the lowest overhead.

(20)

The goal of Map3, expressed by (21), is to map tasks with high occupation rates to the RZs providing the lowest cost D. The benefit of Map3 is the optimization of resource use, since cost D considers the weight of each resource in terms of its frequency on the device and the importance of its functionality. As cost D reveals the resource wastage when a task is mapped to an RZ, minimizing Map3 ensures the optimized use of costly resources by avoiding the mapping of tasks with low occupation rates to RZs that include costly resources.

(21)

V. RESULTS AND PLACEMENT QUALITY EVALUATION

A. Proposed Application

The experiments in this section deal with the effect of the three-level off-line resource management on an application composed of seven heterogeneous tasks. As shown in Fig. 2, our application contains varied-size tasks with heterogeneous resources, which are considered the main functions in current real-time applications. The application consists of a microcontroller (T48) that guides the remaining part of the application and ensures the hardware task configuration as well as the data flow synchronization. The MDCT task computes the modified discrete cosine transform, which is the main function in JPEG compression. The JPEG task performs hardware compression of 24 frames per second by using the data provided by the MDCT task. AES (Advanced Encryption Standard) encrypts the information resulting from the JPEG task by processing blocks of 128 bits with a 256-bit key. The VGA task drives VGA monitors and can display one picture on the screen, either of chars, color waveforms or a color grid. We did not consider the communication latency between the microcontroller and the other hardware tasks and focused only on finding an efficient placement for the hardware tasks.

Figure 2. Hardware tasks of the application.

At design time, we synthesized the hardware resources of these hardware tasks by means of the Xilinx ISE 11.3 tool and chose the Xilinx Virtex 5 SX50 as the reconfigurable hardware device. In Virtex 5 technology [13], there are four main resource types: CLBL, CLBM, BRAM and DSP. Considering the reconfiguration granularity, the RBs in Virtex 5 are vertical stacks composed of the same type of resources: RB1 (20 CLBMs), RB2 (20 CLBLs), RB3 (4 BRAMs) and RB4 (8 DSPs). We have assigned 20, 80, 192 and 340 as the RBCost of RB1, RB2, RB3 and RB4, respectively. The Virtex 5 FPGA and the hardware tasks are modeled with their RB-models. Configuration overheads are determined by considering that each task defines an RZ. The partial reconfiguration flow provided by the Xilinx PlanAhead 11.3 tool enables the floorplanning of hardware tasks on the chosen device to create their bitstreams independently for estimating configuration overheads. We rely on a parallel 8-bit-width configuration port and use 100 MHz as the configuration clock frequency. Preemption points are determined arbitrarily according to the granularity of the hardware tasks and their Ci. For all tasks, we consider that the first preemption point is equal to 0 μs. The features of the hardware tasks and their instances are presented in TABLE I.

B. Obtained Results

The pre-placement analysis performed by level 1 produces the set of RZ types according to the RB requirements of the hardware tasks. Thus, following step 1 of level 1, the RB-models of the obtained RZ types are indicated on the RZs line in TABLE I.
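Level 1 derives each RZ type from the RB requirements of the tasks grouped into it; as explained later for RZ1, the maximum requirement of each RB type across the grouped tasks is taken. A hedged sketch of that step-1 rule (our own encoding, with RB-models from TABLE I):

```python
def rz_type_model(task_models):
    """Step-1 style RZ-type construction: for the tasks grouped
    into one RZ type, take the maximum requirement per RB type."""
    rz = {}
    for model in task_models:
        for rb_type, count in model.items():
            rz[rb_type] = max(rz.get(rb_type, 0), count)
    return rz

# MDCT and VGA build RZ1; MDCT dominates every RB type, so the
# RZ1 RB-model equals the MDCT RB-model.
mdct = {"RB1": 2, "RB2": 12, "RB3": 3, "RB4": 0}
vga = {"RB1": 2, "RB2": 4, "RB3": 1, "RB4": 0}
print(rz_type_model([mdct, vga]) == mdct)  # True
```

When no single task dominates all RB types, the resulting RZ model mixes maxima from different tasks, which is exactly the case where the paper recomputes the configuration overhead for the RZ type.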



TABLE I. FEATURES OF HARDWARE TASKS

                      MDCT      AES      VGA     T48      JPEG
Instances             T1, T2    T3       T4      T5       T6
RB-model              2 RB1,    4 RB1,   2 RB1,  5 RB1,   8 RB1,
                      12 RB2,   7 RB2,   4 RB2,  4 RB2,   12 RB2,
                      3 RB3,    1 RB3,   1 RB3,  0 RB3,   0 RB3,
                      0 RB4     1 RB4    0 RB4   0 RB4    2 RB4
WCET (μs)             40552     44540    2500    20000    350000
Period (μs)           416666    200000   10000   50000    416666
Overhead (μs)         3215      1980     681     721      2968
Preemption            10000,    30000,   1650,   5000,    200000,
points (μs)           20000,    40000    2000    10000,   300000
                      30000                      15000
RZs                   RZ1       RZ2      RZ1     RZ3      RZ4

WCET: Worst Case Execution Time of the task; Period: the period of the task, which is equal to its deadline; Overhead: configuration overhead of the task in Virtex 5 SX50 with a parallel 8-bit-width port (100 MHz); Preemption points: points in time taken from the WCET, predefined by the designer; RZs: the RZ assigned to the task in step 1 of level 1.

In this application, RZ1 is created by the MDCT type (T1 and T2) and the VGA type (T4). As explained in step 1, when different types of tasks construct the RZ type, the maximum number of each RB type must be taken between these tasks. In the case that the maximum numbers of distinct RB types are produced by different tasks, the whole partial reconfiguration flow for this RZ type must be performed to recompute its configuration overhead. For RZ1, the RB-model and the configuration overhead are taken from the MDCT type, as it gives the maximum number of RBs.

TABLE II describes the obtained RZ types, their workloads (%) and the costs D between tasks and RZs. The workloads of the obtained RZs are computed by assigning to each RZ the tasks giving the lowest cost D with them, presented by the bold numbers in TABLE II. An overload in RZ2 is detected after step 2 of level 1 and is due to the execution times of T3 and T4 as well as to the configuration overhead of RZ2. Step 3 of level 1 resolves this overload in RZ2 by migrating the first execution section of T4, marked by the first preemption point (0 μs) and the second preemption point (1650 μs), to RZ1. RZ1 is the unique RZ type that could accept T4, as it is non-overloaded and it gives a finite cost D (1024) with it. Step 3 decides this task relocation instead of adding another RZ2 to resolve its overload efficiently.

TABLE II. STEP 1 AND STEP 2 RESULTS

             MDCT      AES    VGA    T48    JPEG
             {T1,T2}   {T3}   {T4}   {T5}   {T6}
RZ1 (26%)    0         ∞      1024   ∞      ∞
RZ2 (110%)   ∞         0      620    ∞      ∞
RZ3 (46%)    ∞         ∞      ∞      0      ∞
RZ4 (86%)    ∞         ∞      ∞      1380   0

Level 2 and level 3 of our off-line resource management are resolved by means of the powerful solvers provided by the AIMMS environment (www.aimms.com), which relies on the Branch and Bound method [14] that guarantees the optimal solution. We have considered two independent sub-problems. The first one ensures the partitioning of RPBs on the device for all the RZs provided by level 1, combined with the fitting of the RZs on the most suitable RPBs, by respecting the constraints CP1, CP2 and CP3 and by optimizing the objective expressed by PlaceFunction. This first sub-problem was modeled as a mixed integer linear program and was resolved after 3 minutes by a 2 GHz CPU with 2 GB of RAM. The second sub-problem consists in mapping the tasks to the most appropriate RZs according to their predefined preemption points, by satisfying the constraints CM1, CM2 and CM3 and by promoting the solution that optimizes the objectives expressed by Map1, Map2 and Map3. The task mapping sub-problem was formulated as a mixed integer non-linear program and was resolved after 1 second.

For the first sub-problem, TABLE III shows the RZ fitting on the selected RPBs defined by their coordinates.

TABLE III. RPBS FOR RZ FITTING

        Xj    WRPBj    Yj    HRPBj
RPB1    25    28       1     6
RPB2    34    45       1     2
RPB3    1     3        1     4
RPB4    34    45       3     5

In TABLE IV, the costs (Δ) expressing the differences in RBk between the RZs and their associated RPBs, obtained after resolution, depict the resource efficiency. The obtained results of RZ fitting provide an average resource utilization of 30% of the available RBs in the heterogeneous device. This resource utilization achieves up to 27% of resource gain compared to a static design. The static design is obtained by fitting each instance of each hardware task on its own RPB without using the concept of dynamic partial reconfiguration.

TABLE IV. RESOURCE EFFICIENCY

        RB1    RB2    RB3    RB4    Δ
RPB1    6      12     6      0      4 RB1, 3 RB3
RPB2    12     8      2      2      8 RB1, 1 RB2, 1 RB3, 1 RB4
RPB3    8      4      0      0      2 RB1
RPB4    18     12     3      3      10 RB1, 3 RB3, 1 RB4

Figure 3 shows the floorplanning of the RPBs in Virtex 5 SX50 according to the obtained RPB coordinates.

The mapping of the task preemption points is detailed in TABLE V. Ti,x depicts the x-th execution section of Ti. Tasks T1, T2, T3, T5 and T6 are efficiently mapped to their optimal RZs by optimizing the objectives expressed by (20) and (21) of reducing overheads and the use of costly resources. For T4, the analytic resolution assigns more execution sections of this task to its optimal RZ RZ2 (80%) than to RZ1 (20%). The two first sections of T4 (T4,1, T4,2)



marked by its first preemption point (0 μs) and its third preemption point (2000 μs), are mapped to RZ2. The last execution section of T4 (T4,3) is fitted on RZ1. On the same RZ, the tasks are scheduled by respecting their deadlines and are preempted at their predefined preemption points. Moreover, we considered that the execution sections delimited by the preemption points within a task are independent. Effectively, there is no need to exchange data or to send synchronization resources between these execution sections.

Figure 3. RZ fitting on Virtex 5 SX50.

TABLE V. MAPPING OF PREEMPTION POINTS

RZ1    T1,1, T1,2, T1,3, T1,4    100% of T1
       T2,1, T2,2, T2,3, T2,4    100% of T2
       T4,3                      20% of T4
RZ2    T3,1, T3,2, T3,3          100% of T3
       T4,1, T4,2                80% of T4
RZ3    T5,1, T5,2, T5,3, T5,4    100% of T5
RZ4    T6,1, T6,2, T6,3          100% of T6

After the mapping of the preemption points of the tasks to the RZs fitted on the reconfigurable device, the analytic resolution produces 50623 μs of total configuration overhead, which represents 10% of the total running time. To optimize the objective of load balancing and full exploitation of the RZs expressed by (19), the Branch and Bound method converges to an average workload of 70%. The problem of task rejection is discarded, as the mapping resolution guarantees an execution unit (RZ) for all execution sections of the tasks, as expressed by CM3 in (17).

VI. CONCLUSION AND FUTURE WORK

In this paper, we proposed a new three-level resource management targeting the enhancement of placement quality for the off-line placement of hardware tasks. By adopting run-time partial reconfiguration and mixed integer programming, we improved the quality of placement in terms of resource efficiency, configuration overhead and the exploitation of RZs. The problem of task rejection is discarded. Future work targets directed acyclic graphs and involves adding precedence constraints as well as deadline and periodicity constraints to achieve an off-line mapping/scheduling of hardware tasks on reconfigurable hardware devices.

REFERENCES
[1] K. Bazargan, R. Kastner, and M. Sarrafzadeh, "Fast Template Placement for Reconfigurable Computing Systems," IEEE Design and Test, Special Issue on Reconfigurable Computing, vol. 17, pp. 68-83, January 2000.
[2] H. Walder, C. Steiger, and M. Platzner, "Fast online task placement on FPGAs: free space partitioning and 2D-hashing," International Parallel and Distributed Processing Symposium (IPDPS'03), p. 178, April 2003.
[3] A. Ahmadinia, C. Bobda, M. Bednara, and J. Teich, "A New Approach for On-line Placement on Reconfigurable Devices," International Parallel and Distributed Processing Symposium (IPDPS'04), p. 134, April 2004.
[4] H. ElGindy, M. Middendorf, H. Schmeck, and B. Schmidt, "Task rearrangement on partially reconfigurable FPGAs with restricted buffer," Field Programmable Logic and Applications, pp. 379-388, August 2000.
[5] A. Lodi, S. Martello, and M. Monaci, "Two-dimensional packing problems: A survey," European Journal of Operational Research, vol. 141, pp. 241-252, March 2001.
[6] A. Lodi, S. Martello, and D. Vigo, "Neighborhood search algorithm for the guillotine non-oriented two-dimensional bin packing problem," Meta-heuristics: Advances and Trends in Local Search Paradigms for Optimization, pp. 125-139, July 1997.
[7] A. Lodi, S. Martello, and D. Vigo, "Heuristic and metaheuristic approaches for a class of two-dimensional bin packing problems," INFORMS Journal on Computing, vol. 11, pp. 345-357, 1999.
[8] S.P. Fekete, E. Kohler, and J. Teich, "Optimal FPGA module placement with temporal precedence constraints," Design Automation and Test in Europe, pp. 658-665, March 2001.
[9] J. Teich, S.P. Fekete, and J. Schepers, "Optimization of dynamic hardware reconfiguration," The Journal of Supercomputing, vol. 19, pp. 57-75, 2001.
[10] K. Danne and S. Stuehmeier, "Off-line placement of tasks onto reconfigurable hardware considering geometrical task variants," International Federation for Information Processing, 2005.
[11] I. Belaid, F. Muller, and M. Benjemaa, "Off-line placement of hardware tasks on FPGA," 19th International Conference on Field Programmable Logic and Applications (FPL'09), pp. 591-595, September 2009.
[12] I. Belaid, F. Muller, and M. Benjemaa, "Off-line placement of reconfigurable zones and off-line mapping of hardware tasks on FPGA," Design and Architectures for Signal and Image Processing, September 2009.
[13] "Virtex-5 FPGA Configuration User Guide," Xilinx, August 2009.
[14] J. Clausen, "Branch and Bound Algorithms: Principles and Examples," March 1999.



Exploration of Heterogeneous FPGA Architectures
Umer Farooq, Husain Parvez, Zied Marrakchi and Habib Mehrez
LIP6, Université Pierre et Marie Curie
4, Place Jussieu, 75005 Paris, France
Email: umer.farooq@lip6.fr

Abstract—Heterogeneous FPGAs are commonly used in industry and academia due to their superior area, speed and power benefits as compared to their homogeneous counterparts. The layout of these heterogeneous FPGAs is optimized by placing hard-blocks in distinct columns. However, communication between hard-blocks that are placed in different columns requires additional routing resources; thus the overall FPGA area increases. This problem is further aggravated when the number of different types of hard-blocks, placed in distinct columns, increases in an FPGA. This work compares the effect of different floor-plannings on the area of island-style FPGA architectures. A tree-based architecture is also presented; unlike island-style architectures, the floor-planning of heterogeneous tree-based architectures does not affect their routing requirements. Different FPGA architectures are evaluated for three sets of benchmark circuits, which are categorized according to their inter-block communication trend. The island-style column-based floor-planning is found to be 36%, 23% and 10% larger than a near-ideal non-column-based floor-planning for the three sets of benchmarks. Column-based floor-planning is also found to be 18%, 21% and 40% larger than the tree-based FPGA architecture for the same benchmarks.

I. INTRODUCTION

During the recent past, embedded hard-blocks (HBs) in FPGAs (i.e. heterogeneous FPGAs) have become increasingly popular due to their ability to implement complex applications more efficiently as compared to homogeneous FPGAs. Previous research [1][2][3][4] has shown that embedded HBs in FPGAs result in significant area and speed improvements. The work in [5] shows that the use of HBs in FPGAs reduces the gap between ASICs and FPGAs in terms of area, speed and power consumption. Some commercial FPGA vendors like Xilinx [6] and Altera [7] are also using HBs (e.g. multipliers, RAMs and DSP blocks). This trend has resulted in the creation of domain-specific FPGAs. Domain-specific FPGAs are a trade-off between specialization and flexibility. In domain-specific FPGAs, if an application design that is implemented on the FPGA uses an embedded hard-block, area, speed and power improvements are achieved. However, if embedded blocks remain unused, precious logic and routing resources are wasted. On the other hand, a homogeneous FPGA has no such problem but can result in a higher area, lower speed and more power consumption for the implementation of the same design.

Almost all the work cited above considers island-style FPGAs as the reference architecture, where HBs are placed in fixed columns; these columns of HBs are interspersed evenly among columns of configurable logic blocks (CLBs). The main advantage of an island-style, column-based heterogeneous FPGA lies in its simple and compact layout generation. When a tile-based layout for an FPGA is required, the floor-planning of blocks of similar type in a column simplifies the layout generation. The complete width of the entire column, having the same type of blocks, can be adjusted appropriately to generate a very compact layout. However, the column-based floor-planning of FPGA architectures limits each column to supporting only one type of HB. Due to this limitation, the architecture is bound to have at least one separate column for each type of HB, even if the application or group of applications being mapped on it uses only one block of that particular type. This can eventually result in the loss of precious logic and routing resources. This loss can become even more severe with the increase in the number of block types that are required to be supported by the architecture.

This work generates FPGAs using different floor-planning techniques and then compares them to column-based floor-planning. Mainly six floor-planning techniques are explored, four of which are column-based and two are non-column-based. Though the column-based techniques are advantageous in terms of easy and compact layout generation, this advantage can be overshadowed by poor resource utilization. On the other hand, non-column-based floor-planning can give better resource utilization, but at the expense of a difficult layout generation. Also, compact layout generation is not feasible, because the HB dimensions need to be multiples of the smallest block (usually a CLB); eventually there is some area loss.

This work also compares a tree-based heterogeneous FPGA architecture [8] with different floor-planning techniques of the mesh-based heterogeneous FPGA architecture. Contrary to the mesh-based heterogeneous FPGA, the routability of a tree-based FPGA is independent of its floor-planning and of the number of HB types required to be supported by the architecture. So, a tree-based heterogeneous FPGA can be advantageous as compared to a mesh-based FPGA. Two different techniques are explored for the tree-based FPGA architecture. The first technique respects the symmetry of the hierarchy, which is one of the characteristics of tree-based architectures. However, in order to provide only the required amount of HBs, the second technique does not respect the symmetry of the hierarchy.

This paper presents only area results; power and timing comparisons are not considered in this work. The remainder of the paper is organized as follows: section II gives a brief overview of the two FPGA architectures. Section III gives a brief overview of the exploration environments of the two FPGA architectures. Section IV presents the exploration flow. Section V



presents experimental results and section VI finally concludes this paper.

Fig. 1. Mesh-based Heterogeneous FPGA
Fig. 2. Detailed Interconnect of a CLB with its Neighboring Channels

II. REFERENCE FPGA ARCHITECTURES

This section gives a brief overview of the two heterogeneous FPGA architectures that are used in this work.

A. Mesh-based Heterogeneous FPGA Architecture

First of all, the mesh-based heterogeneous FPGA architecture is presented. A mesh-based heterogeneous FPGA architecture contains CLBs, I/Os and HBs that are arranged on a two-dimensional grid. In order to incorporate HBs in a mesh-based FPGA, the size of the HBs is quantized with the size of the smallest block of the architecture, i.e. the CLB. The width and height of an HB are multiples of the smallest block in the architecture. An example of such an FPGA is shown in Figure 1. In a mesh-based FPGA, input and output pads are arranged at the periphery of the architecture. The position of the different blocks in the architecture depends on the floor-planning technique used. A block (referred to as CLB or HB) is surrounded by a uniform-length, single-driver, unidirectional routing network [9]. The input and output pins of a block connect with the neighboring routing channel. In the case where HBs span multiple tiles, horizontal and vertical routing channels are allowed to pass through them. An FPGA tile showing the detailed connection of a CLB with its neighboring routing network is shown in Figure 2. A unidirectional disjoint switch box connects the different routing tracks together. The connectivity of the routing channel with the input and output pins of a block, abbreviated as Fcin and Fcout, is set to 1. The channel width is varied according to the netlist requirements but remains a multiple of 2 [9].

B. Tree-based Heterogeneous FPGA Architecture

A tree-based architecture is a hierarchical architecture having unidirectional interconnect. A generalized example of a tree-based architecture is shown in Figure 3. A tree-based architecture exploits the locality of connections that is inherent in most application designs. In this architecture, CLBs, I/Os and HBs are partitioned into a multilevel clustered structure where each cluster contains sub-clusters, and switch blocks allow connecting external signals to the sub-clusters. A tree-based architecture contains two unidirectional, single-length interconnect networks: a downward network and an upward network, as shown in Figure 4. The downward network is based on the butterfly fat tree topology and allows connecting signals coming from other clusters to its sub-clusters through a switch block. The upward network is based on the hierarchy, and it allows connecting sub-cluster outputs to other sub-clusters in the same cluster and to clusters at other levels of the hierarchy. Figure 3 shows a three-level, arity-4, tree-based architecture. In a heterogeneous tree-based architecture, CLBs and I/Os are normally placed at the bottom of the hierarchy, whereas HBs can be placed at any level of the hierarchy to meet the best design fit. For example, in Figure 3 HBs are placed at level 2 of the hierarchy. In a tree-based architecture, CLBs and HBs communicate with each other using switch blocks that are further divided into downward and upward mini switch boxes (DMSBs & UMSBs). These DMSBs and UMSBs are unidirectional full crossbars that connect signals coming into the cluster to its sub-clusters and signals going out of a cluster to the other clusters of the hierarchy. A tree-based cluster showing the detailed connection of a CLB with its neighboring CLBs is shown in Figure 4. It can be seen from the figure that the DMSBs are responsible for the downward interconnect and the UMSBs are responsible for the upward interconnect, and they are combined together to form the switch block of a cluster. The number of signals entering and leaving a cluster can be varied depending upon the netlist requirements. However, they are kept uniform over all the clusters of a level.
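The clustered hierarchy described above can be indexed arithmetically. As a hedged illustration (the leaf-numbering convention below is our own, not part of the architecture description), the level-ℓ cluster containing a leaf of an arity-4 tree, and the lowest level at which two leaves share a cluster, follow directly from integer division:

```python
def cluster_at_level(leaf_index, level, arity=4):
    """Index of the level-`level` cluster that contains a leaf
    in a tree-based architecture of the given arity."""
    return leaf_index // (arity ** level)

def lca_level(leaf_a, leaf_b, arity=4):
    """Lowest hierarchy level at which two leaves share a
    cluster, i.e. how high a signal must climb in the
    upward network before it can descend again."""
    level = 0
    while cluster_at_level(leaf_a, level, arity) != cluster_at_level(leaf_b, level, arity):
        level += 1
    return level

# In an arity-4 tree, leaves 0 and 3 already share their level-1
# cluster, while leaves 0 and 63 only meet at level 3 (the root
# of a three-level, 64-leaf tree as in Figure 3).
print(lca_level(0, 3))   # 1
print(lca_level(0, 63))  # 3
```

Because this meeting level is bounded by the tree depth, the number of switches a signal crosses is bounded too, which is the predictability argument made for the tree-based architecture.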



Fig. 3. Tree-based Heterogeneous FPGA
Fig. 4. Detailed Interconnect of Level-1 Cluster of Tree-based FPGA

C. Characteristics of Mesh-based and Tree-based Heterogeneous FPGAs

Both tree-based and mesh-based heterogeneous FPGAs have particular characteristics that mainly depend on the basic interconnect structure and the arrangement of the different blocks in the architecture. For example, the major advantage of a tree-based heterogeneous FPGA is its predictable routing and its independence from the types and positions of the blocks supported by the architecture. In a tree-based architecture, the number of paths required to reach a destination is limited, and hence the number of switches crossed by a signal to travel from a source to a destination does not vary greatly. It can be seen from Figure 3 that any CLB can reach an HB by traversing four switches. Unlike tree-based FPGAs, the routability of a mesh-based FPGA depends greatly upon the position of the different blocks in the architecture. In mesh-based FPGAs, routability is not predictable and the number of paths available to reach a destination is almost unlimited. Hence, the number of switches crossed to reach a destination varies with the position of the blocks in the architecture. For example, any CLB in the left-most column of Figure 1 crosses at least eight switches to reach an HB in the second-last column of the architecture. However, this number of switches is reduced to only one if that CLB is placed beside the HB of the second-last column of the architecture. So, floor-planning plays a very important role in column-based island-style heterogeneous FPGAs. This problem is further aggravated for the communication between different HBs placed in different columns.

III. EXPLORATION ENVIRONMENTS

In this section, the exploration environments of the two FPGA architectures are presented. We also describe the different floor-planning techniques that are explored using these exploration environments. Floor-planning techniques can have major implications on the area of a mesh-based FPGA. If a tile-based layout is required for an FPGA, the floor-planning of blocks of similar type in columns can help optimize the tile area of a block. The complete width of a column can be adjusted according to the layout requirements of the similar blocks placed in a column. On the other hand, if blocks of different types are placed in a column, the width of the column cannot be fully optimized, as the column width can only be reduced to the maximum width of any tile in that column. Some unused area will remain in the smaller tiles. Such a problem does not arise if a tile-based layout is not required. In such a case, an FPGA hardware netlist can be laid out using an ASIC design flow.

A. Exploration Environment of Mesh-based FPGA

This work uses the mesh-based architecture exploration environment presented earlier in [10] and further improves it by implementing the Range Limiter [11] and a column-move operation for heterogeneous architectures. An FPGA architecture is initially defined using an architecture description file. BLOCKS of different sizes are defined and later mapped on a grid of equally sized SLOTS, called a SLOT-GRID. Each BLOCK occupies one or more SLOTS. The type of the BLOCK and its input and output PINS are used to find the size of a BLOCK. In a LUT-4 based FPGA, a CLB occupies one slot and an 18x18 multiplier occupies 4x4 slots. The input and output PINS of a BLOCK are defined, and CLASS numbers are assigned to them. PINS with the same CLASS number are considered equivalent; thus a NET targeting a receiver PIN of a BLOCK can be routed to any of the PINS of the BLOCK belonging to the same CLASS. Once the architecture of the FPGA is defined, the benchmark circuit is placed on the architecture.

Placer Operations: A simulated annealing based [12] PLACER is used to perform different operations. The PLACER either (i) moves an instance from one BLOCK to another, (ii) moves a BLOCK from one SLOT position to another, (iii) rotates a BLOCK on its own axis, or (iv) moves a complete column of BLOCKS from one SLOT position to another. After each operation the placement cost is recomputed for all the disturbed nets. Depending on the cost value and the annealing temperature, the operation is accepted or rejected. Multiple netlists can be placed together to get a single architecture floor-planning for all the netlists. For multiple netlist placement, each BLOCK allows multiple instances to be mapped onto it, but multiple instances of the same netlist cannot be mapped on a single BLOCK.

The PLACER performs move and rotate operations on a "source" and a "destination" block.
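The accept/reject decision driven by the cost value and the annealing temperature is, in classic simulated annealing, the Metropolis rule. The paper does not spell out the PLACER's exact acceptance function, so the sketch below is an assumption based on the standard rule, with hypothetical argument names:

```python
import math
import random

def accept_move(delta_cost, temperature, rng=random.random):
    """Classic simulated-annealing acceptance rule: always keep
    improving moves; keep worsening moves with probability
    exp(-delta/T), which shrinks as the anneal cools down."""
    if delta_cost <= 0:
        return True
    return rng() < math.exp(-delta_cost / temperature)

# Improving moves are always accepted; a strongly worsening move
# is practically never accepted once the temperature is tiny.
print(accept_move(-5.0, 1.0))                     # True
print(accept_move(50.0, 1e-9, rng=lambda: 0.5))   # False
```

Allowing occasional uphill moves at high temperature is what lets the placer escape local minima before the schedule freezes the floor-planning.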



Fig. 5. Floor-Planning Techniques

PLACER performs move and rotate operations on a “source” and a “destination” block. When a source is to be moved from one slot position to another, a random SLOT is selected as its destination. The rectangular window starting from this destination slot and having the same size and shape as that of the source is called the destination window, whereas the window occupied by the source is called the source window. Normally the source contains one block, whereas the destination window can contain one or more blocks. Once the source and destination windows are selected, the move operation is performed if (i) the destination window does not exceed the boundaries of the SLOT-GRID, (ii) the destination window does not contain any block that exceeds the boundary of the destination window, and (iii) the destination window does not overlap (partially or fully) the source window. If these conditions are not met, the procedure is repeated until a valid destination window is found. When a block is to be rotated, the source position also becomes its destination position and the block is rotated around its own axis. The block rotate operation becomes important when the pins of a block have different classes; in such a case the size of the bounding box varies depending upon the position and direction of the pins. Multiples of 90° rotation are performed for square blocks, while multiples of 180° are performed for rectangular blocks. A 90° rotation for rectangular blocks requires both move and rotate operations; it is left for future work.

By using the different PLACER operations, six floor-planning techniques are explored. The details of these floor-planning techniques are as follows: (i) Hard-blocks are placed in fixed columns, apart from the CLBs, as shown in Figure 5 (a). This kind of floor-planning can be beneficial for data-path circuits, as described in [13]. It can be seen from the figure that if all HBs of a type are placed and there is still space available in the column, then CLBs are placed in the remaining space of the column in order to avoid wasting resources. (ii) Columns of HBs are evenly distributed among columns of CLBs, as shown in Figure 5 (b). (iii) Columns of HBs are evenly distributed among CLBs; contrary to the first and second techniques, a whole column contains only one type of block. This technique is normally used in commercial architectures and is shown in Figure 5 (c).
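The three move-validity conditions above reduce to axis-aligned rectangle tests. A small illustrative sketch (windows are assumed to be `(x, y, w, h)` tuples on the SLOT-GRID; this is not the authors' implementation):

```python
def window_is_valid(src, dst, grid_w, grid_h, blocks):
    """Check the three PLACER move conditions. `src` and `dst` are
    (x, y, w, h) windows; `blocks` maps block names to their
    (x, y, w, h) footprints on the grid."""
    sx, sy, sw, sh = src
    dx, dy, dw, dh = dst
    # (i) the destination window must stay inside the SLOT-GRID
    if dx < 0 or dy < 0 or dx + dw > grid_w or dy + dh > grid_h:
        return False
    # (ii) no block touching the destination window may stick out of it
    for bx, by, bw, bh in blocks.values():
        overlaps = bx < dx + dw and bx + bw > dx and by < dy + dh and by + bh > dy
        contained = dx <= bx and dy <= by and bx + bw <= dx + dw and by + bh <= dy + dh
        if overlaps and not contained:
            return False
    # (iii) the destination window must not overlap the source window
    if dx < sx + sw and dx + dw > sx and dy < sy + sh and dy + dh > sy:
        return False
    return True
```

A placer would keep drawing random destination slots and re-testing until this predicate holds, exactly as the retry loop in the text describes.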



(iv) The HBs are placed in columns, but the columns are not fixed; rather, they are allowed to move through the column-move operation of PLACER. This technique is shown in Figure 5 (d). (v) In this technique HBs are not restricted to columns; they are allowed to move through the block-move operation, as shown in Figure 5 (e). (vi) The blocks are allowed to move and rotate through the block-move and rotate operations. This floor-planning technique is shown in Figure 5 (f).

TABLE I
DSP BENCHMARKS SET I

Circuit Name  Inputs  Outputs  CLBs (LUT4)  Mult (8x8)  Slansky (16+16)  Sff (8)  Sub (8-8)  Smux (32:16)
FIR           9       16       32           4           3                4        -          -
FFT           48      64       94           4           3                -        6          -
ADAC          18      16       47           -           -                2        -          1
DCU           35      16       34           1           1                4        2          2

B. Exploration Environment of a Tree-based FPGA

A tree-based FPGA is defined using an architecture description file, which contains the different architectural parameters along with the definitions of the different BLOCKS used by the architecture. Once the architecture is defined, PARTITIONER partitions the netlist using a top-down recursive partitioning approach: first the top-level clusters are constructed, and then each cluster is partitioned into sub-clusters until the bottom of the hierarchy is reached [14]. The main objective of PARTITIONER is to reduce communication between the clusters by absorbing the maximum communication inside the clusters. Two simple techniques are explored for the tree-based FPGA. (i) A generalized example of the first technique is shown in Figure 3. This technique is referred to as symmetric (SYM). Here HBs can be placed at whichever level gives the best design fit; however, the symmetry of the hierarchy is respected, which can eventually result in wasted HBs. For example, in Figure 3 it can be seen that the architecture supports 4 HBs of a certain type (because it is an arity-4 architecture); for a netlist requiring only two HBs, the other two HBs remain unused and are wasted. (ii) The second technique is the same as SYM except that the symmetry of the hierarchy is not respected for HBs, and only as many HBs are used as are needed. This technique is referred to as asymmetric (ASYM).

TABLE II
OPENCORE BENCHMARKS SET II

Circuit Name    No of Inputs  No of Outputs  No of LUTs  No of Multipliers (16x16)  No of Adders (20+20)
cf fir 3 8 8    42            18             159         4                          3
cf fir 7 16 16  146           35             638         8                          14
cfft16x8        20            40             1511        -                          26
cordic p2r      18            32             803         -                          43
cordi r2p       34            40             1328        -                          52
fm              9             12             1308        1                          19
fm receiver     10            12             910         1                          20
lms             18            16             940         10                         11
reed solomon    138           128            537         16                         16

TABLE III
OPENCORE BENCHMARKS SET III

Circuit Name        No of Inputs  No of Outputs  No of LUTs  No of Multipliers (18x18)
cf fir 3 8 8        42            22             214         4
diffeq f systemC    66            99             1532        4
diffeq paj convert  12            101            738         5
fir scu             10            27             1366        17
iir1                33            30             632         5
iir                 28            15             392         5
rs decoder 1        13            20             1553        13
rs decoder 2        21            20             2960        9

IV. EXPERIMENTAL FLOW

This section discusses the different types of benchmarks, the software flow and the experimental methodology used to explore the two FPGA architectures.

A. Benchmark Selection

Generally, in academia and industry, the quality of an FPGA architecture is measured by mapping a certain set of benchmarks on it. The selection of benchmarks thus plays a very important role in the exploration of heterogeneous FPGAs. This work puts special emphasis on the selection of benchmark circuits, as different circuits can give different results for different architecture floor-planning techniques. It categorizes the benchmark circuits by the trend of communication between the different blocks of the benchmark, and assembles three sets of benchmarks with distinct trends of inter-block communication. These benchmarks are shown in Tables I, II and III respectively and are obtained from three different sources: the benchmarks shown in Table I are designs developed at Université Pierre et Marie Curie, the benchmarks shown in Table II are obtained from http://www.opencores.org/ and the benchmarks shown in Table III are obtained from http://www.eecg.utoronto.ca/vpr/. The communication between different blocks of a benchmark can be mainly divided into the following four categories:

CLB-CLB: CLBs communicate with CLBs.
CLB-HB: CLBs communicate with HBs and vice versa.
HB-HB: HBs communicate with other HBs.
IO-CLB/HB: I/O blocks communicate with CLBs and HBs.

In the SET I benchmarks, the major percentage of the total communication is between HBs (i.e. HB-HB), and only a small part is covered by CLB-CLB or CLB-HB communication. In SET II, the major percentage of the total communication is between HBs and CLBs, where either HBs are the source and CLBs the destination or vice versa. In SET III, the major percentage of the total communication is covered by CLB-CLB, and only a small part by HB-HB or CLB-HB. For all three sets of benchmarks, IO-CLB/HB normally accounts for only a very small part of the total communication.
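The four-way communication breakdown used to characterise the benchmark sets can be computed directly from a netlist's connection list. An illustrative sketch (function and data layout are hypothetical, not the authors' tooling):

```python
from collections import Counter

def comm_profile(nets):
    """Classify every (driver_type, sink_type) pair of a netlist into
    the four categories used above and return each category's share."""
    def category(a, b):
        if "IO" in (a, b):
            return "IO-CLB/HB"
        if a == b == "CLB":
            return "CLB-CLB"
        if a == b == "HB":
            return "HB-HB"
        return "CLB-HB"           # mixed CLB/HB pair, either direction
    counts = Counter()
    for driver, sinks in nets:
        for sink in sinks:
            counts[category(driver, sink)] += 1
    total = sum(counts.values())
    return {cat: n / total for cat, n in counts.items()}

# Four driver-sink pairs, one per category, each covering 25%.
profile = comm_profile([("HB", ["HB", "CLB"]), ("CLB", ["CLB"]), ("IO", ["CLB"])])
```

A profile dominated by the HB-HB share would put a netlist in SET I, a mixed CLB-HB profile in SET II, and a CLB-CLB-dominated profile in SET III.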



B. Software Flow

The software flow used to place and route the different benchmarks (netlists) on the two heterogeneous FPGAs is shown in Figure 6. The input to the software flow is a VST (structured VHDL) file. This file is converted into BLIF format [15] using a modified version of the VST2BLIF tool. The BLIF file is then passed through PARSER-1, which removes the HBs from the file and adds temporary inputs and outputs in order to preserve the dependence between the HBs and the rest of the netlist. The output of PARSER-1 is then passed through SIS [16], which synthesizes the BLIF file into LUT format; the result is passed through T-VPACK [17], which packs and converts it into .NET format. Finally the netlist is passed through PARSER-2, which adds the previously removed HBs and removes the temporary inputs and outputs. The final netlist in .NET format contains CLB, HB and I/O instances that are connected to each other via NETS. Once the netlist is obtained in .NET format, it is placed and routed separately on the two architectures. The benchmarks shown in Table III do not follow this flow; they are obtained directly in synthesized BLIF format with HBs and are passed through T-VPACK for conversion into .NET format, after which they follow the same flow as the other two sets of benchmarks.

Fig. 6. Software Flow

After the netlists are obtained in .NET format, they are mapped on the FPGA architectures. For the tree-based architecture, the netlist is first partitioned using a software module called PARTITIONER. This module partitions CLBs, HBs and I/Os into different clusters in such a way that the inter-cluster communication is minimized. By minimizing inter-cluster communication we obtain a depopulated global interconnect network and hence reduced area. PARTITIONER is based on hMetis [14]; hMetis generates a good solution in a short time because of its multi-phase refinement approach. Once partitioning is done, a placement file is generated that contains the positions of the different blocks on the architecture. This placement file, along with the netlist file, is then passed to another software module called ROUTER. The ROUTER uses the pathfinder algorithm [18], a negotiation-based iterative approach, to route all the NETS in the netlist. In order to optimize the FPGA architecture, a binary search algorithm is used to determine the minimum number of signals required to route a netlist on the FPGA. Once the optimization is over, the area of the architecture is estimated using an area model based on the symbolic standard cell library SXLIB [19]. The area of the FPGA is estimated by combining the areas of the CLBs, the HBs, the multiplexors of the downward and upward interconnect, and all associated programming bits.

For the mesh-based architecture, the netlist file is passed to a software module called PLACER, which uses a simulated annealing algorithm [12][11] to place CLBs, HBs and I/Os on their respective blocks in the FPGA. The bounding box (BBX) of a NET is the minimum rectangle that contains the driver instance and all receiving instances of the NET. The PLACER optimizes the sum of the half-perimeters of the bounding boxes of all NETS: it moves an instance randomly from one block position to another and updates the BBX cost; depending on the cost value and the annealing temperature, the operation is accepted or rejected. After placement, a software module named ROUTER routes the netlist on the architecture, using the pathfinder algorithm [18] and the FPGA routing resources. In order to optimize the FPGA resources, a binary search algorithm similar to the one used for the tree-based FPGA determines the smallest channel width required to route the netlist. Once this optimization process is over, the area is estimated in the same manner as for the tree-based architecture.

C. Experimental Methodology

In this work, the architecture definition, floor-planning, placement, routing and architecture optimization are performed individually for each netlist. Such an approach is not directly applicable to real FPGAs, whose architecture, floor-planning and routing resources are already fixed. In order to make our results more comparable to real FPGAs, we also generated and optimized a single maximum FPGA architecture for each group of netlists. However, it was noticed that the floor-planning and routing resources were mainly decided by the largest netlist in each group, so the results for the three sets of netlists corresponded roughly to the results for the three largest netlists in the three sets. Therefore, the architecture and floor-planning are optimized individually for each netlist, and the average over all netlists is then reported, which gives more realistic results.
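The channel-width optimisation mentioned in the flow above can be sketched as a lower-bound binary search. This is a generic sketch: `routes_ok` stands in for a full routing attempt and is assumed monotone (if a width routes, every larger width routes too):

```python
def min_channel_width(routes_ok, lo=1, hi=256):
    """Binary-search the smallest channel width W for which the
    router succeeds, calling the (expensive) routing attempt
    `routes_ok(W)` O(log(hi - lo)) times instead of trying every W."""
    assert routes_ok(hi), "even the upper bound fails to route"
    while lo < hi:
        mid = (lo + hi) // 2
        if routes_ok(mid):
            hi = mid          # mid routes: the optimum is at or below mid
        else:
            lo = mid + 1      # mid fails: the optimum is above mid
    return lo
```

Each probe here corresponds to one complete pathfinder routing run, which is why the logarithmic number of attempts matters in practice.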



Fig. 7. Placement Cost Normalized to BMR Floor-Planning

Fig. 8. Channel Width Normalized to BMR Floor-Planning

Fig. 9. Area Results Normalized to BMR Floor-Planning

V. EXPERIMENTAL RESULTS

Experiments are performed for the six floor-planning techniques of the mesh-based FPGA and the two techniques of the tree-based FPGA (described in Section III). The experimental results obtained for the three sets of benchmarks are shown in Figures 7, 8 and 9, where the results for benchmarks 1 to 4, 5 to 13 and 14 to 21 correspond to SET I, SET II and SET III respectively. The entries avg1, avg2 and avg3 in Figures 7, 8 and 9 correspond to the geometric averages of these results for SET I, SET II and SET III respectively; avg corresponds to the average over all netlists.

Figure 7 shows the placement cost for the different floor-planning techniques of the mesh-based FPGA, normalized against the placement cost of the BMR floor-planning technique. The placement cost is the sum of the half-perimeters of the bounding boxes of all the NETS in a netlist. It can be seen from the figure that, on average, BMR gives equal or better results compared to the other techniques for all three sets of benchmarks. On average, CF gives 35%, 35% and 11% more placement cost than BMR for the SET I, SET II and SET III benchmark circuits respectively. Figure 8 shows the channel-width requirements for the different floor-planning techniques, normalized against BMR. A decrease in placement cost does not always yield a channel-width advantage; however, channel-width gains are achieved for many benchmarks. On average, CF requires 13%, 22% and 9% more channel width than BMR for SET I, SET II and SET III respectively. The increase in channel width increases the overall area of the architecture, as shown in Figure 9. In this figure, the area results of the CF and BM floor-planning techniques of the mesh-based FPGA, and of the SYM and ASYM techniques of the tree-based FPGA, are normalized against the area results of the BMR floor-planning technique of the mesh-based FPGA. For the sake of clarity, the results for the A, CP and CM floor-planning techniques are not presented. On average, CF requires 36%, 23% and 10% more area than BMR for SET I, SET II and SET III respectively.
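The half-perimeter placement cost reported in Figure 7 has a compact definition; a minimal illustrative sketch (not the authors' code):

```python
def placement_cost(nets):
    """Sum of the half-perimeters of the bounding boxes of all NETS,
    the cost the PLACER minimises. Each net is a list of (x, y)
    terminal positions (driver plus all receivers)."""
    total = 0
    for terminals in nets:
        xs = [x for x, _ in terminals]
        ys = [y for _, y in terminals]
        # bounding-box width + height = half the rectangle's perimeter
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total
```

Half-perimeter wirelength is a standard placement proxy because it is cheap to update incrementally when a single instance moves: only the nets touching that instance need their bounding boxes recomputed.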



For the SET I benchmark circuits, SYM requires 35% more area than BMR, while ASYM requires 10% more area than BMR. For the SET II benchmark circuits, BMR is on average almost equal to SYM and ASYM. For the SET III benchmark circuits, BMR is worse than SYM and ASYM by 14% and 18% respectively.

The results show that the BMR technique produces the least placement cost, the smallest channel width and hence the smallest area for the mesh-based heterogeneous FPGA. However, the BMR floor-planning technique is dependent upon the target netlists to be mapped upon the FPGA. Such an approach is not suitable for FPGA production, as the floor-planning needs to be fixed before application designs are mapped onto the device. Moreover, the hardware layout of BMR might be un-optimized. In this work, the BMR floor-planning therefore serves as a near-ideal floor-planning against which the other floor-planning techniques are compared. It can also be noted that the results of CF compared to BMR vary depending upon the set of benchmarks used. For the SET I benchmark circuits, where each benchmark uses two or more types of blocks and communication is dominated by HB-HB communication, CF produces worse results than for the other two sets of benchmarks. This is because columns of different HBs are separated by columns of CLBs, so HBs need extra routing resources to communicate with other HBs. In BMR there is no such limitation: HBs communicating with each other can always be placed close to each other. For the other two sets the gap between CF and BMR is smaller; the reduced HB-HB communication in the SET II and SET III benchmark circuits is the major cause of this reduction. The remaining 23% and 10% area difference for SET II and SET III is due to the placement algorithm: in CF, the simulated annealing placement algorithm is restricted to placing the hard-block instances of a netlist at predefined positions. This restriction reduces the quality of the placement solution, and decreased placement quality requires more routing resources to route the netlist; thus more area is required.

For the tree-based FPGA, ASYM produces the best results in terms of area, and it is better than the best technique of the mesh-based FPGA (i.e. BMR) by an average of 5% over the total of 21 benchmarks. The major advantage of a heterogeneous tree-based FPGA is that the maximum number of switches required to route a connection between CLB-HB or HB-HB remains relatively constant. However, for netlists that contain more HB-HB communication (such as SET I), this constant switch requirement does not imply a minimum switch requirement. The architecture floor-planning of the tree-based FPGA does not affect the switch requirement of the architecture, whereas the floor-planning of the mesh-based FPGA has a drastic impact on the switching requirement.

VI. CONCLUSION

This paper has explored two heterogeneous FPGA architectures. Different mesh-based floor-plannings are compared; the floor-planning of the mesh-based FPGA influences the routing network requirement of the architecture. The column-fixed floor-planning consumes on average 36%, 23% and 10% more area than the block-move-rotate floor-planning for the three sets of netlists. The tree-based architecture is independent of its floor-planning; however, its layout is relatively less scalable than that of the mesh architecture. The column-fixed floor-planning is on average 18%, 21% and 40% larger than the tree-based FPGA architectures for the same benchmarks. These area gains will decrease due to the layout inefficiencies of the tree-based architecture, and hardware layout efforts are required to maintain the area benefits on tree-based FPGAs. A mesh containing smaller tree architectures can be designed to resolve the scalability issues of the tree architecture.

REFERENCES

[1] M. Beauchamp, S. Hauck, K. Underwood, and K. Hemmert, “Embedded floating-point units in FPGAs,” in Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays. ACM, 2006, pp. 12–20.
[2] C. Ho, P. Leong, W. Luk, S. Wilton, and S. Lopez-Buedo, “Virtual embedded blocks: A methodology for evaluating embedded elements in FPGAs,” in Proc. FCCM, 2006, pp. 35–44.
[3] G. Govindu, S. Choi, V. Prasanna, V. Daga, S. Gangadharpalli, and V. Sridhar, “A high-performance and energy-efficient architecture for floating-point based LU decomposition on FPGAs,” in Parallel and Distributed Processing Symposium, 2004. Proceedings. 18th International, 2004.
[4] K. Underwood and K. Hemmert, “Closing the gap: CPU and FPGA trends in sustainable floating-point BLAS performance,” in 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, FCCM 2004, 2004, pp. 219–228.
[5] I. Kuon and J. Rose, “Measuring the gap between FPGAs and ASICs,” in Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays. ACM, 2006, pp. 21–30.
[6] Xilinx, http://www.xilinx.com, 2010.
[7] Altera, http://www.altera.com, 2010.
[8] Z. Marrakchi, U. Farooq, and H. Mehrez, “Comparison of a tree-based and mesh-based coarse-grained FPGA architecture,” in ICM ’09, 2009.
[9] G. Lemieux, E. Lee, M. Tom, and A. Yu, “Directional and single-driver wires in FPGA interconnect,” in IEEE Conference on FPT, 2004, pp. 41–48.
[10] H. Parvez, Z. Marrakchi, U. Farooq, and H. Mehrez, “A new coarse-grained FPGA architecture exploration environment,” in Field-Programmable Technology, 2008. FPT 2008. International Conference on, 2008, pp. 285–288.
[11] V. Betz and J. Rose, “VPR: A new packing, placement and routing tool for FPGA research,” International Workshop on FPGA, pp. 213–222, 1997.
[12] C. C. Skiścim and B. L. Golden, “Optimization by simulated annealing: A preliminary computational study for the TSP,” in WSC ’83: Proceedings of the 15th Conference on Winter Simulation. Piscataway, NJ, USA: IEEE Press, 1983, pp. 523–535.
[13] D. Cherepacha and D. Lewis, “DP-FPGA: An FPGA architecture optimized for datapaths,” VLSI Design, vol. 4, no. 4, pp. 329–343, 1996.
[14] G. Karypis and V. Kumar, “Multilevel k-way hypergraph partitioning,” 1999.
[15] Berkeley Logic Synthesis and Verification Group, University of California, Berkeley, “Berkeley Logic Interchange Format (BLIF),” http://vlsi.colorado.edu/vis/blif.ps.
[16] E. M. Sentovich et al., “SIS: A system for sequential circuit analysis,” Tech. Report No. UCB/ERL M92/41, University of California, Berkeley, 1992.
[17] A. Marquardt, V. Betz, and J. Rose, “Using cluster based logic blocks and timing-driven packing to improve FPGA speed and density,” in Proceedings of the International Symposium on Field Programmable Gate Arrays, 1999, pp. 39–46.
[18] L. McMurchie and C. Ebeling, “PathFinder: A negotiation-based performance-driven router for FPGAs,” in Proc. FPGA ’95, 1995.
[19] A. Greiner and F. Pecheux, “Alliance: A complete set of CAD tools for teaching VLSI design,” 3rd Eurochip Workshop, 1992.



DYNAMIC ONLINE RECONFIGURATION OF DIGITAL CLOCK MANAGERS ON XILINX
VIRTEX-II/VIRTEX-II-PRO FPGAS: A CASE STUDY OF DISTRIBUTED POWER
MANAGEMENT

Christian Schuck, Bastian Haetzer, Jürgen Becker

Institut für Technik der Informationsverarbeitung - ITIV


Karlsruher Institut für Technologie (KIT)
Vincenz-Prießnitz-Strasse 1 76131 Karlsruhe

ABSTRACT

Xilinx Virtex-II family FPGAs provide an advanced low-skew clock distribution network with numerous global clock nets to support high-speed mixed-frequency designs. Digital Clock Managers in combination with Global Clock Buffers are in place to generate the desired frequencies and to drive the clock networks from different sources. Currently almost all designs run at a fixed clock frequency determined statically at design time. Such systems cannot take full advantage of partial and dynamic self-reconfiguration. Therefore, this paper introduces a new methodology that allows the implemented hardware to dynamically adapt its clock frequency during runtime by dynamically reconfiguring the Digital Clock Managers. Inspired by nature, this self-adaptation is done completely decentralized. Figures for reconfiguration performance and power savings are given, and the trade-offs in reconfiguration effort for this method are evaluated. The results show the high potential and importance of the distributed DFS method, which comes with little additional overhead.

[Figure 1 shows a heterogeneous array of Organic Processing Cells (OPCs) sharing a common structure but with specific functionality (FPGA, DSP, I/O, memory, monitor and µProcessor cells), connected by the artNoC (broadcast, real-time, adaptive routing) and to peripheral devices.]

Fig. 1. DodOrg organic hardware architecture

1. INTRODUCTION

Xilinx Virtex FPGAs have been designed with high-performance applications in mind. They feature several dedicated Digital Clock Managers (DCMs) and digital clock buffers for solving high-speed clock distribution problems. Multiple clock nets are supported to enable highly heterogeneous mixed-frequency designs. Usually all clock frequencies for the single clock nets and the parameters for the DCMs are determined at design time through static timing analysis. Targeting maximum performance, these parameters strongly depend on the longest combinatorial path (critical path) between two storage elements of the design unit they are driving. For minimum power, the required throughput of the design unit determines the lower boundary of the possible clock frequency. In both cases, non-adjusted clock frequencies lead to a high waste of either processing power or energy [9][11].

Considering the feature of partial and dynamic self-reconfiguration of Xilinx Virtex FPGAs during runtime, a high degree of dynamism and flexibility arises, and static analysis methods are no longer able to sufficiently determine an adjusted clock frequency at design time. The moment a new partial module is reconfigured onto the FPGA grid, its critical path changes, and in turn the clock frequency has to be adjusted during runtime to fit the new critical path. On the other hand, the throughput requirements of the application or the environmental conditions may change over time, likewise making an adjustment of the clock frequency necessary.

Therefore, a new paradigm of system design is necessary to efficiently utilize the available processing power of future chip generations. To address this issue, the Digital on Demand Computing Organism (DodOrg) was proposed in [1], which is derived from a biological organism. Decentralisation of all system instances is the key feature to reach the desired goals of self-organisation, self-adaptation and self-healing, in short the self-x features.
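The waste caused by a non-adjusted clock follows directly from the first-order CMOS dynamic power model. The sketch below is textbook background, not a measurement or model from this paper; all parameter values are illustrative:

```python
def dynamic_power(alpha, c_eff, vdd, freq):
    """First-order CMOS dynamic power: P = alpha * C_eff * Vdd^2 * f,
    where alpha is the switching activity, C_eff the effective switched
    capacitance, Vdd the supply voltage and f the clock frequency.
    Frequency scaling alone therefore reduces dynamic power linearly."""
    return alpha * c_eff * vdd ** 2 * freq

# Illustrative numbers: halving the clock halves the dynamic power.
p_full = dynamic_power(0.1, 1e-9, 1.5, 100e6)
p_half = dynamic_power(0.1, 1e-9, 1.5, 50e6)
```

Since an FPGA's supply voltage is fixed, frequency is the main runtime knob, which is exactly the lever the distributed DFS scheme in this paper pulls.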



Hence the hardware architecture of the DodOrg system consists of many heterogeneous so-called Organic Processing Cells (OPCs) that communicate through the artNoC [3] router network, as shown in Figure 1. In general, all OPCs are made from the same blueprint: on the one side they contain a highly adaptive heterogeneous data path, and on the other side several common units that are responsible for the organic behaviour of the architecture. Among them, a Power-Management-Unit is able to perform dynamic frequency scaling (DFS) at OPC level. It can thus control and adjust the performance and power consumption of the cell according to the actual computational demands of the application and the critical path of the cell's data path. DFS has a high potential, as it decreases the dynamic power consumption by decreasing the switching activity of flip-flops, of gates in the fan-out of flip-flops, and of the clock trees. For this purpose, the cell's clock domain is decoupled by the network interface and can operate independently from the artNoC and the other OPCs of the organic chip.

In [5] we presented a prototype implementation of the DodOrg architecture on a Virtex FPGA, where it is possible to dynamically change the cell's data path through 2D partial and dynamic reconfiguration. For this, a novel IP core, the artNoC-ICAP-Interface, was developed in order to perform fast two-dimensional self-reconfiguration and to provide a virtual decentralisation of the internal FPGA configuration access port (ICAP). This paper enhances the methodology by enabling the partial and dynamic self-reconfiguration of the Virtex DCMs, which is inherently not given, through the artNoC-ICAP-Interface. Thereby the desired self-adaptation with respect to a fine-grained power management can be achieved.

The rest of the paper is organized as follows. Section 2 reviews several other proposals for DFS on FPGAs, while Section 3 summarizes important aspects of the Xilinx Virtex-II FPGA clock net infrastructure. Section 4 describes the details of our approach to dynamically reconfigure the DCMs during runtime, before Section 5 presents reconfiguration performance and power saving figures. Finally, Section 6 concludes the work and gives an outlook on future work.

2. RELATED WORK

Recently, several works have been published dealing with power management and especially clock management on FPGAs. All authors agree that there is a high potential for using DFS methods in both ASIC and FPGA designs [7][11]. In [9] the authors show that, because of FPGA process variations and changing environmental conditions (hot, normal and cold temperatures), dynamically clocked designs can achieve a speed improvement of up to 86% compared to using a fixed clock statically estimated at design time. The authors use an external programmable clock generator that is controlled by a host PC. However, in order to enable the system to self-adapt its clock frequency, on-chip solutions are required.

In [7] the authors proposed an online solution for clock gating: a feedback multiplexer with control logic in front of the registers, which makes it possible to keep the register value and to prevent the combinatorial logic behind the register from toggling. At the same time, they highlight that clock gating on FPGAs could have a much higher power-saving efficiency if it were possible to completely gate the FPGA clock tree. To overcome this drawback, the authors of [8] provide an architectural block that is able to perform DFS. However, this approach leads to low-speed designs and clock skew problems, as it requires user logic to be inserted into the clock network.

We show that on Xilinx Virtex-II no additional user logic is necessary to efficiently and reliably perform fine-grained self-adaptive DFS, so all advantages of the high-speed clock distribution network are maintained.

3. XILINX CLOCK ARCHITECTURE

This section gives a brief overview of the Xilinx Virtex-II clock architecture, as our work makes extensive use of the provided features.

Fig. 2. Xilinx Virtex Clock distribution network

3.1. Clock Network Grid

Besides normal routing resources, Xilinx Virtex-II FPGAs have a dedicated low-skew high-speed clock distribution network [4][6]. They feature 16 global clock buffers (BUFGMUX, see Section 3.3) and support up to 16 global clock domains (Figure 2).



per quadrant. Eight clock buffers are in the middle of the 4. ORGANIC SYSTEM ARCHITECTURE
top edge and eight are in the middle of the bottom edge. In
principle any of these 16 clock buffer outputs can be used Compared to μC ASIC solutions SRAM based FPGAs like
in any quadrant as long as opposite clock buffers on the top Virtex-II consume a multiple of power. This is due to the
and on the bottom are not used in the same quadrant, i.e. fine grained flexibility and adaptability and the involved
there is no conflict [6]. In addition, device dependant, up to overhead. By just using these features during design time to
12 DCMs are available. They can be used to drive clock create a static design, most of the potential remains unused.
buffers with different clock frequencies. In the following Instead dynamic and partial online self-reconfiguration
important features of the DCMs and clock buffers will be summarized.

3.2. Digital Clock Managers

Among other features, frequency synthesis is an important capability of the DCMs. For this purpose, two main programmable outputs are available. CLKDV provides an output frequency that is a fraction (÷1.5, ÷2, ÷2.5, …, ÷7, ÷7.5, ÷8, ÷9, …, ÷16) of the input frequency CLKIN. CLKFX produces an output frequency that is synthesised from a specified integer multiplier M ∈ {2…32} and a specified integer divisor D ∈ {1…32} according to CLKFX = (M/D) · CLKIN.

3.3. Global Clock Buffer

Global clock buffers have three different functionalities. In addition to pure clock distribution, they can be configured as a global clock buffer with a clock enable (BUFGCE); the clock can then be stopped at any time at the clock buffer output. Further, clock buffers can be configured to act as a "glitch-free" synchronous 2:1 multiplexer (BUFGMUX). These multiplexers are capable of switching between two clock sources at any time, using a select input that can be driven by user logic. No particular phase relation between the two clocks is needed. For example, as shown in Figure 3, they can be configured to switch between the CLKFX outputs of two DCMs. As we will see in the next section, our design makes use of this feature.

Fig. 3. Example BUFGMUX/DCM configuration

Exploiting these clocking features during runtime is a promising approach to exploit the full potential and even to close the energy gap. Therefore, in [5] we proposed to implement the OPC based organic computing organisms on a Virtex-II Pro FPGA as shown in Figure 4. This paper focuses on the power-related issues of the cell-based DodOrg architecture on the FPGA prototype. Important aspects towards the desired goal of a fine-grained, decentralized, self-adaptive power management are discussed in the subsequent sub-sections.

Fig. 4. DodOrg FPGA Floorplan / Clock Architecture

4.1. Clock Partitioning

Depending on the size of the device, several OPCs are mapped onto a single FPGA (Figure 4). The clock net of the highly adaptive data path (DP) of every OPC is connected to a BUFGMUX that is driven by a pair of DCMs. Every OPC has its own power management unit (PMU), which is connected to the select input of the BUFGMUX, so it can quickly choose between the two DCM clock sources. The DP clock is decoupled from the artNoC clock by a dual-ported dual-clock FIFO buffer. Further, the PMU is connected to the artNoC and is thus able to exchange power-related information with the other PMUs. Beyond that, it has access to the artNoC-ICAP-
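To illustrate the CLKFX relation CLKFX = (M/D) · CLKIN with M ∈ {2…32} and D ∈ {1…32}, here is a minimal sketch of a parameter search; the function name and search strategy are ours, not part of the DodOrg design:

```python
# Sketch: choosing DCM CLKFX parameters (M, D) for a target frequency.
# CLKFX = (M / D) * CLKIN, with M in {2..32} and D in {1..32} as stated above.

def best_clkfx_params(f_in_mhz, f_target_mhz):
    """Exhaustively search (M, D), minimizing the output-frequency error."""
    best = None
    for m in range(2, 33):          # integer multiplier M
        for d in range(1, 33):      # integer divisor D
            f_out = f_in_mhz * m / d
            err = abs(f_out - f_target_mhz)
            if best is None or err < best[0]:
                best = (err, m, d, f_out)
    return best[1], best[2], best[3]

m, d, f = best_clkfx_params(100.0, 6.25)   # 100 MHz board clock
print(m, d, f)   # M=2, D=32 yields exactly 6.25 MHz, as seen in Figure 5
```

With the same 100 MHz input, a target of 8.33 MHz resolves to M = 2, D = 24 (8.333 MHz), matching the second frequency used in the measurement.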

ReCoSoC 2010 May 17-19, 2010, Karlsruhe, Germany 47


Interface. Therefore, during runtime every PMU can
dynamically adapt the DCM CLKFX output clock
frequency through partial self-reconfiguration by using the
features of the artNoC-ICAP-Interface.

4.2. artNoC-ICAP-Interface

The artNoC-ICAP-Interface is a small and lightweight IP, which on the one side acts as a wrapper around the native
hardware ICAP and on the other side connects to the
artNoC network. It provides a virtual decentralisation of the
ICAP as well as an abstraction of the physical
reconfiguration memory. Its main purpose is to perform the
Readback-Modify-Writeback (RMW) method in hardware.
Therefore, a fast and true two-dimensional reconfiguration of all FPGA resources is possible, i.e. reconfiguration is no longer restricted to columns. Due to its partitioning into two clock domains, one for the artNoC controller side and one for the ICAP side, maximal reconfiguration performance could be achieved [5].
As every bit within the reconfiguration memory can be reconfigured independently, the configuration of the DCMs can be altered as well. However, a special procedure is necessary, which is described in the next paragraph.

4.3. DCM Reconfiguration Details

During reconfiguration of DCMs it is important that glitchless switching from one clock frequency to another can be guaranteed. In general, after initial setup the CLKDV and CLKFX outputs are only enabled once a stable clock is established. After that, the DCM stays locked to the configured frequency as long as the jitter of the input clock CLKIN stays within a given tolerance range [6]. For our scenario we assume that the input clock is stable.
If we change the DCM configuration (D, M) in configuration memory to switch from one clock frequency to another while the DCM is locked, it loses the lock and no stable output, i.e. no output at all, can be guaranteed. Therefore, to ensure consistent locking to the new frequency, the following steps have to be performed:
1. Stop the DCM by writing a zero configuration (D = 0, M = 0).
2. Write the new configuration (D = Dnew, M = Mnew).
To simplify the handling of DCM reconfiguration, this two-step procedure is executed internally by the artNoC-ICAP-Interface. It therefore features a special DCM addressing mode for easy access to the DCM configuration.
Figure 5 shows a plot of a DCM reconfiguration procedure performed by the artNoC-ICAP-Interface. The plots were recorded by a 4-channel digital oscilloscope with all important signals routed to FPGA pins. Figure 5 a) shows the ICAP enable signal that is asserted by the artNoC-ICAP-Interface during ICAP read and write operations. It is an indicator of the overall duration of the reconfiguration procedure, which strongly depends on the device size, or rather on the configuration frame length. In this case a Virtex-II Pro XC2VP30 device with a frame length of 824 bytes was used. For reconfiguration of the DCM, just a single configuration frame has to be processed. From the beginning of the icap_enable low phase to the spike in Figure 5 a), the configuration frame is read back from the configuration memory. Then the ICAP is configured to write mode, and the zero configuration to shut off the DCM is written, followed by a dummy frame to flush the ICAP input register. As soon as the writing of the dummy frame is finished, the DCM stops. Figure 5 c) shows a zoom of the DCM_CLKFX output (Figure 5 b)) at this point in time. We see that DCM_CLKFX was running at 6.25 MHz and stops without any jitter or spikes. Immediately after the dummy frame, the read-back frame, which has been merged with the new DCM parameters, is written back to the ICAP, followed by a second dummy frame. As soon as that dummy frame is processed, the DCM_CLKFX output runs with the new frequency, in this case 8.33 MHz. Figure 5 d) shows a zoom of this point in time. Again, no glitches or spikes occur. The overall processing time for a complete DCM reconfiguration in this case is 60.7 μs. In general, the reconfiguration time for a different Virtex-II family device is given by:

tDCM ≈ frame_length [Byte] * 2 / 67.7 [Byte/μs] + frame_length [Byte] * 4 / 90.5 [Byte/μs]   (at ICAP_CLK = 100 MHz)

The two summands in the formula result from the fact that the ICAP has different throughputs for reading and writing reconfiguration data [5].

Fig. 5 DCM reconfiguration performance

Therefore, this procedure presents a safe method to dynamically reconfigure DCMs during runtime. However, even if self-adaptive decentralized DFS can be realized with the presented method, two main drawbacks are obvious:
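The reconfiguration-time estimate can be checked numerically; this is a minimal transcription of the two-summand formula, assuming the throughput figures quoted in the text (67.7 Byte/μs for reads, 90.5 Byte/μs for writes at ICAP_CLK = 100 MHz):

```python
# Sketch of the reconfiguration-time estimate from the text:
# t_DCM ~ frame_len * 2 / 67.7 + frame_len * 4 / 90.5   [in microseconds]

def t_dcm_us(frame_len_bytes):
    read_part = frame_len_bytes * 2 / 67.7    # readback of the frame
    write_part = frame_len_bytes * 4 / 90.5   # writes incl. dummy frames
    return read_part + write_part

# XC2VP30 frame of 824 bytes: roughly 60.8 us, consistent with the
# measured 60.7 us reported for the complete DCM reconfiguration.
print(round(t_dcm_us(824), 1))
```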



• Relatively long setup delay until the new frequency is valid (in this example: 60.7 μs).
• Interruption of the clock frequency during reconfiguration (in this example: 18.2 μs).
This means that the method is appropriate for reaching long-term or intermediate-term power management goals, i.e. a new data path is configured, the clock frequency is adapted to its critical path, and then stays constant until a new data path is required. But if frequent and immediate switching is necessary, e.g. when data arrives in bursts and between bursts the OPC wants to toggle between shut-off (fDP = 0 Hz) and maximal performance (fDP = fmax), the method needs to be extended.
In this case a setup consisting of two DCMs and a BUFGMUX, as shown in Figure 3, can be chosen. The select input of the BUFGMUX is connected to the PMU of the OPC. Therefore, it is able to toggle between two frequencies immediately, without any delay, as shown in Figure 6. Further, the interruption of the clock frequency during reconfiguration can be hidden. By a combination of both techniques, a broad spectrum of different clock frequencies as well as immediate uninterrupted switching is available.

Fig. 6 BUFGMUX clock switching

5. RESULTS

In the preceding section, results for reconfiguration times and tradeoffs have already been presented. This section evaluates the potential of power savings and performance enhancements in the context of module-based partial online reconfiguration. In particular, the overhead in terms of area and power consumption introduced by the approach (PM, artNoC-ICAP-Interface, DCM) is taken into account.

5.1. Test Setup Power Measurement

We calculated the power consumption by measuring the voltage drop over an external shunt resistor (0.4 Ohm) on the FPGA core voltage (FPGA_VINT). As a test system, the Xilinx XUP board with a Virtex-II Pro (XC2VP30) device was used. For all measurements, the board source clock of 100 MHz was used as the input clock to the design.
To isolate the portions of power consumption, as shown in Table 1, several distinct designs have been synthesised. For the DCM measurement, an array of toggle flip-flops at 100 MHz with and without a DCM in the clock tree has been recorded, and the difference of both values has been taken. For extracting the ICAP power consumption, a system consisting of PM, artNoC-ICAP-Interface and ICAP instance, and a second identical system without the ICAP instance, have been implemented. After activation, the PM sends bursts of two complete alternating configuration frames targeting the same frame in configuration memory. The ratio of toggling bits between the two frames is 80% and is considered to be representative for a partial reconfiguration. Therefore, before PM activation the "passive" power and after activation the "active" power could be measured. Again, the difference in power consumption of the two systems was taken to extract the ICAP portion. The other components were measured with the same methodology.

Component         "Passive" Power [mW]   "Active" Power [mW]
static_offset     -                      11
DCM               -                      37
PM                <1                     <1
artNoC-ICAP-IF    <1                     9
ICAP              69                     76
Tab. 1 Component Power Consumption

Thus, for example, all components necessary to implement the approach presented in section



4.3 with two DCMs + BUFGMUX consume 196 mW when active, i.e. 180 mW when passive. But it has to be considered that the artNoC-ICAP-Interface as well as the ICAP are also used for partial 2D reconfiguration.

5.2. Area and Resource Utilization

Resource          Number   Percentage
Slices            364      2%
Slice FlipFlops   177      0%
4 input LUTs      664      2%
BRAMs             1        0%
MULT18x18s        2        1%
Tab. 2 Resource Utilization artNoC-ICAP-Interface

The resource requirements for the artNoC-ICAP-Interface with DCM reconfiguration mode are shown in Table 2.

5.3. Power Performance Evaluation

To put the previous power figures into context, we determined the power consumption of a MicroBlaze soft-core processor at different clock frequencies, as shown in Figure 7. As we can see, there is a high potential for power savings (for example, the difference in power consumption in idle state between 100 MHz and 50 MHz is 170 mW). The overhead (ICAP + artNoC-ICAP-IF) for DCM reconfiguration in a static design is in the range of a MicroBlaze operating at 20 MHz. As expected, there is a linear dependency between clock frequency and power consumption, i.e. P = c · fclk. Therefore, the energy consumed per clock cycle, E = P/fclk = c, is constant for all clock frequencies. This means that, in terms of power savings, there is no point in using reconfiguration of DCMs for a static data path; a setup of a DCM at fmax and a BUFGCE to toggle between f = fmax and f = 0 is most appropriate. In terms of performance, DCM reconfiguration can be used to evaluate the maximum clock frequency during runtime.
In turn, in a dynamic scenario where the data path and therefore also the critical path changes, DCM reconfiguration is necessary to achieve maximum module performance. It also comes without any additional overhead, as ICAP + artNoC-ICAP-IF + DCM are already needed for reconfiguration. The capability of DCM reconfiguration together with BUFGMUX provides the basis for fine-grained short- or long-term power management strategies.

6. SUMMARY AND FUTURE WORK

In this paper we have presented a novel methodology to dynamically reconfigure Digital Clock Managers on Xilinx Virtex-II devices through ICAP. On the one side optimal performance of partial modules, and on the other side the goal of uniform power consumption, can be achieved without external hardware. Our measurements show that the power consumed by the components of the proposed hardware framework, especially the DCMs themselves, is not negligible and has to be counterweighted. With DCM reconfiguration times in the range of 60 μs, long-term power management goals can be reached. We also provide figures for reconfiguration times as well as resource utilization. Future work targets the examination of the system-level power saving effect resulting from distributed power management with multiple PMs and multiple clock domains.

7. REFERENCES

[1] Jürgen Becker et al., "Digital On-Demand Computing Organism for Real-Time Systems", in Wolfgang Karl et al. (eds.), Workshop Proceedings of the 19th International Conference on Architecture of Computing Systems (ARCS'06).
[2] U. Brinkschulte, M. Pacher and A. von Renteln, "An Artificial Hormone System for Self-Organizing Real-Time Task Allocation in Organic Middleware", Springer, 2007.
[3] C. Schuck, S. Lamparth and J. Becker, "artNoC - A Novel Multi-Functional Router Architecture for Organic Computing", International Conference on Field Programmable Logic and Applications (FPL), 2007.
[4] Xilinx, "Virtex-II Pro and Virtex-II Pro X Platform FPGAs: Complete Data Sheet", DS083 (v4.7), November 5, 2007.
[5] C. Schuck, B. Haetzer and J. Becker, "An interface for a decentralized 2D-reconfiguration on Xilinx Virtex-FPGAs for organic computing", Proc. Reconfigurable Communication-centric SoCs (ReCoSoC 2008), ISBN: 978-84-691-3603-4.
[6] Xilinx, "Virtex-II Pro and Virtex-II Pro X FPGA User Guide", UG012 (v4.2), November 5, 2007.
[7] Y. Zhang, J. Roivainen and A. Mämmelä, "Clock-Gating in FPGAs: A Novel and Comparative Evaluation", Proc. 9th Euromicro Conference on Digital System Design (DSD'06).
[8] I. Brynjolfson and Z. Zilic, "Dynamic Clock Management for Low Power Applications in FPGAs", Proc. Custom Integrated Circuits Conference, 2000.
[9] J.A. Bower, W. Luk, O. Mencer, M.J. Flynn and M. Morf, "Dynamic clock-frequencies for FPGAs", Microprocessors and Microsystems, Vol. 30, Issue 6, September 2006, pp. 388-397, Special Issue on FPGAs.
[10] B. Fechner, "Dynamic delay-fault injection for reconfigurable hardware", Proc. 19th International Parallel and Distributed Processing Symposium (IPDPS 2005).
[11] I. Brynjolfson and Z. Zilic, "FPGA Clock Management for Low Power", Proc. International Symposium on FPGAs, 2000.
[12] M. Huebner, C. Schuck and J. Becker, "Elementary block based 2-dimensional dynamic and partial reconfiguration for Virtex-II FPGAs", 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), April 25-29, 2006.



Practical Resource Constraints for Online Synthesis
Stefan Döbrich, Chair for Embedded Systems, University of Technology Dresden, Germany (stefan.doebrich@inf.tu-dresden.de)
Christian Hochberger, Chair for Embedded Systems, University of Technology Dresden, Germany (christian.hochberger@inf.tu-dresden.de)

Abstract—Future chip technologies will change the way we deal with hardware design. First of all, logic resources will be available in vast amounts. Furthermore, engineering specialized designs for particular applications will no longer be the general approach, as the non-recurring expenses will grow tremendously. Reconfigurable logic in the form of FPGAs and CGRAs has often been promoted as a solution to these problems. We believe that online synthesis that takes place during the execution of an application is one way to broaden the applicability of reconfigurable architectures, as no expert knowledge of synthesis and technologies is required. In this paper we show that even a moderate amount of reconfigurable resources is sufficient to speed up applications considerably.
Index Terms—online synthesis, adaptive computing, reconfigurable architecture, CGRA, AMIDAR

I. INTRODUCTION

Following the road of Moore's law, the number of transistors on a chip doubles every 24 months. After being valid for more than 40 years, the end of Moore's law has been forecast many times, but technological advances have kept the progress intact.
Further shrinking of the feature size of traditionally manufactured chips will lead to exponentially increased mask costs. This makes it prohibitively expensive to produce small quantities of chips for a particular design. Also, the question comes up how to make use of the vast amounts of resources without building individual chip designs for each application.
Reconfigurable logic in different granularities has been proposed to solve both problems [16]. It allows us to build large quantities of chips and yet use them individually. Field programmable gate arrays (FPGAs) have been in use for this purpose for more than two decades. Yet, it requires much expert knowledge to implement applications or parts of them on an FPGA. Also, reconfiguring FPGAs takes a lot of time due to the large amount of configuration information.
Coarse Grain Reconfigurable Arrays (CGRAs) try to solve this last problem by working on word level instead of bit level. The amount of configuration information is dramatically reduced, and also the programming of such architectures can be considered more software-style. The problem with CGRAs is typically the tool situation. Currently available tools require an adaptation of the source code and typically have very high runtimes, so that they need to be run by experts and only for very few selected applications.
Our approach tries to make the reconfigurable resources available for all applications in the embedded systems domain. Thus, synthesis of accelerating circuits takes place during the application's execution. No hand-crafted adaptation of the source code shall be required, although it is clear that manual fine-tuning of the code can lead to better results.
In this contribution we want to show that even a relatively small amount of reconfigurable resources is sufficient for a substantial application acceleration.
The remainder of this paper is organized as follows. In section II we will give an overview of related work. In section III we will present the model of our processor, which allows an easy integration of synthesized functional units at runtime. In section IV we will detail how we figure out the performance-sensitive parts of the application by means of profiling. Section V explains our online synthesis approach. Results for some benchmark applications are presented in section VI. Finally, we give a short conclusion and an outlook onto future work.

II. RELATED WORK

Reconfigurable logic for application improvement has been used for more than two decades. A speedup of 1000 and more could be achieved consistently during this period. Examples range from the CEPRA-1X for cellular automata simulation [10] to implementations of the BLAST algorithm [15]. Unfortunately, these speedups require highly specialized HW architectures and domain-specific modelling languages.
A survey of the early approaches using reconfigurable logic for application speedup can be found in [3].
Static transformation from high-level languages like C into fine grain reconfigurable logic is still the research focus of a number of academic and commercial research groups. Only very few of them support the full programming language [11].
Efficient static transformation from high-level languages into CGRAs is also investigated by several groups. The DRESC [14] tool chain targeting the ADRES [13][17] architecture is one of the most advanced tools. Yet, it requires hand-written annotations to the source code and in some cases even some hand-crafted rewriting of the source code. Also, the compilation times easily get into the range of days.
The RISPP architecture [1] lies between static and dynamic approaches. Here, a set of candidate instructions is evaluated at compile time. These candidates are implemented dynamically at runtime by varying sets of so-called atoms. Thus, alternative design points are chosen depending on the actual execution characteristics.



Dynamic transformation from software to hardware has been investigated already by other researchers. Warp processors dynamically transform assembly instruction sequences into fine grain reconfigurable logic [12]. Furthermore, dynamic synthesis of Java bytecode has been evaluated [2]. Nonetheless, this approach is only capable of synthesizing combinational hardware.
The token distribution principle of AMIDAR processors has some similarities with Transport Triggered Architectures [4]. Yet, in TTAs an application is transformed directly into a set of tokens. This leads to a very high memory overhead and makes an analysis of the executed code extremely difficult.

III. THE AMIDAR PROCESSING MODEL

In this section, we will give an overview of the AMIDAR processor model. We describe the basic principles of operation, which includes the architecture of an AMIDAR processor in general, as well as specifics of its components. Furthermore, we discuss the applicability of the AMIDAR model to different instruction sets. Finally, several mechanisms of the model that allow the processor to adapt to the requirements of a given application at runtime are shown.

A. Overview

An AMIDAR processor consists of three main parts: a set of functional units, a token distribution network and a communication structure. Two functional units that are common to all AMIDAR implementations are the code memory and the token generator. As its name tells, the code memory holds the application's code. The token generator controls the other components of the processor by means of tokens. Therefore, it translates each instruction into a set of tokens, which is distributed to the functional units over the token distribution network. The tokens tell the functional units what to do with input data and where to send the results. Specific AMIDAR implementations may allow the combination of the code memory and the token generator into a single functional unit. This would allow the utilization of side effects like instruction folding. Functional units can have a very wide range of meanings: ALUs, register files, data memory, etc. Data is passed between functional units over the communication structure. This data can have various meanings: instructions, address information, or application data. Figure 1 sketches the abstract structure of an AMIDAR processor.

Fig. 1. Abstract Model of an AMIDAR Processor

B. Principle of Operation

Execution of instructions in AMIDAR processors differs from other execution schemes. Neither microprogramming nor explicit pipelining are used to execute instructions. Instead, instructions are broken down into a set of tokens which are distributed to a set of functional units. These tokens are 5-tuples, where a token is defined as T = {UID, OP, TAG, DP, INC}. It carries the information about the type of operation (OP) that shall be executed by the functional unit with the specified id (UID). Furthermore, the version information of the input data (TAG) that shall be processed and the destination port of the result (DP) are part of the token. Finally, every token contains a tag increment flag (INC). By default, the result of an operation is tagged equally to the input data. In case the INC flag is set, the output tag is increased by one. The token generator can be built such that every functional unit which shall receive a token is able to receive it in one clock cycle. A functional unit begins the execution of a specific token as soon as the data ports have received the data with the corresponding tag. Tokens which do not require input data can be executed immediately. Once the appropriately tagged data is available, the operation starts. Upon completion of an operation, the result is sent to the destination that was denoted in the token. An instruction is completed when all of its tokens are executed. To keep the processor executing instructions, one of the tokens must trigger the sending of a new instruction to the token generator. A more detailed explanation of the model can be found in [7].

C. Applicability

In order to apply the presented model to an instruction set, a composition of microinstructions has to be defined for each instruction. Overlapping execution of instructions comes automatically with this model. Thus, it can best be applied if dependencies between consecutive instructions are minimal. The great advantage of this model is that the execution of an instruction depends on the token sequence, and not on the timing of the functional units. Hence, functional units can be replaced at runtime with other versions of different characterizations. The same holds for the communication structure, which can be adapted to the requirements of the running application. Intermediate virtual assembly languages like Java bytecode, LLVM bitcode or the .NET common intermediate language are good candidates for instruction sets.
The structure of a minimum implementation of an AMIDAR-based Java processor is sketched in figure 2. Firstly, it contains the mandatory functional units token generator and code memory. In case of a Java machine, the code memory holds all class files and interfaces, as well as their corresponding constant pools and attributes. Additionally, the processor contains several memory functional units. These units realize the operand stack, the object heap, the local variable memory and the method stack. In order to process arithmetic operations, the processor shall contain at least one ALU functional unit. Nonetheless, it is possible to separate integer and floating point operations into two disjoint functional units, which improves the throughput.
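The token 5-tuple defined in section III-B and its tagging rule can be sketched as a small data structure; the names and the helper below are illustrative, not the paper's implementation:

```python
# Illustrative sketch of an AMIDAR token T = {UID, OP, TAG, DP, INC}.
from dataclasses import dataclass

@dataclass(frozen=True)
class Token:
    uid: int   # id of the functional unit that executes the operation
    op: str    # operation to be performed
    tag: int   # version of the input data this token waits for
    dp: int    # destination port for the result
    inc: bool  # tag increment flag: if set, the result tag is input tag + 1

def result_tag(token: Token) -> int:
    """By default the result is tagged like the input; INC increments it."""
    return token.tag + 1 if token.inc else token.tag

t = Token(uid=3, op="iadd", tag=7, dp=1, inc=True)
print(result_tag(t))  # 8
```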



Fig. 2. Model of a Java (non)Virtual Machine on AMIDAR Basis

a jump unit which processes all conditional jumps. Therefore, the condition is evaluated, and the resulting jump offset is transferred to the code memory.
The range of functional unit implementations and communication structures is especially wide if the instruction set has a very high abstraction level and/or basic operations are sufficiently complex. Finally, the data driven approach makes it possible to easily integrate new functional units and create new instructions to use these functional units.

D. Adaptivity in the AMIDAR Model

The AMIDAR model exposes different types of adaptivity. All adaptive operations covered by the model are intended to dynamically respond to the running application's behavior. Therefore, we identified adaptive operations that adapt the communication structure to the actual interaction scheme between functional units. As a functional unit may be the bottleneck of the processor, we included similar adaptive operations for functional units. The following subsections will give an overview of the adaptive operations provided by the AMIDAR model. Most of the currently available reconfigurable devices do not fully support the described adaptive operations (e.g. addition or removal of bus structures). Yet, the model itself contains these possibilities, and so may benefit from future hardware designs. In previous work [7] we have given a detailed overview of these adaptive operations, so this paper provides a short overview only.

E. Adaptive Communication Structures

The communication structure can minimize the bus conflicts that occur during the data transports between functional units. In order to react to the communication characteristics of any given application, functional units may be connected/disconnected to/from a bus structure. This can happen as part of an evasion to another bus structure in case of congestion, as well as the creation of a completely new interconnection. Furthermore, bus structures may be split or folded with the objective of a more effective communication. In [9] we have shown how to identify the conflicting bus taps, and we have also shown a heuristic to modify the bus structure to minimize the conflicts.

F. Adaptive Functional Units

Three different categories of adaptive operations may be applied to functional units. Firstly, variations of a specific functional unit regarding chip size, latency or throughput may be available. The most appropriate implementation is chosen dynamically at runtime and may change throughout the lifetime of the application. Secondly, the number of instances of a specific functional unit may be increased or decreased dynamically. This is an alternative way to respond to changing throughput requirements. Finally, dynamically synthesized functional units may be added to the processor's datapath. It is possible to identify heavily utilized instruction sequences of an application at runtime (see section IV). Suitable code sequences can be transformed into functional units by means of online synthesis. These functional units would replace the software execution of the related code.

G. Synthesizing Functional Units in AMIDAR

AMIDAR processors need to include some reconfigurable fabric in order to allow the dynamic synthesis and inclusion of functional units. Since fine grained logic (like FPGAs) requires large amounts of configuration data to be computed, and also since the fine grain structure is neither required nor helpful for the implementation of most code sequences, we focus on CGRAs for the inclusion into AMIDAR processors.
The model includes many features to support the integration of synthesized functional units into the running application. It allows bulk data transfers from and to data memories, as well as the synchronization of the token generator with operations that take multiple clock cycles. Finally, synthesized functional units are able to inject tokens in order to influence the data transport required for the computation of a code sequence.

IV. RUNTIME APPLICATION PROFILING

A major task in synthesizing hardware functional units for AMIDAR processors is runtime application profiling. This allows the identification of candidate instruction sequences for hardware acceleration. Plausible candidates are the runtime-critical parts of the current application.
In previous work [8] we have shown a profiling algorithm and a corresponding hardware implementation which generates detailed information about every executed loop structure. Those profiles contain the total number of executed instructions inside the affected loop, the loop's start program counter, its end program counter and the total number of executions of this loop. The profiling circuitry is also capable of profiling nested loops, not only simple ones.
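The loop profiles just described, and the threshold test that turns a profile into a synthesis candidate, can be sketched as follows; the field names and the threshold value are illustrative assumptions, not taken from the hardware implementation in [8]:

```python
# Sketch of per-loop profile records and candidate selection by threshold.
from dataclasses import dataclass

@dataclass
class LoopProfile:
    start_pc: int               # loop start program counter
    end_pc: int                 # loop end program counter
    executions: int             # how often the loop was executed
    executed_instructions: int  # total instructions executed inside the loop

def synthesis_candidates(profiles, threshold):
    """A loop qualifies once its executed-instruction count exceeds the threshold."""
    return [p for p in profiles if p.executed_instructions > threshold]

profiles = [
    LoopProfile(start_pc=0x10, end_pc=0x2c, executions=4, executed_instructions=120),
    LoopProfile(start_pc=0x40, end_pc=0x8f, executions=5000, executed_instructions=250000),
]
print(len(synthesis_candidates(profiles, 10000)))  # 1: only the hot loop qualifies
```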



A profiled loop structure becomes a synthesis candidate in case its number of executed instructions exceeds a given threshold. The size of this threshold can be configured dynamically for each application.
Furthermore, an instruction sequence has to match specific constraints in order to be synthesized. Currently, we are not capable of synthesizing code sequences containing the following instruction types, as our synthesis algorithm has not evolved to this point yet:
• memory allocation operations
• exception handling
• thread synchronization
• some special instructions, e.g. lookupswitch
• access operations to multi-dimensional arrays
• method invocation operations
From this group, only access to multi-dimensional arrays and method invocations are important from a performance perspective.
Multi-dimensional arrays do actually occur in compute kernels. Access operations on these arrays are possible in principle in the AMIDAR model. Yet, multi-dimensional arrays are organized as arrays of arrays in Java. Thus, access operations need to be broken down into a set of stages (one for each dimension), which is not yet supported by our synthesis algorithm. Nevertheless, a manual rewrite of the code is possible to map multi-dimensional arrays to one dimension.
Similarly, method inlining can be used to enable the synthesis of code sequences that contain method invocations. Techniques for method inlining are known from JIT compilers that preserve the polymorphism of the called method. Yet, these techniques require aborting the execution of the hardware under some conditions, which is not yet supported by our synthesis algorithm.

V. ONLINE SYNTHESIS OF APPLICATION SPECIFIC FUNCTIONAL UNITS

The captured data of the profiling unit is evaluated periodically. In case an instruction sequence exceeds the given runtime threshold, the synthesis is triggered and runs as a low-priority process concurrently to the application. Thus, it only

the graph. These structures occur at the entry of loops or in typical if-else-structures. Furthermore, the graph is annotated with branching information. This will allow the identification of the actually executed branch and the selection of the valid data when merging two or more branches by multiplexers. For if-else-structures, this approach reflects a speculative execution of the alternative branches. The condition of the if-statement is used to control the selection of one set of result values. Loop entry points are treated differently, as no overlapping or software pipelining of loop kernels is employed.
In the next step the graph is annotated with a virtual stack. This stack does not contain specific data, but the information about the producing instruction that would have created it. This allows the designation of connection structures between the different instructions, as the predecessor of an instruction may not be the producer of its input.
Afterwards, an analysis of access operations to local variables, arrays and objects takes place. This aims at loading data into the functional unit and storing it back to its appropriate memory after its execution. Therefore, a list of data that has to be loaded and a list of data that has to be stored is created.
The next step transforms the instruction graph into a hardware circuit. This representation fits precisely into our simulation. All arithmetic or logic operations are transformed into their abstract hardware equivalent. The introduced Φ-nodes are translated into multiplexer structures. The annotated branching information helps to connect the different branches correctly and to determine the appropriate control signal. Furthermore, registers and memory structures are introduced. Registers hold values at the beginning and the end of branches in order to synchronize different branches. Localization of memory accesses is an important measure to improve the performance of potential applications. In general, SFUs could also access the heap to read or write array elements, but this access would incur an overhead of several clocks. The memory structures are connected to the consumer/producer components of their corresponding arrays or objects. A datapath equivalent to the instruction sequence is the result of this step.
Execution of consecutive loop kernels is strictly separated.
occurs if spare computing time remains in the system, and Thus, all variables and object fields altered in the loop kernel
also cannot interfere with the running application. are stored in registers at the beginning of each loop iteration.
Arrays and objects may be accessed from different branches
A. Synthesis Algorithm that are executed in parallel. Thus, it is necessary to syn-
An overview of the synthesis steps is given in figure 3. The chronize access to the affected memory regions. Furthermore,
parts of the figure drawn in grey are not yet implemented. only valid results may be stored into arrays or objects. This
Firstly, an instruction graph of the given sequence is created. is realized by special enable signals for all write operations.
In case an unsupported instruction is detected the synthesis is The access synchronization is realized through a controller
aborted. Furthermore, a marker of a previously synthesized synthesis. This step takes the created datapath and all informa-
functional unit may be found. If this is the case it is necessary tion about timing and dependency of array and object access
to restore the original instruction information and then proceed operations as input. The synthesis algorithm has a generic
with the synthesis. This may happen if an inner loop has been interface which allows to work with different scheduling
mapped to hardware before, and then the wrapping loop shall algorithms. Currently we have implemented a modified ASAP
be synthesized as well. scheduling which can handle resource constraints and list
Afterwards, all nodes of the graph are scanned for their scheduling. The result of this step is a finite state machine
number of predecessors. In case a node has more than one (FSM) which controls the datapath and synchronizes all array
predecessor it is necessary to introduce specific Φ-nodes to and object access operations. Also the FSM takes care of the

54 May 17-19, 2010, Karlsruhe, Germany ReCoSoC 2010


appropriate execution of simple and nested loops.
As mentioned above, we do not have a full hardware implementation yet. Thus,
placement and routing for the CGRA are not required. We use a cycle-accurate
simulation of the abstract datapath created in the previous steps.
In case the synthesis has been successful, the new functional unit needs to
be integrated into the processor. If marker instructions of previously
synthesized FUs were found, the original instruction sequence has to be
restored. Furthermore, the affected SFUs have to be unregistered from the
processor and the hardware used by them has to be released.

[Fig. 3. Overview of synthesis steps — flowchart: Start Synthesis → Create
Instruction Graph → Insert Φ-Nodes → Virtual Stack Annotation → Analysis of
Input Data → Mapping to Physical HW → Predicate Memory Access → Scheduling →
Place & Route → Unregister marked SFUs → Register new SFU → Configure HW; a
parallel simulation path ends in Simulate; side branches: Reinsert Bytecode
and Remove Markers for a previously mapped inner loop, Abort on unsupported
bytecode, Mark SFU for release]

B. Functional Unit Integration

The integration of the synthesized functional unit (SFU) into the running
application consists of three major steps: (1) a token set has to be
generated which allows the token generator to use the SFU; (2) the SFU has
to be integrated into the existing circuit; and (3) the synthesized code
sequence has to be patched in order to access the SFU.
The token set consists of three parts: (1) the tokens that transport input
data to the SFU; these tokens are sent to the appropriate data sources
(e.g. the object heap). (2) the tokens that control the operation of the
SFU, i.e. that start the operation (which happens once the input data is
available) and emit the results. (3) the token set that stores the results
of the SFU operation in the corresponding memory.
In a next step, it is necessary to make the SFU accessible to the other
processor components. This requires registering it in the bus arbiter and
updating the token generator with the computed token sets. The token set
will be triggered by a reserved bytecode instruction.
Finally, the original bytecode sequence has to be replaced by the reserved
bytecode instruction. To allow multiple SFUs to co-exist, the reserved
bytecode carries the ID of the targeted SFU. Patching of the bytecode
sequence is done in such a way that the token generator can continue the
execution at the first instruction after the transformed bytecode sequence.
Also, it must be possible to restore the original sequence in case an
embracing loop nesting level shall be synthesized.
Now, the sequence is not processed in software anymore but by a hardware
SFU. Thus, it is necessary to adjust the profiling data of the affected code
sequence.
In [5], we have given further information and a more detailed description of
the integration process.

VI. EVALUATION

In previous research [6] we have evaluated the potential speedup of a
simplistic online synthesis with unlimited resources. To be more realistic,
in this work we assume a CGRA with a limited number of processing elements,
and a single shared memory for all arrays and objects. The scheduling itself
has been calculated by a longest-path list scheduling. The following data
set was gathered for every benchmark:
• its runtime, and therewith the gained speedup
• the number of states of the controlling state machine
• the number of different contexts regarding the CGRA
• the number of complex operations in those contexts
The reference value for all measurements is the plain software execution of
the benchmarks. Note: the mean execution time of a bytecode in our processor
is 3-4 clock cycles. This is in the same order as JIT-compiled code on IA32
machines.

A. Benchmark Applications

We have chosen applications from four different domains as benchmarks, in
order to test our synthesis algorithm.
The first group contains the cryptographic block ciphers Rijndael, Twofish,
Serpent and RC6. We evaluated the round key generation out of a 256 bit
master key, as well as the encryption of a 16 byte data block.
Another typical group of algorithms used in the security domain are hash
algorithms and message digests. We chose the Message Digest 5 (MD5) and two
versions of the Secure Hash Algorithm (SHA-1 and SHA-256) as
representatives, and evaluated the processing of sixteen 32-bit words.
Thirdly, we chose the Sobel convolution operator, a grayscale filter and a
contrast filter as representatives of image processing kernels. These three
filters operate on a dedicated pixel of an image, or on a pixel and its
neighbours. Thus, we measured the application of these filters to a single
pixel.
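The token-set generation and bytecode patching described in section V-B can
be sketched in code. This is an illustrative Python sketch, not AMIDAR's
actual token format or opcode encoding; the opcode value, the tuple shapes
and all function names are assumptions made for this example.

```python
# Hypothetical model of SFU integration: build the three token subsets
# (load inputs, start/control, store results) and patch the bytecode
# sequence with a reserved instruction carrying the SFU id.
RESERVED_OPCODE = 0xFE  # assumed placeholder for the reserved bytecode

def build_token_set(sfu_id, input_sources, output_targets):
    """Three parts, as in the paper: transport, control, store tokens."""
    load = [("SEND_TO_SFU", src, sfu_id) for src in input_sources]
    control = [("START_SFU", sfu_id)]  # fires once all inputs arrived
    store = [("STORE_FROM_SFU", sfu_id, dst) for dst in output_targets]
    return load, control, store

def patch_sequence(bytecode, start, end, sfu_id):
    """Replace bytecode[start:end] by the reserved instruction so execution
    continues at the first instruction after the transformed sequence. The
    original bytes are kept so the sequence can be restored if an embracing
    loop nesting level is synthesized later."""
    saved = bytecode[start:end]
    patched = bytecode[:start] + [(RESERVED_OPCODE, sfu_id)] + bytecode[end:]
    return patched, saved

code = ["iload_1", "iload_2", "iadd", "istore_3"]
patched, saved = patch_sequence(code, 0, len(code), sfu_id=7)
```

Keeping `saved` around mirrors the marker mechanism of section V-A: a
wrapping loop can only be synthesized if the original inner sequence can be
reinstated first.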



TABLE I
RUNTIME ACCELERATION OF BENCHMARK APPLICATIONS

Round Key        Rijndael              Twofish               RC6                   Serpent
Generation       Clock Ticks  Speedup  Clock Ticks  Speedup  Clock Ticks  Speedup  Clock Ticks  Speedup
plain software   17760        -        525276       -        61723        -        44276        -
4 operators      4602         3.86     43224        12.15    3725         16.57    6335         6.99
8 operators      4284         4.15     35130        14.95    3459         17.84    6245         7.09
12 operators     4337         4.09     34280        15.32    3459         17.84    6230         7.11
16 operators     4337         4.09     34112        15.40    3459         17.84    6230         7.11

Single Block     Rijndael              Twofish               RC6                   Serpent
Encryption       Clock Ticks  Speedup  Clock Ticks  Speedup  Clock Ticks  Speedup  Clock Ticks  Speedup
plain software   21389        -        12864        -        17371        -        34855        -
4 operators      6230         3.43     8506         1.51     2852         6.09     3278         10.63
8 operators      6181         3.46     8452         1.52     2810         6.18     3273         10.65
12 operators     6167         3.47     8452         1.52     2768         6.28     3273         10.65
16 operators     6167         3.47     8452         1.52     2768         6.28     3273         10.65

Hash & Digest    SHA-1                 SHA-256               MD5
Algorithms       Clock Ticks  Speedup  Clock Ticks  Speedup  Clock Ticks  Speedup
plain software   23948        -        47471        -        11986        -
4 operators      4561         5.25     3619         13.12    1485         8.07
8 operators      4561         5.25     3484         13.63    1485         8.07
12 operators     4561         5.25     3484         13.63    1485         8.07
16 operators     4561         5.25     3484         13.63    1485         8.07

Image            Sobel Filter          Grayscale Filter      Contrast Filter
Processing       Clock Ticks  Speedup  Clock Ticks  Speedup  Clock Ticks  Speedup
plain software   6930         -        236          -        608          -
4 operators      1110         6.24     59           4.00     90           6.76
8 operators      1110         6.24     59           4.00     90           6.76
12 operators     1110         6.24     59           4.00     90           6.76
16 operators     1110         6.24     59           4.00     90           6.76

JPEG             JPEG-Encoder          Color Space Transf.   2-D Forward DCT       Quantization
Encoding         Clock Ticks  Speedup  Clock Ticks  Speedup  Clock Ticks  Speedup  Clock Ticks  Speedup
plain software   17368663     -        3436078      -        23054        -        7454         -
4 operators      4737944      3.67     323805       10.61    2743         8.40     1816         4.10
8 operators      4645468      3.74     292889       11.73    2572         8.96     1816         4.10
12 operators     4620290      3.76     277431       12.39    2545         9.06     1816         4.10
16 operators     4612561      3.77     269702       12.74    2545         9.06     1816         4.10
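The Speedup columns in Table I are simply the ratio of plain-software clock
ticks to accelerated clock ticks, rounded to two decimals. A quick
spot-check in plain Python, using values taken from the table:

```python
# Speedup = plain-software clock ticks / accelerated clock ticks.
# Spot-check two entries of Table I: Rijndael round key generation with
# 4 operators, and the whole JPEG encoder with 16 operators.
def speedup(plain_ticks, accelerated_ticks):
    return round(plain_ticks / accelerated_ticks, 2)

rijndael_4ops = speedup(17760, 4602)     # table reports 3.86
jpeg_16ops = speedup(17368663, 4612561)  # table reports 3.77
```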

Finally, we evaluated a complete application and encoded a given 160x48x24
bitmap into a JPEG image. The computation kernels of this application are
the color space transformation, the 2-D forward DCT and the quantization. We
did not downsample the chroma parts of the image.

B. Runtime Acceleration

Apart from the contrast and grayscale filter, all applications contained
either method invocations or accesses to multidimensional arrays. As we
mentioned above, the synthesis does not support these instruction types yet.
In order to show the potential of our algorithm, we inlined the affected
methods and flattened the multidimensional arrays to one dimension.
The subsequent evaluations showed promising results. Speedups between 3.5
and 12.5 were achieved for most kernels. The encryption of the Twofish
cipher is an outlier, caused by a large communication overhead. Analogously,
several applications, e.g. SHA-256, achieved better results owing to a
favorable communication/computation ratio. The JPEG encoding application as
a whole gained a speedup of 3.77, which fits into the overall picture. The
runtime results for all benchmarks are shown in table I.

C. Schedule Complexity

In a next step, we evaluated the complexity of the controlling units that
were created by the synthesis. Therefore, we measured the size of the finite
state machines that control every synthesized functional unit. Every state
is related to a specific configuration of the reconfigurable array. In the
worst case, all of those contexts would be different. Thus, the size of a
controlling state machine is an upper bound for the number of different
contexts.
Afterwards, we created a configuration profile for every context, which
reflects every operation that is executed within the related state.
Accordingly, we removed all duplicates from the set of configurations. The
number of remaining elements is a lower bound for the number of contexts
that are necessary to drive the functional unit. The effective number of
necessary configurations lies between those two bounds, as it depends on the
place-and-route results of the affected operations.
The context information for the benchmarks is presented in table II. It
shows the size of the controlling finite state machine (States) and the
number of actually different contexts (Contexts) for each of our benchmarks.
It shows that only three of eighteen state machines on an array with 16
processing



TABLE II
COMPLEXITY OF THE SCHEDULES OF BENCHMARK APPLICATIONS

Round Key        Rijndael          Twofish           RC6               Serpent
Generation       States  Contexts  States  Contexts  States  Contexts  States  Contexts
4 operators      57      42        230     110       48      21        124     37
8 operators      55      31        148     113       44      20        106     43
12 operators     55      31        130     91        44      20        103     42
16 operators     55      31        122     83        44      20        103     42

Single Block     Rijndael          Twofish           RC6               Serpent
Encryption       States  Contexts  States  Contexts  States  Contexts  States  Contexts
4 operators      78      37        46      33        25      17        153     54
8 operators      71      31        40      26        23      20        152     54
12 operators     69      26        40      22        23      19        152     54
16 operators     69      23        40      20        23      19        152     54

Hash & Digest    SHA-1             SHA-256           MD5
Algorithms       States  Contexts  States  Contexts  States  Contexts
4 operators      138     29        107     28        531     20
8 operators      138     29        92      31        531     20
12 operators     138     29        92      31        531     20
16 operators     138     29        92      30        531     20

Image            Sobel Filter      Grayscale Filter  Contrast Filter
Processing       States  Contexts  States  Contexts  States  Contexts
4 operators      17      13        13      9         56      18
8 operators      17      13        13      9         56      18
12 operators     17      13        13      9         56      18
16 operators     17      13        13      9         56      18

JPEG             JPEG-Encoder      Color Space Transf.  2-D Forward DCT   Quantization
Encoding         States  Contexts  States  Contexts     States  Contexts  States  Contexts
4 operators      132     64        22      17           89      43        16      11
8 operators      109     61        18      15           70      42        16      11
12 operators     110     60        17      15           67      39        16      11
16 operators     105     55        17      14           67      36        16      11

elements consist of more than 128 states. Furthermore, the bigger part of
the state machines contains a significant number of identical states
regarding the executed operations. Thus, the actual number of contexts is
well below the number of states.

D. Resource Utilization

Another characteristic of the synthesized control units is the distribution
of multi-cycle operations like multiplication or division (complex
operations) within the created contexts. Table III shows the aggregate
distribution of complex operations within the schedules. It shows a total
number of 1887 contexts for all of our benchmarks, as we scheduled them for
a reconfigurable array with four operators. Furthermore, it can be seen that
a large set of 1265 contexts did not contain any complex operation, and that
the bigger part of the remaining contexts utilized only one or two complex
operations, which sums up to 1725 contexts utilizing two or fewer complex
operations. Hence, only 162 contexts used more than two complex operators.
Overall, it can be seen that the contexts with at most one complex operation
cover more than 84% of all contexts, regardless of the reconfigurable
array's size, and those with at most two cover more than 91%. Thus, it is
reasonable to reduce the complexity of the reconfigurable array, as a
full-fledged homogeneous array structure may not be necessary. Hence, the
chip size of the array would shrink. Nonetheless, this would also decrease
the gained speedup. The following subsection shows the influence of such a
limitation on the runtime and speedup, with the help of small modifications
to the constraints of our measurements.

E. Exemplified Resource Limitation

The results in the preceding subsections suggest the use of a heterogeneous
array, as more than 90% of the contexts that were created by our synthesis
algorithm used two or fewer complex operators. Such an array would provide
full-fledged functionality on a small number of processing elements; all
other operators could be cut down to combinational functions. We extended
our implemented scheduling algorithms to show the influence of such a
limitation, by confining the number of complex processing elements inside
the array.
The Twofish benchmark, the 2-D forward DCT of the JPEG encoding, and the
JPEG encoding itself have been chosen to show the influence of the described
resource limitations on a reconfigurable array with 16 processing elements,
as they utilized the largest numbers of complex operators. Limiting the
number of full-fledged processing elements to four did not result in a
noteworthy drop of the speedup. However, a further confinement to two
complex operators inside the array delivered noteworthy results: the
achieved speedups dropped by 4.5% (JPEG encoding), 12.1% (2-D forward DCT)
and 10.7% (Twofish round key generation). The results of the
resource-limited benchmarks are shown in table IV.
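The bound argument of section VI-C — the state count as an upper bound and
the number of distinct configuration profiles as a lower bound on the needed
contexts — can be sketched as follows. This is purely illustrative; the
profile strings are made up and do not reflect AMIDAR's configuration
encoding.

```python
# Every FSM state carries a configuration profile (the operations executed
# in that state). The state count is an upper bound on the number of
# required contexts; after removing duplicate profiles, the count of
# distinct profiles is a lower bound. The effective number lies in between,
# depending on place-and-route.
def context_bounds(state_profiles):
    lower = len(set(state_profiles))  # distinct configuration profiles
    upper = len(state_profiles)       # one context per state, worst case
    return lower, upper

profiles = ["add|mul", "add|mul", "sub|nop", "add|mul", "sub|nop"]
bounds = context_bounds(profiles)
```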



TABLE III
OVERALL UTILIZATION OF COMPLEX PROCESSING ELEMENTS IN SYNTHESIZED FUNCTIONAL UNITS

Configuration   Contexts   0 complex ops   ≤1 complex op   ≤2 complex ops   >2 complex ops
4 operators     1887       1265 (67%)      1591 (84%)      1725 (91%)       162 (9%)
8 operators     1722       1206 (70%)      1471 (85%)      1563 (91%)       159 (9%)
12 operators    1699       1211 (71%)      1474 (87%)      1557 (92%)       142 (8%)
16 operators    1684       1219 (72%)      1476 (88%)      1556 (92%)       130 (8%)
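The middle columns of Table III are cumulative: for each configuration they
count the contexts containing zero, at most one, at most two, and more than
two complex operations. A small sketch that rebuilds the 4-operator row from
the per-context counts implied by the table (synthetic distribution, for
illustration only):

```python
# Rebuild the cumulative columns of Table III from a per-context list of
# complex-operation counts. The synthetic distribution reproduces the
# 4-operator row: 1265 contexts with zero complex operations, 326 with
# exactly one, 134 with exactly two, 162 with more than two.
def cumulative_counts(ops_per_context):
    le = lambda k: sum(1 for n in ops_per_context if n <= k)
    total = len(ops_per_context)
    return total, le(0), le(1), le(2), total - le(2)

distribution = [0] * 1265 + [1] * 326 + [2] * 134 + [3] * 162
row = cumulative_counts(distribution)
```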

TABLE IV
INFLUENCE OF RESOURCE CONSTRAINTS ON SPEEDUP OF JPEG ENCODING AND SELECTED APPLICATION KERNELS

Complex             JPEG Encoding         2-D Forward DCT       Twofish Round Key Gen.
Operators           Clock Ticks  Speedup  Clock Ticks  Speedup  Clock Ticks  Speedup
without synthesis   17368663     -        23054        -        525276       -
unrestricted        4612561      3.77     2545         9.06     34112        15.40
4 operators         4694575     3.70      2644         8.72     34726        15.13
2 operators         4819664     3.60      2897         7.96     38230        13.74
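The percentage drops quoted in section VI-E are relative to the unrestricted
speedup of Table IV. A quick check of two of the reported values
(illustrative Python):

```python
# Relative speedup drop = (unrestricted - restricted) / unrestricted,
# using the speedup values of Table IV.
def relative_drop(unrestricted, restricted):
    return round(100.0 * (unrestricted - restricted) / unrestricted, 1)

jpeg_drop = relative_drop(3.77, 3.60)  # section VI-E reports 4.5%
dct_drop = relative_drop(9.06, 7.96)   # section VI-E reports 12.1%
```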

VII. CONCLUSION

In this article we have shown an online synthesis algorithm for AMIDAR
processors. The presented approach targets maximum simplicity and runtime
efficiency of all used algorithms. It is capable of synthesizing functional
units fully automatically at runtime under given resource constraints. The
target technology for our algorithm is a coarse-grained reconfigurable
array. Initially, we assumed a reconfigurable fabric with homogeneously
formed processing elements and one single shared memory for all objects and
arrays. Furthermore, we used list scheduling as the scheduling algorithm.
We evaluated our algorithm by examining four groups of benchmark
applications. On average across all benchmarks, a speedup of 7.78 was
achieved.
Comparing the runtimes of the benchmarks with respect to the size of the
underlying reconfigurable fabric yields notable results. An array of eight
processing elements delivers the maximum speedup for most benchmarks; the
improvements gained through the use of a larger array are negligible. Thus,
saturation of the speedup was achieved with a surprisingly moderate hardware
effort.
Furthermore, we presented the complexity of the synthesized finite state
machines. This evaluation showed that most of our benchmarks could be driven
by fewer than 128 states, and that more than 90% of the corresponding
contexts contained two or fewer complex operations.
Hence, we constrained the number of complex processing elements inside our
array to show the influence of such a limitation on the speedup. A limit of
four complex operations per context barely affected the speedup, while a
limit of two complex operations decreased the speedup of the evaluated
benchmarks by approximately 5% - 12%.

VIII. FUTURE WORK

As the full potential of our synthesis algorithm has not been reached yet,
future work will concentrate on improving it in multiple ways. This includes
the implementation of access to multidimensional arrays and the inlining of
invoked methods at synthesis time. Additionally, we will explore the effects
of instruction chaining in synthesized functional units. Furthermore, we are
planning to overlap the transfer of data to a synthesized functional unit
with its execution. Also, we will introduce an abstract description layer
into our synthesis. This will allow easier optimization of the algorithm
itself and will open up the synthesis to a larger number of instruction
sets.

REFERENCES

[1] L. Bauer, M. Shafique, S. Kramer, and J. Henkel. RISPP: Rotating
instruction set processing platform. In DAC, pages 791–796, 2007.
[2] A. C. S. Beck and L. Carro. Dynamic reconfiguration with binary
translation: breaking the ILP barrier with software compatibility. In DAC,
pages 732–737, 2005.
[3] K. Compton and S. Hauck. Reconfigurable computing: a survey of systems
and software. ACM Comput. Surv., 34(2):171–210, 2002.
[4] H. Corporaal. Microprocessor Architectures: From VLIW to TTA. John
Wiley & Sons, Inc., New York, NY, USA, 1997.
[5] S. Döbrich and C. Hochberger. Towards dynamic software/hardware
transformation in AMIDAR processors. it - Information Technology, pages
311–316, 2008.
[6] S. Döbrich and C. Hochberger. Effects of simplistic online synthesis in
AMIDAR processors. In ReConFig, pages 433–438, 2009.
[7] S. Gatzka and C. Hochberger. A new general model for adaptive
processors. In ERSA, pages 52–62, 2004.
[8] S. Gatzka and C. Hochberger. Hardware based online profiling in AMIDAR
processors. In IPDPS, page 144b, 2005.
[9] S. Gatzka and C. Hochberger. The organic features of the AMIDAR class
of processors. In ARCS, pages 154–166, 2005.
[10] C. Hochberger, R. Hoffmann, K.-P. Völkmann, and S. Waldschmidt. The
cellular processor architecture CEPRA-1X and its configuration by CDL. In
IPDPS, pages 898–905, 2000.
[11] A. Koch and N. Kasprzyk. High-level-language compilation for
reconfigurable computers. In ReCoSoC, pages 1–8, 2005.
[12] R. L. Lysecky and F. Vahid. Design and implementation of a
MicroBlaze-based WARP processor. ACM Trans. Embedded Comput. Syst.,
8(3):1–22, 2009.
[13] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins. ADRES:
An architecture with tightly coupled VLIW processor and coarse-grained
reconfigurable matrix. In FPL, pages 61–70, 2003.
[14] B. Mei, S. Vernalde, D. Verkest, H. D. Man, and R. Lauwereins.
Exploiting loop-level parallelism on coarse-grained reconfigurable
architectures using modulo scheduling. In DATE, pages 10296–10301, 2003.
[15] E. Sotiriades and A. Dollas. A general reconfigurable architecture for
the BLAST algorithm. J. VLSI Signal Process. Syst., 48(3):189–208, 2007.
[16] S. Vassiliadis and D. Soudris, editors. Fine- and Coarse-Grain
Reconfigurable Computing. Springer, 2007.
[17] K. Wu, A. Kanstein, J. Madsen, and M. Berekovic. MT-ADRES:
Multithreading on coarse-grained reconfigurable architecture. In ARC, pages
26–38, 2007.



ISRC: a runtime system for heterogeneous
reconfigurable architectures
Florian Thoma, Jürgen Becker
Institute for Information Processing Technology
Karlsruhe Institute of Technology
Germany
Email: {florian.thoma, juergen.becker}@kit.edu

Abstract—By definition, runtime systems bridge the gap between the
application and the operating system layer on processor-based hardware.
Novel paradigms like reconfigurable computing and heterogeneous multi-core
systems-on-chip require additional services from the runtime system in
comparison to the traditional approaches which are well established in
single- or homogeneous multi-core systems. Especially the different
characteristics of the target architectures provided in a multi-core
System-on-Chip can be exploited for power- and performance-efficient task
execution. Furthermore, the algorithm for application task scheduling on
that kind of hardware architecture has to consider the varying timing of the
task realizations on the different hardware and, additionally, dynamic
effects if runtime reconfiguration is used for loading functional blocks
onto the hardware modules on demand. The Intelligent Services for
Reconfigurable Computing (ISRC) approach shows the feasibility of handling
the complex system environment of a heterogeneous hardware architecture and
presents first results with the MORPHEUS chip, consisting of a processor,
coarse-, medium- and fine-grained reconfigurable hardware and a complex
communication infrastructure.

I. INTRODUCTION

During the last years the field of Reconfigurable Computing (RC) has
expanded and the architectures have gone beyond purely FPGA-based systems.
At the same time the usage scenarios have extended to other application
domains which have higher requirements in terms of adaptivity to user input
or other environmental influences. This dynamic system behavior impedes the
static partitioning, allocation and scheduling of the application at design
time. Hence the necessity to make as many decisions as possible at runtime.
Partitioning and retargeting to different engines are not feasible at
runtime, but an alternative is to provide multiple implementations for each
partition at design time. Scheduling and allocation of these alternatives
can then be done by a runtime system depending on the system state.
In this paper we present our Intelligent Services for Reconfigurable
Computing (ISRC) as an implementation of this approach. The subsequent
section II places this work in relation to current work in the field of
reconfigurable computing. The following section III provides an overview of
the MORPHEUS SoC architecture. The corresponding MORPHEUS toolchain is
presented in section IV. The main part of the paper, which describes the
runtime system, is formed by section V. An exemplary usage scenario is
described in section VI. The conclusion and an outlook on future work are
given in section VII.

II. RELATED WORK

Reconfigurable Computing is an active and dynamic field of research. Systems
like the Berkeley Emulation Engine 2 (BEE2) [1] and the Erlangen Slot
Machine (ESM) [2] are examples of homogeneous partially reconfigurable
systems using FPGAs. MORPHEUS [3] is the first truly heterogeneous
reconfigurable computing architecture, exploiting the strengths of several
different reconfigurable architectures within a Configurable System-on-Chip
(CSoC). During the last years it has become apparent that the use of
RC-specific operating systems is required for efficient utilization of the
inherent computing power by application designers [4], [5], [6]. Hthreads
[7], [8] covers only FPGA targets without use of reconfiguration, whereas
its successor [9] is focused on heterogeneous many-core processors. In the
field of Heterogeneous Computing [10], [11], [12] there has been work done
on scheduling, but since these machines are typically still processor-based
it does not take the reconfiguration overhead which is inherent in RC into
account. There has been work [13] on scheduling between μP and FPGA, but it
focuses on the area optimization on the FPGA side. ReconOS [14] shows a way
to integrate hardware tasks into a real-time operating system but does not
handle scheduling and reconfiguration of tasks, as the underlying hardware
model is static.

III. SYSTEM ARCHITECTURE

The main components of the MORPHEUS System-on-Chip (SoC) are three different
heterogeneous reconfigurable engines (HREs), which enable high flexibility
for application design, and an ARM9 embedded RISC processor, which is
responsible for triggering data, control and configuration transfers between
all resources in the system [15]. Resources are memory units, IO
peripherals, and several HREs, each residing in its own clock domain with a
programmable clock frequency (see figure 1). All system modules are
interconnected via multilayer AMBA buses and/or a Network-on-Chip (NoC). All
data transfers between HREs and on- and off-chip memories may be either DNA
(Direct Network Access) triggered, DMA triggered, or managed directly by the
ARM.
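The introduction's core idea — several implementations per task, with the
runtime system picking one according to the system state — can be sketched
as follows. This is a hypothetical illustration, not ISRC's actual
scheduler; the engine names match the MORPHEUS HREs, but the cost model and
all numbers are invented.

```python
# Pick the cheapest currently runnable implementation of a task, charging a
# reconfiguration penalty when the target engine does not already hold the
# needed configuration. Purely illustrative cost model.
from dataclasses import dataclass

@dataclass
class Implementation:
    engine: str       # "ARM9", "DREAM", "XPP" or "FlexEOS"
    exec_time: int    # estimated execution time on that engine
    reconf_time: int  # time to load the configuration onto the engine

def pick(impls, free_engines, already_loaded):
    candidates = [i for i in impls if i.engine in free_engines]
    cost = lambda i: i.exec_time + (0 if i.engine in already_loaded
                                    else i.reconf_time)
    return min(candidates, key=cost, default=None)

impls = [Implementation("ARM9", 100, 0),
         Implementation("XPP", 10, 80),
         Implementation("FlexEOS", 25, 15)]
choice = pick(impls, free_engines={"ARM9", "XPP", "FlexEOS"},
              already_loaded={"DREAM"})
```

With these invented numbers the FlexEOS version wins (25 + 15 = 40), even
though the XPP version executes faster, because the XPP reconfiguration
penalty dominates; this is exactly the kind of trade-off a runtime scheduler
for reconfigurable engines has to weigh.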



[Fig. 1. MORPHEUS architecture — block diagram: ARM9, DMA, off-chip memory,
IC, DNA, bridge, NoC controller and PCM, together with the HREs DREAM, XPP
and FlexEOS and their local memories, interconnected by the AMBA master/data
bus, the AMBA configuration bus and the NoC]

The DREAM core is a medium-grained reconfigurable array consisting of 4-bit
oriented ALUs, where up to four configurations may be kept concurrently in
shadow registers. This component mostly targets instruction-level
parallelism, which can be automatically extracted from a C-subset language
called Griffy-C.
The FlexEOS is a lookup-table-based fine-grained reconfigurable device, also
known as an embedded Field Programmable Gate Array (eFPGA). As any FPGA, it
is capable of mapping arbitrary logic up to a certain complexity if the
register and memory resources match the specifics of the implemented logic.
XPP-III is a data processing architecture based on a hierarchical array of
coarse-grained, adaptive computing elements called Processing Array Elements
(PAEs) and a packet-oriented communication network. An XPP-III core contains
a rectangular array of ALU-PAEs and RAM-PAEs for data-flow processing.
Crossbars tightly couple the array with a column of Function-PAEs (FNC-PAEs)
for control-flow oriented code. More regular streaming algorithms like
filters or transforms are efficiently implemented on the data-flow part of
the XPP-III array. Flow graphs of arbitrary shape can be directly mapped to
ALUs and routing connections, resulting in a parallel, pipelined
implementation. Events also enable conditional operation and loops. One of
the strengths of the XPP array originates from fast dynamic runtime
reconfiguration.
The integration of the ST NoC with the various reconfigurable components
requires an innovative interconnect infrastructure [16]. Processing cores
and storage units are connected through the spidergon topology [17], which
promises to deliver an optimal cost/performance trade-off for multi-core
designs. In this proprietary topology, the IP blocks are arranged in a sort
of ring where each is connected to its clockwise and its counter-clockwise
neighbours as in a polygonal structure. In addition, each IP block is also
connected directly to its diagonal counterpart in the network, which allows
the routing algorithm to minimize the number of nodes that a data packet has
to traverse before reaching its destination.
The MORPHEUS SoC features three levels of memory. Each HRE has several
dual-clock Data Exchange Buffers (DEBs) for local storage and communication
with the NoC. The buffers can be accessed as normal RAM or as FIFOs. The
second level is the on-chip SRAM used by the ARM core and the HREs. The
third level is provided by external SRAM. The Predictive Configuration
Manager (PCM) [18] manages the reconfiguration overhead. By caching and
prefetching configurations, it minimizes the reconfiguration latencies and
offers a unified interface for heterogeneous engines as well as high-level
services for the runtime system (see section V).

IV. TOOLCHAIN

The objectives of the MORPHEUS toolchain are to satisfy embedded computing
requirements within the context of Reconfigurable Computing architecture
solutions. These objectives are portability (avoiding platform adherence),
computing performance efficiency, flexibility and application programming
productivity. The toolchain (figure 2) combines the large set of
technologies that are necessary to obtain an integrated design flow from
high-level application programming (such as C language and graphical
interfaces) to hardware and software implementation: compilation, a
real-time operating system (RTOS), data-parallel reorganization,
architectural and physical synthesis, and formal specification.

[Fig. 2. Integrated design flow — from a C-based application description via
Molen-approach compilation to ARM processor code, and from an accelerated
function description via formal specification, memory-oriented mapping and
architectural and physical synthesis to RU configurations, with dynamic
reconfiguration handled by the RTOS]

The toolchain entry point for the application programmer follows an enhanced
version of the Molen paradigm [19]. According to this paradigm, the
programmer describes the application in a classical C-language program and
just annotates the functions identified as preferably implemented on an HRE.
The compiler then generates code for the host processor of the system,
statically optimizing the scheduling of configurations and executions of the
accelerated functions on the HREs. The compiler also generates a
Configuration Call Graph that will be used as a basis for a more precise
dynamic scheduling (see section V). Concerning the accelerated function
implementation, the toolchain offers a high-level graphical capture. The
accelerated function is supposed to be of data-streaming computing type. The
graphical interface thus offers a way to best express the parallelism
inherent to the function. The benefit of such an integrated toolchain is to
permit a fast and thus efficient design space exploration: partitioning
trials (identification of accelerated parts) and allocation trials
(implementation unit

60 May 17-19, 2010, Karlsruhe, Germany ReCoSoC 2010


choice) with quick feedback, since implementation results are obtained with the toolchain itself.

The spatial design part of the toolchain concerns the accelerated function implementation on a HRE from high-level specification to retargetable implementation synthesis [20]. This includes data movement management, since the HREs are defined to work on the Main Memory Data Structure (MMDS): data circulates on a loop from the MMDS to local memories, then back to the MMDS. This data movement management is performed by DMA engines that transfer data to local memories according to predefined address patterns.

The design flow starts with the high-level specification of an accelerated function required by Molen. This specification (usually in C language) is currently translated manually into an array transformation formalism [21] through the interactive SPEAR [22] graphical interface as entry point. This graphical view contains processes (elementary functions) connected together to form the accelerated function. The SPEAR tool automatically generates the appropriate communication processes between the computing processes. A mechanism implements the communication processes and manages specific data array manipulation, communication and scheduling. SPEAR addresses the programming of the DMA unit through parameter definitions. The synthesis serves to finalize the design with the implementation on the chip. In order to flow down from this high-level specification to the architectural and then physical levels we use a common high-level synthesis intermediate description, the Control Data Flow Graph (CDFG), defined for our specific purpose. The global CDFG is the representation of the complete application mapped and scheduled on a particular HRE. This CDFG includes the connectors mentioned above and some elementary CDFGs representing the computation processes written in C language. The front-end analysis module of CASCADE [23] elaborates those CDFGs.

An important challenge is the adaptation of the computation kernels to different reconfigurable units. The adaptation will result in a similar graph carrying nodes denoting hardware primitives known to exist in a target architecture (low-level CDFG). Reconfigurable architectures are specified using a generic model extrapolated from fine-grain FPGAs (Madeo [24]) and an extensible set of classes for resources (LUTs, operators, memories, ...). A process-based structure is allowed, with additional access to local memories.

V. RUNTIME SYSTEM

This section describes the mechanisms used to control the dynamic reconfiguration aspects of the MORPHEUS system. The base is formed by a RTOS and is topped by an allocation and scheduling system for reconfigurable operations.

A. Real-time operating system

eCos was chosen for this project as the base for the hardware abstraction layer (HAL) and RTOS core as a compromise between the rich set of features of a Linux kernel and the simplicity of minimalist kernels like TinyOS or μC/OS-II. It is highly configurable and offers a choice between its own API and others like POSIX and μITRON. The RTOS has a layered structure which is shown in figure 3. The bottom layer is the HAL, which provides a more uniform access to the reconfigurable hardware and the system infrastructure. For the transfer of the parameters, it provides virtual exchange registers (XR) for the compiler which are mapped to the parameter registers in the HREs. It also provides the basis for a pipeline service between the HREs. The middle layer is the RTOS core. It provides the basic operating system services, including multithreading, that are not related to dynamic reconfiguration and is based on the already existing eCos RTOS. The top layer is formed by the dynamic reconfiguration framework called Intelligent Services for Reconfigurable Computing (ISRC). It provides the services for the configuration and execution of operations on the HREs.

Fig. 3. Structure of the runtime system

B. Intelligent services for reconfigurable computing

1) ISRC overview: The dynamic control of the reconfiguration units is performed by the ISRC layer on top of the RTOS and the PCM, which is a HW-implemented unit supporting the RTOS and is described in section III. ISRC performs the following actions with information received from the PCM and the application / compiler:

• Allocation decision (choice of the implementation on the various reconfigurable units FlexEOS, XPP, DREAM). If there are functionally equivalent implementations of the reconfigured operation, it allows a choice to be made at runtime depending on the platform / application status


• Priority calculation of pending operations
• Task execution status management
• Resource request to the PCM for fine dynamic scheduling

Information needed from the compiler is contained in the Configuration Call Graph (CCG). Other information used by the RTOS consists of:

• Result of execution branching decisions
• Synchronization points
• Task parameters

The configuration is handled by ISRC and there is no direct communication between the application and the PCM. The communication between the application and the reconfigurable units is also only indirect and handled by ISRC.

The control of an operation is handled with the help of the HRE status register, by writing commands like start execution or stop into this register. The operation updates its status in the register whenever a state change occurs and indicates the change by an interrupt. The dynamic scheduler not only schedules the different threads but also computes the priorities of pending and near-future operations and schedules them for configuration and execution on the HREs. The priorities are communicated to the PCM to allow the speculative prefetching of configurations. The PCM is commanded by the scheduler to configure the reconfigurable units. The allocation of the heterogeneous reconfigurable engines (HREs) to the different operations is closely linked to the scheduling, to consider the configuration time / execution time trade-off if the preferred heterogeneous reconfigurable engine is already in use. As it is preferable (for implementation and control reasons) that the HREs do not perform memory accesses to the system memory, the operating system controls the DMA controller of the system. This allows feeding the HREs with data and transferring the results back to the system memory. The information about the memory arrays is provided by the application. At the same time, the runtime system ensures data consistency by configuring the NoC to use available buffers. The NoC then controls the data flow and the buffer switching on its own.

2) ISRC inputs: Input for the RTOS consists of the configuration call graph (CCG), the bitstream library and the implementation / configuration matrix. The bitstream library contains the available operations, their implementations and the properties of these implementations like throughput, delay, size and power. The implementation / configuration matrix contains the list of the implementations used within a configuration and also indicates which configuration contains which implementation. Output of the RTOS are the prefetching priorities and configuration commands for the configuration manager, execution commands for the operations on HREs and control information for the NoC and the DMA controller.

3) Controlling HREs: The MORPHEUS SoC contains three vastly different reconfigurable units. The FlexEOS from M2000 is a fine-grained embedded FPGA. The DREAM from ARCES is a middle-granular reconfigurable unit with very fast context switches. The XPP from PACT is a coarse-granular array of processing elements optimized for pipelining algorithms. With the difference in architecture comes a big difference in the configuration and execution control mechanisms. This has to be transparent for the user of design tools higher up the tool-flow. The difference in the configuration mechanisms is handled by a dedicated PCM unit. The services for reconfigurable computing sit on top of this to provide a uniform interface for the design tools. The control of the HREs extends the SET / EXECUTE concept of the Molen compiler. The extensions have been necessary with regard to concurrently running threads competing for resources. In the original approach by Molen, the instruction set of the processor is extended with special machine instructions. These instructions are replaced and extended by operating system calls.

• SET: the compiler notifies the operating system as soon as possible about the next pending operation to configure. The operating system uses this information to prepare the configuration. This allocation is done dynamically by ISRC during runtime using the alternative implementations of the operation, which are provided by the designer, on different reconfigurable units, depending on available resources, execution priority and configuration alternatives. The designer can also provide several implementations on the same heterogeneous reconfigurable engine, which are optimized for different criteria. The SET system call is non-blocking and returns immediately.
• EXECUTE: the compiler demands the execution of an operation. The scheduler of ISRC then decides, according to the current state of the system and the scheduling policy, when to execute this operation. The EXECUTE system call is non-blocking and returns immediately.
• BREAK: wait for the completion of an operation for synchronization. The parallel running of operations which are mapped on the HREs with the code on the general purpose processor leads to synchronization issues. Therefore the RTOS provides a BREAK system call which waits for the completion of the referenced operations. The compiler has to assure data consistency between parallel operations by using this system call. For sequential consistency, there is always a need for a BREAK corresponding to an EXECUTE system call before its data is further processed. It follows that the BREAK system call has to be a blocking operation which returns when all indicated operations are completed.
• MOVTX and MOVFX: transfer data from the ARM processor to a specific exchange register of the HRE and in reverse. These instructions are non-blocking and return immediately.
• RELEASE: the configured operation is no longer needed and can be discarded. If an operation is no longer used, for example because the calling loop has been left, the allocated resources have to be freed by the compiler with the RELEASE system call. The RELEASE system call is non-blocking and returns immediately.



The usage of these system calls is best demonstrated with the following code example:

#pragma MOLEN_FUNCTION 1
void func1(int para1, int para2){...}
int p1,p2;
...
func1(p1,p2);
...

which is replaced by the compiler with the following code:

...
SET(1)
...
MOVTX(1,0,&p1)
MOVTX(1,1,&p2)
EXEC(1)
...
BREAK(1)
MOVFX(1,0,&p1)
MOVFX(1,1,&p2)
...
RELEASE(1)
...

4) Dynamic scheduling and allocation: The MORPHEUS platform is intended for many conceivable applications. The range goes from purely static data stream processing to highly dynamic reactive control systems. These control systems react to changes of the environment like user interaction, requested quality of service, radio signal strength or battery level in mobile applications. These changes can significantly alter the execution path of the application and the priority of the various operations. Static scheduling and allocation is not sufficient for such reactive systems. The topics of scheduling and allocation are tightly related on reconfigurable systems. For this reason the runtime system provides a combined dynamic scheduler and allocator. It determines during runtime the schedule and allocation of the operations requested within one thread and in parallel threads. The combined scheduler / allocator determines which alternative implementations of the requested operations are available, which can include a fall-back software implementation, and their associated costs. The usage of configuration call graphs (see section 3.4) provides knowledge about the structure of the application. All this information is used to update the schedule of pending operations with the goal of improving metrics like overall throughput and latency. When reasonable, the scheduler moves operations from an over-loaded HRE to another HRE between consecutive calls. A requirement for the usefulness of dynamic allocation is that the application designer offers a choice of implementations for as many operations as possible. The user has a choice between various scheduling and allocation strategies.

a) First Come First Serve: This is the most basic scheduling methodology. The only criterion for scheduling is the sequence of requests. When a task is finished, the next operation from the queue is examined: if the needed HRE is available it runs, otherwise the scheduler waits till the needed HRE is freed. When the operation is available for several HREs, they are all tried in sequence of falling throughput.

b) First Come First Free: Instead of one global queue, this methodology uses one queue per HRE. At request time the operation is added to each HRE queue where it is available. As soon as a HRE is finished, it is used by the next operation in its queue and the operation is removed from all other queues. This method maximizes the overall capacity utilization of each HRE but can result in frequent reconfigurations.

c) Shortest Configuration First: The delay for configuration depends on two factors. The first one is the size of the configuration bitstream. It also depends on the transfer speed of the bitstream to the HRE, which again depends on the used memory of the system. The off-chip SRAM and flash memories are significantly slower than the on-chip configuration SRAM managed by the Predictive Configuration Manager. The scheduler requests the prefetching state from the PCM and calculates a configuration time. An available HRE is then configured for the operation with the shortest configuration time. This methodology minimizes idle times by preferring recurring and short operations.

d) Maximum Throughput First: This methodology runs the operation with the highest throughput waiting for a specific HRE. This maximizes overall data throughput but can harm responsiveness.

e) Minimal Delay First: This methodology runs the operation with the lowest delay waiting for a specific HRE. This maximizes responsiveness but can harm data throughput.

f) Weighted Score: All the above methods for scheduling and allocation focus on one criterion while completely ignoring the influence of the other factors on the performance of the system. With the exception of First Come First Serve and First Come First Free, they also suffer from the chance that operations starve and never get processed. The solution used here is to compute a score for each operation as a weighted sum of all criteria, including a waiting time. The formula for the score is

P = K_C * Σ_{i=0}^{3} (ConfigurationSpeed_i / BitstreamSize_i) + K_T * Throughput + K_D * (1 / Delay) + K_W * WaitingTime

The weighting factors K_C, K_T, K_D and K_W are adjustable and allow tailoring of the scheduling to the specific needs of an application. The Σ sum is necessary to consider the different characteristics of the different levels of the configuration memory hierarchy. The index 0 refers to the configuration exchange buffer (CEB) and 1 refers to the on-chip configuration memory. The external configuration memory is divided between SRAM (index 2) and flash memory (index 3). The equation can easily be extended for other storages such as hard disk drives or network storage. The values of BitstreamSize_0 and BitstreamSize_1 depend on the current prefetching state and can be determined by polling the PCM. Throughput and delay are two dimensions of performance measurement. WaitingTime is increased every time an operation is not scheduled to execute and is used to prevent starvation.

5) Configuration Call Graphs: The dynamic allocation is improved by using foresight of coming operations. The runtime system uses for this purpose the configuration call graphs provided by the Molen compiler. The compiler provides one CCG per thread, with an example shown in figure 4. It contains the sequence, including branches and loops, of the operations and their configurations. During runtime the scheduler traces the running of the different application threads through their configuration call graphs to have a global estimate about which heterogeneous reconfigurable engine is going to be available in the near future and which operations are next pending. This probability information is communicated to the predictive configuration manager, which uses it for prefetching configuration bitstreams from external memory to the on-chip configuration memory, resulting in a significant reduction of the reconfiguration overhead. The PCM feeds back information about the prefetching state of the bitstreams to the runtime system. The scheduler uses the prefetching state for allocation decisions, e.g. it can favour a slower implementation which is already in the on-chip configuration memory over a faster implementation which first has to be loaded from external memory.

Fig. 4. Configuration Call Graph

6) DMA / DNA: The DMA and DNA provide communication mechanisms between the various HREs and memory and between HREs for data transfers. The dynamic allocation of operations can make it necessary to change the source or destination address of data transfers, which makes it essential to handle the communication in the operating system. It provides an API with a unified interface for transferring data with the NoC or the DMA for application programmers or upstream tools like SPEAR. The linking of a transfer to an operation allows the transfer to be migrated automatically to the new HRE.

VI. SCENARIO AND EXAMPLE

It is planned to demonstrate the potential of ISRC with an implementation of the Ogg Vorbis [25] audio codec. The base implementation and reference benchmark are provided by the integer-only Tremor implementation of said codec. The data flow of the decoder can be seen in figure 5.

Fig. 5. Ogg Vorbis data flow

Profiling has shown that the inverse modified cosine transformation (IMDCT) is the most computation-intensive kernel of the decoding algorithm. Implementations for FPGA [26], XPP [27] and PiCoGA are already done, which leaves the implementation of the other kernels and the adaptation for the MORPHEUS toolchain to be done.

VII. CONCLUSION AND FUTURE WORK

In this paper we introduced the intelligent services for reconfigurable computing as a runtime system for heterogeneous reconfigurable systems. We explained how the system calls for controlling the HREs relate to the Molen compiler. Finally we presented a selection of algorithms available for dynamic scheduling and allocation and showed the advantages of the Weighted Score algorithm over the other algorithms.
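The Weighted Score computation summarized in section V can be illustrated with a small sketch. The structure and names below are our own assumptions, not the MORPHEUS code; the function simply evaluates P = K_C * Σ(ConfigurationSpeed_i / BitstreamSize_i) + K_T * Throughput + K_D * (1 / Delay) + K_W * WaitingTime over the four configuration memory levels (CEB, on-chip configuration memory, external SRAM, flash).

```c
#include <assert.h>

/* Hypothetical metrics for one pending operation; index 0..3 walks the
 * configuration memory hierarchy: 0 = CEB, 1 = on-chip configuration
 * memory, 2 = external SRAM, 3 = flash. */
#define MEM_LEVELS 4

struct op_metrics {
    double config_speed[MEM_LEVELS];   /* transfer speed from each level  */
    double bitstream_size[MEM_LEVELS]; /* bitstream bytes held at level   */
    double throughput;                 /* throughput figure of the op     */
    double delay;                      /* delay figure of the op          */
    double waiting_time;               /* rounds spent without execution  */
};

/* P = K_C * sum_i(ConfigurationSpeed_i / BitstreamSize_i)
 *   + K_T * Throughput + K_D * (1 / Delay) + K_W * WaitingTime */
double weighted_score(const struct op_metrics *m,
                      double kc, double kt, double kd, double kw)
{
    double cfg = 0.0;
    for (int i = 0; i < MEM_LEVELS; i++)
        if (m->bitstream_size[i] > 0.0)   /* skip levels not holding it */
            cfg += m->config_speed[i] / m->bitstream_size[i];
    return kc * cfg + kt * m->throughput + kd / m->delay
         + kw * m->waiting_time;
}
```

The scheduler would evaluate this score for every pending operation each round and run the highest-scoring one; the waiting-time term grows for passed-over operations, which is what prevents starvation.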



The next steps are the completion of the application example and the benchmarking of this application. Additionally, the benchmark base could be extended with a broader selection of applications. It would also be interesting to show the versatility of ISRC by porting it to other reconfigurable architectures or RTOS kernels.

ACKNOWLEDGMENT

The authors would like to thank the European Union for their support of this work in the scope of the FP6 project MORPHEUS [28].

REFERENCES

[1] C. Chang, J. Wawrzynek, and R. Brodersen, "BEE2: A high-end reconfigurable computing system," IEEE Design & Test, pp. 114–125, 2005.
[2] J. Angermaier, D. Göhringer, M. Majer, J. Teich, S. Fekete, and J. van der Veen, "The Erlangen slot machine — a platform for interdisciplinary research in dynamically reconfigurable computing," it – Information Technology, vol. 49, pp. 143–149, 2007.
[3] F. Thoma, M. Kuhnle, P. Bonnot, E. M. Panainte, K. Bertels, S. Goller, A. Schneider, S. Guyetant, E. Schuler, K. D. Muller-Glaser, and J. Becker, "Morpheus: Heterogeneous reconfigurable computing," in Field Programmable Logic and Applications, 2007. FPL 2007. International Conference on, pp. 409–414, 27-29 Aug. 2007.
[4] H. Walder and M. Platzner, "Reconfigurable hardware operating systems: From design concepts to realizations," in Proceedings of the 3rd International Conference on Engineering of Reconfigurable Systems and Architectures (ERSA), 2003, pp. 284–287.
[5] C. Steiger, H. Walder, M. Platzner et al., "Operating systems for reconfigurable embedded platforms: Online scheduling of real-time tasks," IEEE Transactions on Computers, vol. 53, no. 11, pp. 1393–1407, 2004.
[6] H. So and R. Brodersen, "Improving usability of FPGA-based reconfigurable computers through operating system support," in Field Programmable Logic and Applications, 2006. FPL '06. International Conference on, 2006.
[7] W. Peck, E. Anderson, J. Agron, J. Stevens, F. Baijot, and D. Andrews, "Hthreads: A computational model for reconfigurable devices," in Field Programmable Logic and Applications, 2006. FPL '06. International Conference on, 2006, pp. 1–4.
[8] J. Agron, W. Peck, E. Anderson, D. Andrews, R. Sass, F. Baijot, and J. Stevens, "Run-time services for hybrid CPU/FPGA systems on chip," in Proceedings of the 27th IEEE International Real-Time Systems Symposium, Rio de Janeiro, 2006.
[9] J. Agron and D. Andrews, "Building heterogeneous reconfigurable systems using threads," in Proceedings of the 19th International Conference on Field Programmable Logic and Applications (FPL), 2009.
[10] M. Maheswaran and H. Siegel, "A dynamic matching and scheduling algorithm for heterogeneous computing systems," in Proceedings of the Seventh Heterogeneous Computing Workshop, p. 57, 1998.
[11] H. Siegel and S. Ali, "Techniques for mapping tasks to machines in heterogeneous computing systems," Journal of Systems Architecture, vol. 46, no. 8, pp. 627–639, 2000.
[12] T. Braun, H. Siegel, N. Beck, L. Boloni, M. Maheswaran, A. Reuther, J. Robertson, M. Theys, B. Yao, D. Hensgen et al., "A comparison study of static mapping heuristics for a class of meta-tasks on heterogeneous computing systems," in Proceedings of the 8th Heterogeneous Computing Workshop (HCW'99), 1999, pp. 15–29.
[13] P. Saha and T. El-Ghazawi, "Extending embedded computing scheduling algorithms for reconfigurable computing systems," in Programmable Logic, 2007. SPL '07. 3rd Southern Conference on, 2007, pp. 87–92.
[14] E. Lübbers and M. Platzner, "ReconOS: An RTOS supporting hard- and software threads," in 17th International Conference on Field Programmable Logic and Applications (FPL), Amsterdam, Netherlands, 2007.
[15] D. Rossi, F. Campi, A. Deledda, S. Spolzino, and S. Pucillo, "A heterogeneous digital signal processor implementation for dynamically reconfigurable computing," in Custom Integrated Circuits Conference, 2009. CICC '09. IEEE, Sept. 2009, pp. 641–644.
[16] M. Kuehnle, M. Huebner, J. Becker, A. Deledda, C. Mucci, F. Ries, M. Coppola, L. Pieralisi, R. Locatelli, G. Maruccia, T. DeMarco, and F. Campi, "An interconnect strategy for a heterogeneous, reconfigurable SoC," IEEE Design & Test of Computers, vol. 25, no. 5, pp. 442–451, Sept.-Oct. 2008.
[17] M. Coppola, R. Locatelli, G. Maruccia, L. Pieralisi, and A. Scandurra, "Spidergon: a novel on-chip communication network," in System-on-Chip, 2004. Proceedings. 2004 International Symposium on, p. 15, 2004.
[18] S. Chevobbe and S. Guyetant, "Reducing reconfiguration overheads in heterogeneous multi-core RSoCs with predictive configuration management," in ReCoSoC 2008, Barcelona, 2008.
[19] S. Vassiliadis, S. Wong, G. Gaydadjiev, K. Bertels, G. Kuzmanov, and E. Panainte, "The MOLEN polymorphic processor," IEEE Transactions on Computers, vol. 53, no. 11, pp. 1363–1375, 2004.
[20] B. Pottier, T. Goubier, and J. Boukhobza, "An integrated platform for heterogeneous reconfigurable computing (invited paper)," in Proc. ERSA '07, Jun. 2007.
[21] A. Demeure and Y. Del Gallo, "An array approach for signal processing design," in Sophia-Antipolis Conference on Micro-Electronics (SAME), France, October 1998.
[22] E. Lenormand and G. Edelin, "An industrial perspective: Pragmatic high end signal processing design environment at Thales," in Proceedings of the 3rd International Samos Workshop on Synthesis, Architectures, Modelling, and Simulation, 2003.
[23] http://www.criticalblue.com.
[24] L. Lagadec, B. Pottier, and O. Villellas-Guillen, An LUT-Based High Level Synthesis Framework for Reconfigurable Architectures. Dekker, Nov. 2003, pp. 19–39.
[25] http://www.vorbis.com.
[26] R. Koenig, A. Thomas, M. Kuehnle, J. Becker, E. Crocoll, and M. Siegel, "A mixed-signal system-on-chip audio decoder design for education," RCEducation 2007, Porto Alegre, Brazil, 2007.
[27] R. Koenig, T. Stripf, and J. Becker, "A novel recursive algorithm for bit-efficient realization of arbitrary length inverse modified cosine transforms," in Design, Automation and Test in Europe, 2008. DATE '08, March 2008, pp. 604–609.
[28] http://www.morpheus-ist.org.

A Self-Checking HW Journal for a Fault Tolerant
Processor Architecture
Mohsin AMIN, Camille DIOU, Fabrice MONTEIRO, Abbas RAMAZANI, Abbas DANDACHE
LICM Laboratory, University Paul Verlaine – Metz, 7 rue Marconi, 57070-Metz, France
{amin, diou, ramazani.a}@univ-metz.fr, {fabrice.monteiro, abbas.dandache}@ieee.org

Abstract—In this paper we present a Dependable Journal that filters errors before they enter the main memory. It provisionally stores the data before sending it to the main memory. Possible errors provoked inside the Journal are detected and corrected using Hamming codes, while errors produced in the processor are detected and corrected using a Concurrent Checking Mechanism (CCM) and a Rollback Mechanism, respectively. In case of error detection in the processor, it rolls back and re-executes from the last sure state. In this way only Validated Data (VD) is written to the main memory.
A VHDL-RTL model of the journalized processor has been developed, from which it is evident that the depth of the journal has an important impact on the overall area of the processor. So, the depth of the journal has been varied to find the optimal processor area vs. depth of the journal.

Figure 1. Precise scope of the proposed methodology

I. INTRODUCTION


With the ever-decreasing trend of device sizes, devices are becoming more sensitive to the effects of radiation [1]. Soft errors, produced due to single-event strikes, are the most commonly occurring errors in both ASIC and FPGA implementations.

Error Detection and Correction (EDC) has been used to overcome the problem of Single Event Upsets (SEUs) caused by external radiation in [2]. In the near future, the small-size and high-frequency trend will further increase the failure rate in modern systems [3], because we have reached the performance bottleneck. By changing such parameters we cannot further increase the performance.

That is why we need to develop architectures having reasonably good performance. Moreover, they should be dependable so that the processed data can be trusted. In this context a Dependable Multi-Processor System on Chip (MPSoC) seems to be a convincing approach. We are in the phase of developing an MPSoC while keeping in mind the future additional design constraints of power consumption and time-to-market [4].

To do so we have chosen the canonical stack processor [5], which has been made fault tolerant by adding some extra HW in [6], respecting the constraints of a dependable system in [7], [8]. The reasons for choosing and designing a stack processor can be found in the previous work [9]. This work is an extension of our previous work presented in [9] at ReCoSoC'07, in which we presented the stack processor development methodology for dependable applications. In this paper we present the Journaling Mechanism employed in the stack processor to make it fault tolerant. The paper focuses on: why do we need Journaling? And how does the dependable journal work?

As a first step, we have developed an architecture of a dependable stack processor based on HW Journaling as a core processor for an MPSoC. This paper focuses on the Journalizing Mechanism employed to make the processor dependable, as shown in Fig. 1.

We have an Error Detecting Fault Processor working on a Rollback Mechanism [10], [11]. For better understanding, the reader is invited to consult the previous work presented in [12].

This paper is divided into five sections; in section two we will highlight the need for Journaling, the Journal



Data_out_mem
At ‘VPn-1’ At ‘VPn’
addr_out_mem
No-error detected Error(s) detected
Addr_rd (Data Validated) Rollback to VPn-1
(Data Un-Validated)
Data_out

Error injected
Addr_wr MISS
Dependable
Journal Store Instruction(s) Execution in current Restore
wr_to_mem SD
SEs SEs
Data_in
VP n
‘e’
VP n-1

Last SD Sequence Duration (SD) Next SD


RESET

MODE rd wr
Note: VP is Validation Point
SE is State-determining Element(s) of the Processor

Figure 3. Configuration of Journal

Figure 4. If processor detects error(s) it Rollbacks; else data


Management will be discussed in section three. The WRITTEN in Journal during current SD is Validated at VP
section four contains the results. The conclusions have
been discussed in section five.
II. WHY JOURNALING

We suppose that a Dependable Main Memory is attached to our system, meaning that no error can be generated inside it; every effort should instead be made to filter external errors before they enter the Dependable Main Memory. That is why we temporarily store the data (until validation) in the Journal before WRITING it into the Dependable Main Memory.

In this regard, we propose a Dependable Journal to provide a temporary storage location for effective rollback. It has a fault detection and correction mechanism to overcome its own internal temporary faults.

The technique has a small area overhead for the overall architecture, but the great advantage that the main memory always remains sure. There is no time overhead or MISS in the Journal, as the processor always checks data simultaneously in the Journal and the main memory; if data is found in both, the data from the Journal is preferred, as it is the more recently written.

III. JOURNAL MANAGEMENT

A VHDL-RTL model of the journalized processor has been implemented in Quartus II on a Stratix II family FPGA. The architectural details of the stack processor have been discussed in our previous work [9]. The overview of the dependable stack processor is shown in Fig. 2. The dotted square surrounding the three blocks represents the Dependable Journal. The Dependable Journal has been sub-divided into three components for the ease of the reader; in the actual architecture they constitute a single block, as shown in Fig. 3, which furthermore elaborates the IN/OUT-port configuration.

The Journal internally has two parts: one containing the Validated data (no error detected at the VP) and a second containing the data written in the present SD (Un-Validated data), as discussed in [12]. The Journal has an internal mechanism to differentiate between Validated and Un-Validated data.

Globally, all the data to be written to the main memory passes through the Journal until validated. The validation process in the processor is based on a Rollback Mechanism. The Validation Point (VP) occurs at the end of each Sequence Duration (SD), as shown in Fig. 4. The fundamental architecture of the Journal, SD and VP is discussed in [12].

There are three possible data-flows from/to the Journal.

WRITE to Journal:
The Processor can WRITE data directly into the Journal and not into the Dependable Main Memory. The data stays temporarily in the Journal until validated. The Processor can detect an internal error during the current Sequence Duration (SD). If no error is detected at the end of the SD, the data is validated and can then be transferred to the Main Memory.

Journal to Dependable Main Memory:
Only Validated data is transferred to the main memory. As the data stays temporarily in the Journal before being transferred, and since internal and external temporary and intermittent faults can provoke errors (permanent faults are not addressed in this



[Figure 2. Block diagram of the overall architecture: the error-detecting processor core READs and WRITEs through the Dependable Journal (Un-Validated and Validated data areas with Error Detection and Correction), which in turn READs from and WRITEs to the Dependable Main Memory.]
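To make the journal's role concrete, the commit/rollback behaviour described in this section can be sketched as a small software model. This is an illustration only, not the VHDL design; the names `Journal`, `validation_point` and the address/data types are ours.

```cpp
#include <cstdint>
#include <map>

// Simplified software model of the dependable journal:
// writes land in an un-validated area; at a validation point (VP)
// they are either committed towards main memory or discarded (rollback).
struct Journal {
    std::map<uint32_t, uint32_t> validated;    // data validated at earlier VPs
    std::map<uint32_t, uint32_t> unvalidated;  // data written during current SD

    void write(uint32_t addr, uint32_t data) { unvalidated[addr] = data; }

    // READ checks the journal first; the more recent (un-validated) copy
    // wins, mirroring the "journal preferred over main memory" rule above.
    bool read(uint32_t addr, uint32_t& out) const {
        auto it = unvalidated.find(addr);
        if (it != unvalidated.end()) { out = it->second; return true; }
        it = validated.find(addr);
        if (it != validated.end()) { out = it->second; return true; }
        return false;  // would fall through to the main memory
    }

    // At the end of a sequence duration: commit if no error, else roll back.
    void validation_point(bool error_detected) {
        if (error_detected) {
            unvalidated.clear();  // rollback to VPn-1: discard current SD
        } else {
            for (auto& kv : unvalidated) validated[kv.first] = kv.second;
            unvalidated.clear();  // now eligible for transfer to main memory
        }
    }
};
```

A write followed by an erroneous VP leaves the previously validated data untouched, which is exactly the property that keeps the main memory "sure".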

work.). Hence this data is checked for errors. There is a mechanism based on Hamming codes [13], [14] for the detection and correction of error(s). If an error is detected in the data, it is corrected and sent to the main memory. If a non-correctable error is detected, a RESET request is sent to the processor, as shown in the flow diagram in Fig. 5.

[Figure 6. Effective area utilization on the FPGA.]
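The Hamming-code mechanism referred to above can be illustrated with the classic Hamming(7,4) code. This is a sketch of the idea only: the journal's real word width differs, and the extra overall-parity bit that turns single-error correction into the double-detect/single-correct scheme of the paper is omitted here.

```cpp
#include <cstdint>

// Hamming(7,4): encode a 4-bit value into 7 bits (positions 1..7,
// parity bits at positions 1, 2 and 4). A single-bit error is
// located and corrected via the syndrome.
uint8_t hamming74_encode(uint8_t data) {
    uint8_t d0 = (data >> 0) & 1, d1 = (data >> 1) & 1,
            d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
    uint8_t code = 0;
    // data bits sit at positions 3, 5, 6, 7
    code |= d0 << 2; code |= d1 << 4; code |= d2 << 5; code |= d3 << 6;
    uint8_t p1 = d0 ^ d1 ^ d3;  // covers positions 3, 5, 7
    uint8_t p2 = d0 ^ d2 ^ d3;  // covers positions 3, 6, 7
    uint8_t p4 = d1 ^ d2 ^ d3;  // covers positions 5, 6, 7
    code |= p1 << 0; code |= p2 << 1; code |= p4 << 3;
    return code;
}

// Returns the (corrected) 4-bit data; a non-zero syndrome names the
// erroneous bit position directly, which is then flipped back.
uint8_t hamming74_decode(uint8_t code) {
    auto bit = [&](int pos) { return (code >> (pos - 1)) & 1; };
    int s1 = bit(1) ^ bit(3) ^ bit(5) ^ bit(7);
    int s2 = bit(2) ^ bit(3) ^ bit(6) ^ bit(7);
    int s4 = bit(4) ^ bit(5) ^ bit(6) ^ bit(7);
    int syndrome = s1 | (s2 << 1) | (s4 << 2);
    if (syndrome != 0) code ^= (uint8_t)(1 << (syndrome - 1));
    return (uint8_t)(((code >> 2) & 1) | (((code >> 4) & 1) << 1) |
                     (((code >> 5) & 1) << 2) | (((code >> 6) & 1) << 3));
}
```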
READ from the Journal:
The last possibility occurs when the Processor wants to READ data from the Journal. As shown in Fig. 2, the required data is checked for errors before being sent to the processor. If an error is detected, the Rollback Mechanism is activated: the processor re-executes the sequence of instructions from the last sure state. Otherwise the data is sent to the Processor.

IV. RESULTS

The VHDL-RTL model of the journalized processor has been implemented in Altera Quartus-II on a Stratix-II family FPGA. The architectural details of the processor, the error modeling and its performance in the presence of errors have been calculated for the overall architecture. Among them, the area utilization of the architecture is discussed here.

In this paper we focus on: why do we need the Journal? How does it work? And what is the effect of the Journal depth on the overall area of the processor? From the implementation in Fig. 6, we have found that the size of the Journal limits the overall area of the processor; the Journal occupies around 50% of the total area.

Moreover, we are designing this processor core to be integrated within an MPSoC containing multiple cores, so a small area is preferred.

Accordingly, we have varied the depth of the Journal and observed its impact on the area of the Processor



[Figure 5. Dependable Journal Management: flow diagram. On a READ from the Journal, the data is checked; if an error is detected the Rollback Mechanism is activated, otherwise the data goes to the Processor. For Validated data moving towards the Main Memory, a detected error is corrected if corrigible; a non-corrigible error triggers a RESET.]

in Fig. 7. By increasing the depth from 16 blocks to 64 blocks, the area of the Processor increases exponentially, as shown in the graph. From [12] we have observed that for a small SD a small Journal depth is feasible, which means a small overall area of the Processor.

V. CONCLUSIONS

The presence of the Journal facilitates the rollback mechanism on the one hand and, on the other, filters all possible errors from entering the main memory. As data temporarily resides in the Journal until validation, the Hamming-code-based error detection and correction provides effective double-error detection and single-error correction. The effective area is quite small, which favors our processor becoming the active core of the Dependable MPSoC. The depth of the Journal is a limiting factor for the overall area of the Journalized Dependable Processor.

REFERENCES

[1] L. Lantz, "Soft errors induced by alpha particles," IEEE Transactions on Reliability, vol. 45, pp. 174-179, June 1996.
[2] J. A. Fifield and C. H. Stapper, "High-speed on-chip ECC for synergistic fault-tolerant memory chips," IEEE Journal of Solid-State Circuits, vol. 26, no. 10, pp. 1449-1452, Oct. 1991.
[3] D. G. Mavis and P. H. Eaton, "Soft error rate mitigation techniques for modern microcircuits," Proc. Intl. Reliability Physics Symposium, pp. 216-225, 2002.
[4] A. A. Jerraya, A. Bouchhima and F. Petrot, "Programming models and HW-SW interfaces abstraction for Multi-Processor SoC," in Proc. ACM Design Automation Conference (DAC'2006), California, USA, 2006.
[5] P. J. Koopman, "Stack Computers: The New Wave", Mountain View Press, 1989.
[6] A. Ramazani, M. Amin, F. Monteiro, C. Diou and A. Dandache, "A fault tolerant journalized stack processor architecture," 15th IEEE International On-Line Testing Symposium (IOLTS'09), Sesimbra-Lisbon, Portugal, June 24-27, 2009.
[7] A. Avizienis, J.C. Laprie, B. Randell and C. Landwehr, "Basic concepts and taxonomy of dependable and secure computing," IEEE Transactions on Dependable and Secure Computing, vol. 1, no. 1, pp. 11-33, January-March 2004.



[Figure 7. Effect of Journal depth on the overall area of the journalized dependable processor: the area grows steeply as the depth of the Dependable Journal increases over 16, 24, 32, 40 and 64 blocks.]

[8] D.K. Pradhan, "Fault-Tolerant Computer System Design", Prentice Hall, 1996.
[9] M. Jallouli, C. Diou, F. Monteiro, A. Dandache, “Stack
processor architecture and development methods suitable for
Dependable Applications,” Reconfigurable Communication-
centric SoCs (ReCoSoC’07), Montpellier, France, June 18 -
20, 2007.
[10] D.B. Hunt and P.N. Marinos, “A General Purpose Cache-
Aided Rollback Error Recovery (CARER) Technique,” In
Proceedings of 17th Annual Symposium on Fault-Tolerant
Computing, pp. 170–175, 1987.
[11] N.S. Bowen, D.K. Pradhan,“Virtual checkpoints: architecture
and performance,” IEEE Transactions on Computers, vol. 41,
issue no. 5, pp. 516–525, May 1992.
[12] M. Amin, F. Monteiro, C. Diou, A. Ramazani, A. Dandache,
“A HW/SW Mixed Mechanism to Improve the Dependability
of a Stack Processor”, 16th IEEE International Conference
on Electronics, Circuits, and Systems, ICECS’09, Hammamet,
Tunisia, December 13-16, 2009.
[13] J.F. Wakerly, “Error Detection Codes, Self-Checking Circuits
and Applications”, North Holland, 1978.
[14] R.W. Hamming, “Error Detecting and Error Correcting
Codes”, Bell System Tech. Jour., vol. 29, pp. 147-160, 1950.



A Task-aware Middleware for Fault-tolerance and
Adaptivity of Kahn Process Networks on
Network-on-Chip
Onur Derin, Erkan Diken
ALaRI, Faculty of Informatics
University of Lugano
Lugano, Switzerland
derino@alari.ch, dikene@usi.ch

Abstract—We propose a task-aware middleware concept and provide details for its implementation on Network-on-Chip (NoC). We also list our ideas on the development of a simulation platform as an initial step towards creating fault-tolerance strategies for Kahn Process Networks (KPN) applications running on NoCs. In doing so, we extend our SACRE (Self-adaptive Component Run-time Environment) framework by integrating it with an open source NoC simulator, Noxim. We also hope that this work may help in identifying the requirements for and implementing fault tolerance and adaptivity support on real platforms.

Index Terms—fault tolerance; KPN; middleware; NoC; self-adaptivity;

I. INTRODUCTION

The past decade has witnessed a change in the design of powerful processors. It has been realized that running processors at ever higher frequencies is not sustainable due to disproportionate increases in power consumption. This led to the design of multi-core chips, usually consisting of symmetric multi-processor Systems-on-Chip (MPSoCs) with limited numbers of CPU-L1 cache nodes interconnected by simple bus connections, and capable in turn of becoming nodes in larger multiprocessors. However, as the number of components in these systems increases, communication becomes a bottleneck and hinders the predictability of the metrics of the final system. Networks-on-Chip (NoCs) [1] emerged as a new communication paradigm to address the scalability issues of MPSoCs. Still, achieving goals such as easy parallel programming, good load balancing, ultimate performance, dependability and low power consumption poses new challenges for such architectures.

In addressing these issues, we adopted a component-based approach based on Kahn Process Networks (KPN) for specifying the applications [2]. KPN is a stream-oriented model of computation based on the idea of organizing an application into streams and computational blocks; streams represent the flow of data, while computational blocks represent operations on a stream of data. KPN presents itself as an acceptable trade-off point between abstraction level and efficiency on one hand and flexibility and generality on the other. It is capable of representing many signal and media processing applications, which occupy the largest share of the consumer electronics market.

Our eventual goal is to run a KPN application directly on a NoC platform with self-adaptivity and fault-tolerance features. This requires us to implement a KPN run-time environment that runs on the NoC platform and supports adaptation and fault-tolerance mechanisms for KPN applications on such platforms. We propose a self-adaptive run-time environment (RTE) that is distributed among the tiles of the NoC platform. It consists of a middleware that provides standard interfaces to the application components, allowing them to communicate without knowing about the particularities of the network interface. Moreover, the distributed run-time environment manages the adaptation of the application for high-level goals such as fault tolerance, high performance and low power consumption by migrating the application components between the available resources and/or increasing the parallelism of the application by instantiating multiple copies of the same component on different resources [3], [4]. Such a self-adaptive RTE constitutes a fundamental part in enabling system-wide self-adaptivity and continuity-of-service support [5].

In view of the goal stated above, we propose to use our SACRE framework [4], which allows creating self-adaptive KPN applications. In [3], we listed platform-level adaptations and proposed a middleware-based solution to support such adaptations. In the present paper, we define the details of the self-adaptive middleware particularly for NoC platforms. In doing so, we choose to integrate SACRE with the Noxim NoC simulator [6] in order to realize functional simulations of KPN applications on NoC platforms. An important issue regarding the NoC platform is the choice of the communication model. Depending on the NoC platform, we may have a shared memory space with the Non-Uniform Memory Access (NUMA) model, or we may rely on pure message passing with the No Remote Memory Access (NORMA) model [7]. Implementing KPN semantics on NORMA presents itself as the main challenge; in the NUMA case it is straightforward as long as the platform provides some synchronization primitives. Section III explains the details of the middleware in the NORMA case. Section IV presents our ongoing effort to integrate SACRE and Noxim. Sections V and VI list the requirements for the implementation of fault tolerance and adaptivity mechanisms.
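To make the KPN model concrete, here is a toy two-block pipeline in the spirit described above: a producer stream feeds a computational block that doubles each token, and a consumer collects the output stream. The names and the single-threaded "fire while input is available" scheduling are ours for illustration; this is not SACRE's API.

```cpp
#include <queue>
#include <vector>

// Toy KPN: producer -> "times two" block -> consumer.
// Connectors are FIFO queues carrying integer tokens.
std::vector<int> run_toy_kpn(int n_tokens) {
    std::queue<int> producer_to_double, double_to_consumer;
    std::vector<int> consumed;
    for (int i = 1; i <= n_tokens; ++i)          // producer fires
        producer_to_double.push(i);
    while (!producer_to_double.empty()) {        // computational block fires
        int t = producer_to_double.front(); producer_to_double.pop();
        double_to_consumer.push(2 * t);
    }
    while (!double_to_consumer.empty()) {        // consumer fires
        consumed.push_back(double_to_consumer.front());
        double_to_consumer.pop();
    }
    return consumed;
}
```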



II. RELATED WORK

[Fig. 1: A simple KPN application with application components A, B, C, D, E, F, connected through output ports, connectors and input ports.]

The trend from single-core to many-core design has forced designers to consider inter-processor communication issues for passing data between the cores. One message-passing communication API that has emerged is the Multicore Association's Communication API (MCAPI) [8], which targets inter-core communication in a multicore chip. MCAPI is a light-weight (low communication latency and small memory footprint) counterpart of message-passing interface APIs such as Open MPI [9].

However, the communication primitives available with these message-passing libraries do not support the blocking write operation required by KPN semantics. The main features needed to implement KPN semantics are blocking read and, in the limited-memory case, blocking write; the key challenge is the implementation of the blocking write feature. There are different approaches addressing this issue. In [10], a programming model is proposed based on the MPI communication primitives MPI_send() and MPI_receive(). MPI_receive blocks the task until the data is available, while MPI_send blocks until the buffer is available on the sender side. The blocking write feature is implemented via operating-system communication primitives that ensure the remote processor buffer has enough space before sending the message. Another approach is presented in [11], where a network end-to-end control policy is proposed to implement the blocking write feature of the FIFO queues.

Our approach is based on a novel implementation of KPN semantics on NoC platforms. We propose an active middleware layer that implements the blocking write feature through virtual channels that are introduced in the opposite direction to the original ones.

III. TASK-AWARE MIDDLEWARE

A KPN application consists of a set of parallel running tasks (application components) connected through non-blocking-write, blocking-read unbounded FIFO queues (connectors) that carry tokens between input and output ports of application components. A token is a typed message. Figure 1 shows a simple KPN application. Running a KPN application on a NoC platform requires mapping the application components onto the tiles of the NoC platform.

A. Requirements

When deciding on the implementation details of a middleware that will support running a KPN application on a NoC platform, we came up with several requirements. The most fundamental requirement for a middleware to support KPN semantics is the ability to transfer tokens among tiles while assuring blocking read. Since unbounded FIFOs cannot be realized on real platforms, the FIFOs have to be bounded. Parks' algorithm [12] provides bounds on the queue sizes through static analysis of the application. In the case of bounded queues, blocking write is also required to be supported.

Another requirement is that we would like to have platform-independent application components. This makes it easier to program for the platform by allowing the development of application components in isolation and running them without modifications. This can be achieved by separating the KPN library that is used to program the application from the communication primitives of the platform. The middleware then links the KPN library to the platform-specific communication issues.

In line with the above requirement, we would like the application components not to be aware of the mapping of components on the platform. They should only be concerned with communicating tokens to certain connectors. Therefore the middleware should enable mapping-independent token routing. These requirements are of great importance if we want to achieve fault tolerance and adaptivity of KPN applications on NoC platforms in a way that assures separation of concerns; it is then the platform that provides the fault-tolerance and adaptivity features to the application, not the application developer.

B. Middleware implementation in the NORMA case

In the NORMA model, tasks only have access to the local memory and there is no shared address space. Therefore tasks on different tiles have to pass tokens among each other via message-passing routines supported by the network interface (NI).

In order to address the middleware requirements listed previously, our key argument is the implementation of an active middleware layer that appears as a KPN task and gets connected to other application tasks with the connectors of the specific KPN library adopted by the application components. By contrast, a passive middleware layer would be a library with a platform-specific API for tasks to receive and send tokens.

We build our middleware on top of the MPI_recv() and MPI_send() primitives. These methods allow sending/receiving data to/from a specific process regardless of which tile the process resides on. MPI_recv blocks the process until the message is received from the specified process-tag pair. MPI_send is non-blocking unless there is another process on the same tile that has already issued an MPI_send.

Every tile has a middleware layer that consists of middleware sender and receiver processes. Figure 2 shows the middleware layers and a possible mapping of the example pipeline on four tiles of a 2x2 mesh NoC platform. There is a sender process for each outgoing connector. An outgoing connector is one that is connected to an input port of an application component that resides on a different tile.
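The bounded, blocking connector that the requirements above call for can be sketched with standard C++ threading primitives. This is an illustrative model only; SACRE's actual connector classes may differ.

```cpp
#include <condition_variable>
#include <cstddef>
#include <mutex>
#include <queue>

// Bounded FIFO with blocking read and blocking write: the channel
// semantics needed once unbounded KPN queues are given a bound.
template <typename T>
class BoundedFifo {
    std::queue<T> q_;
    std::size_t bound_;
    std::mutex m_;
    std::condition_variable not_full_, not_empty_;
public:
    explicit BoundedFifo(std::size_t bound) : bound_(bound) {}

    void write(const T& token) {              // blocks while the queue is full
        std::unique_lock<std::mutex> lk(m_);
        not_full_.wait(lk, [&] { return q_.size() < bound_; });
        q_.push(token);
        not_empty_.notify_one();
    }

    T read() {                                // blocks while the queue is empty
        std::unique_lock<std::mutex> lk(m_);
        not_empty_.wait(lk, [&] { return !q_.empty(); });
        T token = q_.front(); q_.pop();
        not_full_.notify_one();
        return token;
    }
};
```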



[Fig. 2: KPN example mapped on a 2x2 mesh NoC platform. Each tile (TILE 0-3) contains a PE running application components (A, B on tile 0; C, D on tile 1; E on tile 2; F on tile 3) together with middleware (MW) sender/receiver processes, attached through a network interface (NI) to the routers (R) of the NoC.]

[Fig. 3: Middleware details for the connector between B and C. B's MW sender on PE 0 and C's MW receiver on PE 1 exchange tokens over the NoC via MPI_send/MPI_receive, with a virtual channel in the reverse direction; the NI marks the hardware/software boundary.]
Similarly, there is a receiver process for each incoming connector. These processes are actually KPN tasks with a single port: an input port for a sender process and an output port for a receiver process. The job of a sender middleware task is to transmit the tokens from its input over the network to the receiver middleware task on the corresponding tile (i.e. the tile containing the application component that is to receive the token). Correspondingly, a receiver middleware task receives tokens from the network and puts them into the corresponding queue. Figure 3 shows the sender and receiver middleware tasks between the ports of application components B and C.

We need to implement a blocking-write, blocking-read bounded channel that has its source in one processor and its sink in another. MPI_send as described above does not implement a blocking write operation. It could be modified to check whether the remote queue is full by using low-level support from the platform [10], [11]. In order to do this in a way that does not require changes to the platform, we instead make use of the virtual channel concept. A virtual channel is a queue that is connected in the reverse direction to the original channel. For every channel between sender and receiver middleware tasks, we add a virtual channel that connects the receiver middleware task back to the sender middleware task. Figure 3 shows the virtual channel along with the sender and receiver middleware tasks for the outgoing connector from application component B to C. The receiver task initially puts as many tokens into the virtual channel as the predetermined bound of the original channel. The sender has to read a token from the virtual channel before every write to the original channel; likewise, the receiver has to write a token to the virtual channel after every read from the original channel. Effectively, the virtual channel ensures that the sender never gets blocked on a write. The read/write operations from/to the original and virtual channels can thus be done using MPI_send and MPI_receive, as there is no more need for blocking write in the presence of virtual channels. Figures 4 and 5 show the pseudocode for the sender and receiver middleware tasks, respectively.

Fig. 4: Sender middleware task per outgoing connector (t: data token, t_vc: dummy token for the virtual channel, P_R: process identifier of the remote middleware task, vc: virtual channel tag, oc: original channel tag):

    loop
        t ← input_port.read()
        MPI_recv(t_vc, P_R, vc)
        MPI_send(t, P_R, oc)
    end loop

Fig. 5: Receiver middleware task per incoming connector:

    for i = 1 to channel_bound do
        MPI_send(t_vc, P_R, vc)
    end for
    loop
        MPI_recv(t, P_R, oc)
        MPI_send(t_vc, P_R, vc)
        output_port.write(t)
    end loop

With the middleware layer, an outgoing blocking queue of bound b in the original KPN graph is converted into three blocking queues: one with bound b1 between the output port of the source component and the sender middleware task; one with bound b2 between the sender and receiver middleware tasks; and one again with bound b2 between the receiver middleware task and the input port of the sink component. The values b1 and b2 can be chosen freely such that b1 + b2 ≥ b and b1, b2 > 0.
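The discipline implemented by Figs. 4 and 5 is, in effect, credit-based flow control; it can be modelled in a few lines. This is a single-threaded sketch of ours in which a failed `try_*` call stands for the point at which the real task would block on MPI_recv; it is not the middleware code itself.

```cpp
#include <queue>

// The original channel carries data tokens sender -> receiver;
// the virtual channel carries credit tokens receiver -> sender.
struct VirtualChannelLink {
    std::queue<int> original;
    std::queue<int> virtual_ch;

    // The receiver pre-fills the virtual channel with as many
    // tokens as the bound of the original channel (Fig. 5).
    explicit VirtualChannelLink(int bound) {
        for (int i = 0; i < bound; ++i) virtual_ch.push(0);
    }

    // Sender side (Fig. 4): consume a credit before every write.
    bool try_send(int token) {
        if (virtual_ch.empty()) return false;  // channel full: would block
        virtual_ch.pop();
        original.push(token);
        return true;
    }

    // Receiver side (Fig. 5): read a data token, return a credit.
    bool try_recv(int& token) {
        if (original.empty()) return false;    // channel empty: would block
        token = original.front(); original.pop();
        virtual_ch.push(0);
        return true;
    }
};
```

With a bound of 2, two sends succeed and the third must wait until a receive returns a credit — exactly the blocking-write behaviour the bounded KPN channel requires.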



If the middleware layer were not implemented as an active layer, the application tasks would need to be modified to include the virtual channels. Moreover, the use of virtual channels means that no changes to the NoC are required for custom signalling mechanisms.

Another benefit of virtual channels is the avoidance of deadlocks. Since MPI_send can be issued by different middleware tasks residing on the same tile only in a mutually exclusive way, there may be deadlock situations for some task mapping decisions. For example, consider the case (see Fig. 1) where an application task (C) is blocked on a call to MPI_send until the queue on the receiver end is no longer full. It may be that the application task on the receiver end (E) is also blocked, waiting for a token from an application task (D) on the tile where C resides. Since tasks on the same tile have to wait until the MPI_send call of the other task returns, D cannot write the token to be received by E. Therefore we have a deadlock situation where C is blocked on E, E is blocked on D, and D is blocked on C. With virtual channels, it is guaranteed that an MPI_send call will never block.

The problem of deadlocking could also be solved without virtual channels; however, that would require implementing expensive distributed conditional signalling mechanisms on the NoC, or inefficient polling mechanisms.

[Fig. 6: Structure of packets at the NI and middleware levels. MWPacket = (dst component, dst port, token); NIPacket = (src, dst, ts, payload (MWPacket)) (src: source tile, dst: destination tile, ts: timestamp, dst component: destination component, dst port: destination port).]

TABLE I: Port connection table for Figure 1

    Source       Destination
    A out1       B in1
    B out1       C in1
    B out2       D in1
    C out1       E in1
    D out1       E in2
    E out1       F in1

TABLE II: Component mapping table for Figure 2

    Component    Tile
    A            0
    B            0
    C            1
    D            1
    E            2
    F            3

IV. SIMULATION WITH SACRE AND NOXIM

Noxim [6] is an open-source, flit-accurate simulator developed in SystemC. It offers tunable configuration parameters (network size, routing algorithm, traffic type, etc.) in order to analyze and evaluate quality indices such as delay, throughput and energy consumption.

SACRE [4] is a component framework that allows creating self-adaptive applications based on software components and incorporates the Monitor-Controller-Adapter loop with the application pipeline. Its component model is based on KPN, and it supports run-time adaptation of the pipeline via parametric and structural adaptations.

We started integrating SACRE and Noxim in order to be able to simulate KPN applications on NoC platforms. We aim to implement the proposed middleware for the NORMA case. However, Noxim does not provide the MPI_send and MPI_receive primitives; in fact, it does not even come with a NI. We implemented the transport layer such that we can send data and reconstruct it on the other end. In the absence of MPI primitives in SACRE-Noxim, we propose to implement the task-aware middleware over the transport layer of the NoC network interface, as described below.

We conceived the middleware as a KPN task by extending it from SACREComponent, in order to be able to connect it to the queues of the local application tasks. The middleware also inherits from the base SystemC module class (i.e. sc_module), with send and receive processes.

The send process is activated on every positive edge of the clock and reads the input ports in a non-blocking manner. When there is a token to be forwarded, it wraps the token into a MWPacket object, as shown in Figure 6, by adding the destination task name and destination port name as the header information. This data is looked up from the port connection table, which represents the KPN application and records which components and ports are connected to each other, as shown in Table I.

Then the MWPacket object is sent via the NI to the destination tile by wrapping it in a NIPacket object. NIPacket has the structure shown in Figure 6. The destination tile identifier is looked up from the component mapping table stored in the middleware; this table records which components reside on each tile, as shown in Table II. Currently the NI transfers packets by splitting them into several flits using wormhole routing.

The receiver process is also activated on every positive edge of the clock. The network interface receives the flits and reconstructs the MWPacket object. The receiver process then extracts the token and puts it in the right queue by looking at the header of the MWPacket.

V. FAULT-TOLERANCE SUPPORT

Having isolated the application tasks from the network interface, we believe it will be easier to implement fault tolerance mechanisms. Until now, we have considered task migration and semi-concurrent error detection as fault tolerance mechanisms.

A. Task migration

When a tile fails, the tasks mapped on that tile should be moved to other tiles. We will therefore have a controller implementing a task remapping strategy. For now, we do not focus on this, but rather deal with the implementation of the task migration mechanisms.

Moving an application component from one tile to another requires the ability to start the task on the new tile, update the component mapping tables on each tile, create/destroy MW tasks for the outgoing and incoming connectors of the migrated components, and transfer the tokens already available in the connectors of the migrated components along with those components. In case of a fault, the tokens in the queues pending to be forwarded by the middleware tasks in the failed tile may be lost, along with the state of the task if it had any. Similarly, there may be some received flits that have not yet been reconstructed into a packet. We may need to put in place measures to checkpoint the state of the task and the middleware queues.



As a rollback mechanism, we should be able to transfer both the state of the tasks and the queues on the faulty tile to the new tiles. The flits already in the NoC buffers destined to the faulty tile should be re-routed to the new tiles accordingly. This may be easier to achieve if we implement the task-awareness feature in the NoC routers. Otherwise, the NI or the router of the faulty tile should resend those flits back into the network with the correct destinations. We need to further analyze the scenarios according to the extent of the faults (e.g. whether only the processing element is faulty or the whole tile is faulty). However, thanks to the middleware layer, application tasks will not need to know that there has been a task migration.

[Fig. 7: Semi-concurrent error detection at the application level. Component C in (a) is replicated three times (C1, C2, C3) between a Multiplicator and a Majority voter as shown in (b).]

[Fig. 8: Adaptation pattern for parallelization. Component C in (a) is parallelized three times (C1, C2, C3) between a Router and a Merger as shown in (b).]

B. Semi-concurrent error detection at application level

We propose to employ semi-concurrent error detection [13] as a dependability pattern for self-adaptivity. The run-time environment can adapt the dependability at the application level. This enables adaptive dependability levels for different parts of the application pipeline at the granularity of an application component.

In the case of a single component, parallel instances of the component are created on different cores, along with multiplicator components and majority voter components for each input and output port respectively, as shown in Figure 7. The multiplicator component creates a copy of the incoming message for each redundant instance and forwards it to them along with a unique tag identifying the set of copied messages. The majority voter component queues up the output messages until it has as many messages with the same tag as the number of redundant instances. Then it finds the most recurrent message and sends it to its output connector. A time-out mechanism can also be put in place to tolerate the case where a core is faulty and no message is received from a component.

VI. ADAPTIVITY SUPPORT

In [3], [4], we listed possible run-time adaptations of KPN applications at the application and platform level. The application programmer provides a set of goals to be met by the application. These goals are translated into parameters to be monitored by the platform. The adaptations are driven by an adaptation control mechanism that tries to meet the goals by monitoring those parameters. We need to elaborate on the implementation of these adaptations on the NoC platform.

An example of a structural adaptation in order to meet performance and low-power goals is the parallelization pattern.

A. Adapting the level of parallelism

Parallelization of a component is one type of structural adaptation that can be used to increase the throughput of the system, as shown in Figure 8. This is done by creating parallel instances of a component and introducing a router before and a merger after the component instances for each of the input and output ports. A router is a built-in component in our framework that can work in a load-balancing or round-robin fashion; it routes incoming messages to one of the instances depending on its policy. If there is no ordering relation between incoming and outgoing messages, the merger components simply merge the output messages from the output ports of the instances into one connector, disregarding the order of messages, on the basis of whichever message is first available. However, for the general class of KPN applications, the semantics require that the processes comply with the monotonicity property. In that case, the ordering relation has to be preserved. For that purpose, the router component tags every message that would originally go to the component with an integer identifier that counts up from 1 for each message. The merger components then have to queue up the output messages so as to achieve an order in terms of their tags. If there are multiple processor cores available, this mechanism increases the parallelism of the application. However, the condition for the applicability of such an adaptation is the absence of inter-message dependencies.

VII. CONCLUSION AND FUTURE WORK

We propose an active middleware layer to accommodate KPN applications on NoC-based platforms. Besides satisfying KPN semantics, the middleware allows platform-independent application components with regard to communication. It is solely based on the MPI_send and MPI_recv communication primitives, and thus does not require any modification to the NoC platform. The middleware is an initial step towards implementing a self-adaptive run-time environment on the NoC platform.

As future work, the performance implications of the additional middleware tasks should be investigated. The impact of the virtual channel tokens on the load of the NoC should be assessed. Although it may not always be a matter of choice, the performances of KPN applications in NUMA and NORMA architectures need to be evaluated. We have established a collaboration with the University of Cagliari in order to implement
explained below. the presented middleware on their FPGA-based NoC platform

ReCoSoC 2010 May 17-19, 2010, Karlsruhe, Germany 77


[14]. This will allow us to evaluate the proposed middleware in a real setting and obtain results for the active vs. passive middleware and NORMA vs. NUMA cases.

ACKNOWLEDGMENT

This work was funded by the European Commission under the Project MADNESS (No. FP7-ICT-2009-4-248424). The paper reflects only the authors' view; the European Commission is not liable for any use that may be made of the information contained herein.

REFERENCES

[1] G. De Micheli and L. Benini, Networks on Chips: Technology and Tools. Morgan Kaufmann, 2006.
[2] G. Kahn, "The semantics of a simple language for parallel programming," in Information Processing '74: Proceedings of the IFIP Congress, J. L. Rosenfeld, Ed. New York, NY: North-Holland, 1974, pp. 471-475.
[3] O. Derin and A. Ferrante, "Simulation of a self-adaptive run-time environment with hardware and software components," in SINTER '09: Proceedings of the 2009 ESEC/FSE workshop on Software integration and evolution @ runtime. New York, NY, USA: ACM, August 2009, pp. 37-40.
[4] O. Derin and A. Ferrante, "Enabling self-adaptivity in component-based streaming applications," SIGBED Review, vol. 6, no. 3, October 2009, special issue on the 2nd International Workshop on Adaptive and Reconfigurable Embedded Systems (APRES'09).
[5] O. Derin, A. Ferrante, and A. V. Taddeo, "Coordinated management of hardware and software self-adaptivity," Journal of Systems Architecture, vol. 55, no. 3, pp. 170-179, 2009.
[6] "Noxim NoC simulator." [Online]. Available: http://noxim.sourceforge.net
[7] E. Carara, A. Mello, and F. Moraes, "Communication models in networks-on-chip," in RSP '07: Proceedings of the 18th IEEE/IFIP International Workshop on Rapid System Prototyping. Washington, DC, USA: IEEE Computer Society, 2007, pp. 57-60.
[8] "Multicore Association communication API." [Online]. Available: http://www.multicore-association.org
[9] "A high performance message passing library." [Online]. Available: http://www.open-mpi.org/
[10] G. M. Almeida, G. Sassatelli, P. Benoit, N. Saint-Jean, S. Varyani, L. Torres, and M. Robert, "An Adaptive Message Passing MPSoC Framework," International Journal of Reconfigurable Computing, vol. 2009, p. 20, 2009.
[11] A. B. Nejad, K. Goossens, J. Walters, and B. Kienhuis, "Mapping KPN models of streaming applications on a network-on-chip platform," in ProRISC 2009: Proceedings of the Workshop on Signal Processing, Integrated Systems and Circuits, November 2009.
[12] T. M. Parks, "Bounded scheduling of process networks," Ph.D. dissertation, University of California, Berkeley, CA 94720, December 1995.
[13] A. Antola, F. Ferrandi, V. Piuri, and M. Sami, "Semiconcurrent error detection in data paths," IEEE Trans. Comput., vol. 50, no. 5, pp. 449-465, 2001.
[14] P. Meloni, S. Secchi, and L. Raffo, "Exploiting FPGAs for technology-aware system-level evaluation of multi-core architectures," in Proceedings of the 2010 IEEE International Symposium on Performance Analysis of Systems and Software, March 2010.
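The tag-and-reorder scheme used by the router and merger components of the parallelization pattern (Section VI) can be sketched as follows. This is a minimal Python model for illustration only, not the authors' MPI-based implementation; the class names `Router` and `Merger` and the round-robin policy are assumptions:

```python
from heapq import heappush, heappop
from itertools import count, cycle

class Router:
    """Tags each incoming message with an integer identifier that counts
    up from 1 and routes it to one of the parallel instances
    (here: a round-robin policy)."""
    def __init__(self, n_instances):
        self.tags = count(1)
        self.turn = cycle(range(n_instances))
    def route(self, msg):
        # returns (destination instance, tagged message)
        return next(self.turn), (next(self.tags), msg)

class Merger:
    """Queues tagged outputs and releases them strictly in tag order,
    preserving the ordering relation required by the KPN semantics."""
    def __init__(self):
        self.pending = []   # min-heap ordered by tag
        self.next_tag = 1
    def push(self, tagged_msg):
        heappush(self.pending, tagged_msg)
        released = []
        while self.pending and self.pending[0][0] == self.next_tag:
            released.append(heappop(self.pending)[1])
            self.next_tag += 1
        return released

# Four messages routed over two instances; the instance outputs arrive
# out of order but leave the merger in their original order.
router = Router(2)
tagged = [router.route(m)[1] for m in "abcd"]        # tags 1..4
merger = Merger()
released = []
for t in (tagged[1], tagged[0], tagged[3], tagged[2]):
    released += merger.push(t)
print(released)  # ['a', 'b', 'c', 'd']
```

In the unordered case described in the text, `Merger.push` would instead forward whichever message is first available, with no heap and no tags.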



Dynamic Reconfigurable Computing:
the Alternative to Homogeneous Multicores under Massive Defect Rates

Monica Magalhães Pereira and Luigi Carro


Instituto de Informática
Universidade Federal do Rio Grande do Sul
Porto Alegre, Brazil
{mmpereira, carro}@inf.ufrgs.br

Abstract— The aggressive scaling of CMOS technology has increased the density and allowed the integration of multiple processors into a single chip. Although solutions based on MPSoC architectures can increase the application's speed through task-level parallelism, this speedup is still limited by the amount of parallelism available in the application, as demonstrated by Amdahl's Law. Another fundamental aspect is that very aggressive defect rates are expected for new technologies, since the continuous shrinking of device features makes them more fragile and susceptible to break. At high defect rates a large number of the MPSoC's processors will be susceptible to defects, and consequently will fail, not only reducing yield but also severely affecting the expected performance. In this context, this paper presents a run-time adaptive architecture design that allows software execution even under aggressive defect rates. The proposed architecture can accelerate not only highly parallel applications but also sequential ones, and it is a heterogeneous solution to overcome the performance penalty that is imposed on homogeneous MPSoCs under massive defect rates. In the experimental results we compare the performance and area of the proposed architecture to a homogeneous MPSoC solution. The results demonstrate that the architecture can sustain software acceleration even under a 20% defect rate, while the MPSoC solution does not allow any software to be executed under a 15% defect rate for the same area.

Keywords: homogeneous MPSoC; Amdahl's Law; run-time adaptive architecture; defect tolerance.

I. INTRODUCTION

The scaling of CMOS technology has increased the density and consequently made the integration of several processors in one chip possible. Although the use of multicores allows task-level parallelism (TLP) exploitation, the speedup achieved by these systems is limited by the amount of parallelism available in the applications, as already foreseen by Amdahl [1].

One solution to overcome Amdahl's law and sustain the speedup of MPSoCs is the use of heterogeneous cores, where each core is specialized in a different application set. In this way, the MPSoC can accelerate not only highly parallel applications but also sequential ones. An example of a heterogeneous architecture is the Samsung S5PC100 [2] used in the iPhone.

Although the use of heterogeneous cores can be an efficient solution to improve the MPSoC's performance, there are other constraints that must be considered in the design of multicore systems, such as reliability. The scaling process shrinks the wires' diameter, making them more fragile and susceptible to break. Moreover, it is also harder to keep contact integrity between wires and devices [3]. According to Borkar [4], in a 100 billion transistor device, 20 billion will fail in manufacturing and 10 billion will fail in the first year of operation.

At these high defect rates it is highly probable that the defects affect most of the processors of the MPSoC (or even all of them), reducing yield and aggressively affecting the expected performance. Furthermore, when all the processors are affected, the MPSoC becomes useless. One solution to cope with this is to include some fault tolerance approach. Although many solutions have been proposed to cope with defects [5], most of them do not cope with the high defect rates predicted for future technologies. Moreover, the proposed solutions present some additional cost that causes a high impact on area, power or performance, or even on all three [6].

In this context, this paper presents a reconfigurable architecture as an alternative to homogeneous multicores that allows software execution even under high defect rates, and accelerates the execution of parallel and sequential applications. The architecture uses an on-line mechanism to configure itself according to the application, and its design provides acceleration in parallel as well as in sequential portions of the applications. In this way, the proposed architecture can be used to replace homogeneous MPSoCs, since it sustains performance even under high defect rates, and it is a heterogeneous approach to accelerate all kinds of applications. Therefore, its performance is not limited by the parallelism available in the applications.

To validate the architecture, we compare the performance and area of the system to a homogeneous MPSoC with equivalent area. The results indicate that the proposed architecture can sustain execution even under a 20% defect rate, while in the MPSoC all the processors become useless already under a 15% defect rate. Furthermore, at lower defect rates the proposed architecture presents higher acceleration than the MPSoC under the same defect rates whenever the TLP available in the applications is lower than 100%.

The rest of this paper is organized as follows. Section 2 details the adaptive system. Section 3 presents the defect tolerance approach and some experimental results. Section 4



presents a comparison of area and performance between the reconfigurable system and the equivalent homogeneous MPSoC considering different defect rates. Finally, Section 5 presents the conclusions and future work.

II. PROPOSED ARCHITECTURE

The reconfigurable system consists of a coarse-grained reconfigurable array tightly coupled to a MIPS R3000 processor; a mechanism to generate the configuration, called the Binary Translator; and the context memory that stores the configurations [7-8]. Figure 1 illustrates the reconfigurable system.

Figure 1. Reconfigurable System

The reconfigurable array consists of a combinational circuit that comprises three groups of functional units: the arithmetic and logic group, the load/store group and the multiplier group. Figure 2 presents the reconfigurable array (RA).

Figure 2. Reconfigurable Array

Each group of functional units can have a different execution time, depending on the technology and the implementation strategy. Based on this, in this work the ALU group can perform up to three operations in one equivalent processor cycle, while the other groups execute in one equivalent processor cycle. The equivalent processor cycle is called a level. Figure 2 also shows the parallel and sequential paths of the reconfigurable array. The number of functional units is defined according to area constraints and/or the demands of an application set (given by a certain market, e.g. portable phones).

The interconnection model implemented in the reconfigurable fabric is based on multiplexers and buses. The buses are called context lines and receive data from the context registers, which store the data from the processor's register file. The multiplexers select the correct data to be used by each functional unit. Figure 3 illustrates the interconnection model.

Figure 3. Interconnection Model

The different execution times presented by each group of functional units allow the execution of more than one operation per level. Therefore, the array can perform up to three arithmetic and logic operations that present data dependencies among each other in one equivalent processor cycle, consequently accelerating sequential execution. Moreover, the execution time can be improved through modifications of the functional units and with technology evolution, consequently increasing the acceleration of intrinsically sequential parts of a code. Even non-parallel code can achieve better performance when executed in the structure illustrated in Figure 2, as shown in [7].

The Binary Translator (BT) unit implements a mechanism that dynamically transforms sequences of instructions to be executed on the array. The transformation process is transparent, with no need for instruction modification before execution, preserving the software compatibility of the application. Furthermore, the BT works in parallel with the processor's pipeline, presenting no extra overhead to the



processor. Figure 4 illustrates the Binary Translator stages attached to the MIPS R3000 pipeline.

Figure 4. Binary Translator (MIPS pipeline stages: IF Instruction Fetch, ID Instruction Decode, EX Execution, MEM Memory Access, WB Write Back; BT stages: ID Instruction Decode, DV Dependency Verification, RA Resource Allocation, TU Table Update)

A. Configuration generation

In parallel with the processor execution, the BT searches for sequences of instructions that can be executed by the reconfigurable array (RA). A detected sequence is translated into a configuration and stored in the context memory, indexed by the program counter (PC) value of the first instruction of the sequence.

During this process the BT verifies the data dependencies among instructions and performs resource allocation according to data dependencies and resource availability. Both data dependency and resource availability verification are performed through the management of several tables that are filled during execution. At the end of the BT stages a configuration is generated and stored in the context memory.

As mentioned before, since the BT works in parallel with the processor's pipeline, there is no overhead to generate the configuration.

B. RA reconfiguration and execution

While the BT generates and stores the configuration, the processor continues its execution. The next time a PC from a configuration is found, the processor changes to a halt stage, the respective configuration is loaded from the context memory, and the RA's datapath is reconfigured. Moreover, all input data is fetched. Finally, the configuration is executed and the registers and memory positions are written back.

It is important to highlight that the overhead introduced by the RA reconfiguration and the data access is amortized by the acceleration achieved by the RA. Moreover, as mentioned before, the configuration generation does not impose any overhead. More details about the reconfiguration process can be found in [7].

III. DEFECT TOLERANCE

A. Defect tolerance approach

Reconfigurable architectures are strong candidates for defect tolerance. Since they consist essentially of identical functional elements, this regularity can be exploited for spare parts. This is the same approach used in memory devices, where it has been demonstrated to be very efficient [9]. Moreover, the reconfiguration capability can be exploited to change the resource allocation based on the defective and operational resources.

In addition, dynamic reconfiguration can be used to avoid the defective resources and generate the new configuration at run-time. Thus, there is no performance penalty caused by the allocation process, nor are extra fabrication steps required to correct each circuit.

Finally, as will be shown, the capability of adaptation according to the application can be exploited to amortize the performance degradation caused by the replacement of defective resources by working ones.

Since the defect tolerance approach presented in this paper handles only defects generated in the manufacturing process, the information about the defective units is generated by some classical testing techniques before the reconfigurable array starts its execution. Therefore, the solution that provides defect tolerance is transparent to the configuration generation.

Figure 5 illustrates the defect tolerance scheme implemented in the reconfigurable array mechanism. Figure 5.a presents the resource allocation in a defect-free reconfigurable array. As already detailed, parallel instructions are placed in the same row of the reconfigurable array and dependent instructions, which must be executed sequentially, are placed in different rows.

Figure 5. Resource allocation approach

To select the functional units that execute the instructions, the mechanism starts by allocating the first available unit (bottom-left). The next instructions are placed according to data dependencies and resource availability. The control of available units is performed through a table that represents the reconfigurable array. The table indicates which units are available and which ones have already been allocated (i.e. the units that are busy).

In Figure 5.a, instructions 1, 2, 3 and 4 have no data dependencies among each other, hence they are all placed in the same row. On the other hand, instruction 5 has a data dependency on one (or more than one) of the previous instructions. Hence, this instruction is placed in the first available unit of the second row. Continuing the allocation, instructions 6, 7 and 8 have data dependencies on instruction 5. Thus, they are placed in the third row. Finally, instructions 9 and 10 have data dependencies on one (or more than one) instruction from the previous row (6, 7 and 8), hence they must be placed in the fourth row.

In a defective reconfigurable array, the allocation algorithm is exactly as described above. The only difference is that, to allocate only the resources that are effectively able to work, before the reconfigurable array starts and after the traditional



testing steps, all the defective units are set as permanently busy in the table that controls the resource allocation process, as if they had been previously allocated. With this approach, no modification of the reconfigurable array algorithm itself is necessary.

Figures 5.b and 5.c demonstrate the resource allocation considering defective functional units. In Figure 5.b the configuration mechanism placed the instruction in the first available unit, which in this case corresponds to the second functional unit of the first row. Since the first row still has available resources to place the four instructions, the reconfigurable array sustains its execution time. In this case the presence of a defective functional unit does not affect the performance.

Figure 5.c illustrates an example where defective functional units affect the performance of the reconfigurable array. In this example, the first row has only three available functional units. In this case, when there are not enough resources in one row, the instructions are placed in the next row, and all the data-dependent instructions must be moved upwards. In Figure 5.c, instruction 5 depends on instruction 4. Hence, instruction 5 was placed in the next row, and the same happened with the other instructions (6 to 10). In this example, because of the defective units it was necessary to use one more row of the RA, consequently increasing the execution time and affecting performance.

The same approach was implemented to tolerate defects in the interconnection model of Figure 3. However, the strategy can differ depending on which multiplexer is affected. If an input multiplexer is affected, the strategy is to consider the multiplexer and its respective functional unit as defective. On the other hand, if an output multiplexer is defective, it is possible to simply place in the respective functional unit a different instruction that does not use the defective multiplexer.

The defect tolerance approach for the functional units and the interconnection model was already proposed in [10], where more details about the defect tolerance of the interconnection model and experimental results can be found. These details are not in the scope of this paper, since its main focus is to demonstrate how the reconfigurable architecture with the defect tolerance approach presented in [10] can be an efficient alternative to homogeneous multicores under high defect rates.

B. Experimental results

To evaluate the proposed approach and its impact on performance, we implemented the reconfigurable architecture in an architectural simulator that provided the MIPS R3000 execution trace. As workloads we used the MiBench Benchmark Suite [11], which contains a heterogeneous application set, including applications with different amounts of parallelism.

To include defects in the reconfigurable array, a tool was implemented to randomly select functional and interconnection units as defective, based on several different defect rates. The tool's input is the information about the amount of resources available in the array and its sequential and parallel distribution, as well as the defect rate. Based on this, the tool's output has the same resource information, but now with the randomly selected units set as busy. This information was used as input to the architectural simulator. In this study we used five different defect rates (0.01%, 0.1%, 1%, 10% and 20%), and the reference design was the reconfigurable array without defective units.

The size of the RA was based on several studies varying the number of functional units and their parallel and sequential distribution. The studies considered large RAs with thousands of functional units, as well as small arrays with only dozens of functional units. The chosen RA is an intermediate configuration among the studied architectures. It contains 512 functional units (384 ALUs, 96 load/stores and 32 multipliers). The area of this architecture is equivalent to 10 MIPS R3000 processors.

It is important to highlight that, despite the performance degradation presented by the reconfigurable system under a 20% defect rate, the performance was still superior to the standalone processor's performance. Hence, Figure 6 presents the acceleration degradation of the reconfigurable system, instead of the performance degradation.

Figure 6. Acceleration degradation of the reconfigurable system

According to Figure 6, the highest acceleration penalty was presented in the execution of jpegE, with 6.5% of speedup reduction under a 20% defect rate. Nevertheless, the reconfigurable system is still 2.4 times faster than the standalone processor.

The mean speedup achieved by the defect-free RA in the execution of the MiBench applications was 2.6 times. Under a 20% defect rate the mean speedup degraded to 2.5 times. This is less than 4% of speedup degradation.

These results demonstrate that even under a 20% defect rate, the reconfigurable array combined with the on-line reconfiguration mechanism is capable of not only ensuring that the system remains working, but also of accelerating the execution when compared to the original MIPS R3000 processor.

IV. ADAPTIVE SYSTEM VS. HOMOGENEOUS MPSOC

A. Area and performance

To demonstrate the efficiency of the proposed architecture, this section presents a comparison between the adaptive system and a homogeneous MPSoC with the same area.
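The row-based allocation with defective units marked as busy can be sketched as follows. This is our own illustrative model, not the authors' simulator: the instruction list mirrors the ten-instruction example of Figure 5, and the row width of four units is an assumption.

```python
import random

def allocate(instructions, row_width, defective=frozenset()):
    """Greedy row-based allocation: an instruction goes one row below its
    deepest dependency; defective units count as permanently busy."""
    placement = {}
    taken = set(defective)          # (row, col) units set as busy
    for instr, deps in instructions:
        row = max((placement[d][0] + 1 for d in deps), default=0)
        while True:
            free = [c for c in range(row_width) if (row, c) not in taken]
            if free:
                placement[instr] = (row, free[0])
                taken.add((row, free[0]))
                break
            row += 1                # row full: spill to the next row
    return placement

def rows_used(placement):
    # each extra row costs roughly one more equivalent processor cycle
    return max(r for r, _ in placement.values()) + 1

# Figure 5 example: 1-4 independent, 5 depends on 4, 6-8 depend on 5,
# 9 and 10 depend on instructions of the previous row.
prog = [(1, set()), (2, set()), (3, set()), (4, set()), (5, {4}),
        (6, {5}), (7, {5}), (8, {5}), (9, {6, 7}), (10, {8})]

print(rows_used(allocate(prog, 4)))            # 4 rows, defect-free
print(rows_used(allocate(prog, 4, {(0, 3)})))  # 5 rows: one more level

# Random defect injection in the spirit of the paper's tool: mark a
# fraction of the units as busy before execution starts.
rng = random.Random(42)
defects = {(r, c) for r in range(8) for c in range(4)
           if rng.random() < 0.20}
print(rows_used(allocate(prog, 4, defects)))
```

When a row runs out of working units, the spill to the next row propagates to all dependent instructions, which is exactly the source of the performance degradation discussed for Figure 5.c.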



As mentioned before, the area of the reconfigurable system is equivalent to 10 MIPS R3000 processors, including data and instruction caches. Moreover, the mean speedup achieved by the reconfigurable system is 2.6 times for the MiBench suite.

The homogeneous MPSoC used to compare area and performance consists of ten MIPS R3000 processors. In this analysis the communication and memory access overheads are not considered. Although the inter-processor communication is not considered, its impact would certainly be higher in the case of the MPSoC, hence all presented results somewhat favor the MPSoC.

As mentioned before, according to Amdahl's law, the speedup achieved by the MPSoC is limited by the execution time of the sequential portion of the application. Equation (1) states Amdahl's law for parallel systems:

    Speedup = 1 / ((1 - f) + f / n)    (1)

where f is the fraction of the application that can be parallelized and n is the number of cores.

Since the MPSoC has ten cores, by varying f and fixing n at 10 (to have the same area as the reconfigurable array, and hence normalize the results by area), from Amdahl's law we obtained the results presented in Table I, where one can see the speedup as a function of f, the part that can be parallelized, in equation (1).

TABLE I. ACCELERATION AS A FUNCTION OF f, n = 10

    f      Speedup
    0.10    1.099
    0.15    1.156
    0.20    1.220
    0.25    1.290
    0.30    1.370
    0.35    1.460
    0.40    1.563
    0.45    1.681
    0.50    1.818
    0.55    1.980
    0.60    2.174
    0.65    2.410
    0.70    2.703
    0.75    3.077
    0.80    3.571
    0.85    4.255
    0.90    5.263
    0.95    6.897
    0.99    9.174
    1.00   10.000

Since communication and memory access overheads are not considered, with 10 cores it is possible to achieve a speedup of 10 times if 100% of the application is parallelized. According to Table I, it is necessary that 70% of the application be parallelized to achieve a speedup of 2.7 times, which is approximately the acceleration obtained by the reconfigurable system.

Now, one can fix the speedup and vary f to find the number of processors needed to achieve the required speedup. From equation (1), varying f from 0.1 to 1, we have that when f >= 0.65 the speedup of 2.6 can be achieved. When f < 0.65 it is not possible to find a number of cores that achieves an acceleration of 2.6 times. Thus, even with hundreds or thousands of cores, if the application has less than 65% of parallelism, it will never achieve the speedup of 2.6, the same as the reconfigurable array. Nevertheless, even at 65% it would be necessary to have 19 cores to achieve a speedup of 2.6 times, as shown in Figure 7.

Figure 7. Number of cores as a function of f

One solution to cope with this is to improve the homogeneous core's performance to increase the speedup of the sequential execution. Therefore, one can rewrite Amdahl's law to take this into account, as demonstrated in equation (2). This solution was discussed in [1], where the authors presented possible solutions to increase the performance of a homogeneous MPSoC. They conclude that more investment should be made to increase the individual core performance, even at high cost.

    Speedup = 1 / ((1 - f) / AS + f / AP)    (2)

Equation (2) is an extension of Amdahl's law, and reflects the idea of improving the MPSoC overall performance by increasing core performance through acceleration of the sequential portions. In equation (2), AS is the speedup of the sequential portion and AP is the speedup of the parallel portion. Table II presents values for AS, fixing the speedup at 2.6 (the acceleration given by the reconfigurable array) and AP at 10 (the homogeneous multicore and the reconfigurable array have the same area), while varying f. As one can see in Table II, only when f = 100% is AS = 0, which means that this is the only case that does not require sequential acceleration. This acceleration cannot be achieved by the homogeneous MPSoC; however, as explained in Section 2, the reconfigurable array can accelerate the sequential portions of the application.
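Equations (1) and (2) are easy to check numerically. The sketch below reproduces entries of Tables I and II; note that `sequential_speedup_needed` is our own rearrangement of equation (2), solved for AS:

```python
def amdahl(f, n):
    """Equation (1): overall speedup with parallel fraction f on n cores."""
    return 1.0 / ((1.0 - f) + f / n)

def sequential_speedup_needed(target, f, ap):
    """Equation (2) solved for AS: the sequential-portion speedup needed
    to reach `target` overall speedup when the parallel portion is
    accelerated by a factor ap."""
    return (1.0 - f) / (1.0 / target - f / ap)

print(round(amdahl(0.70, 10), 3))    # 2.703, as in Table I
print(round(amdahl(1.00, 10), 3))    # 10.0: the ideal, fully parallel case
print(round(sequential_speedup_needed(2.6, 0.5, 10), 3))  # 1.494 (Table II)
```

The same functions also confirm the 65% threshold discussed above: for f below 0.65, `amdahl(f, n)` stays below 2.6 no matter how large n grows.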



TABLE II. SEQUENTIAL ACCELERATION AS A FUNTION OF f solution presents lower area cost compared to the solution of
f Speedup AP AS the second analysis. However, it can be more complex to
implement, since each processor must have an extra unit to
0.1 2.60 10 2.402
implement the fault tolerance approach. Therefore, this solution
0.2 2.60 10 2.194
0.3 2.60 10 1.974
considers that each processor has 3 arithmetic and logic units,
0.4 2.60 10 1.741 where 2 ALUS are used as spare.
0.5 2.60 10 1.494
As can be observed in Figure 8, in both second and third
0.6 2.60 10 1.232
0.7 2.60 10 0.954
analyses, under a 15% defect rate all the cores fail. This means
0.9 2.60 10 0.339 that even with fault tolerance solutions, the MPSoC tends to
0.99 2.60 10 0.035 fail completely at high fault rates.
1 2.60 10 0.000

The next section presents a comparison between the


MPSoC and the RA considering the fault tolerance capability.
The results are normalized by area and speedup.

B. Defect tolerance
To compare the performance degradation of the
reconfigurable system with the MPSoC caused by the presence
of defects we performed a performance simulation varying the
defect rate.
To simulate the defects in both, MPSoC and reconfigurable
array, a tool was implemented to randomly insert defects in the Figure 8. Defects simulation in the MPSoC
architectures. To ensure that the defect position was not
affecting the results thousands of simulations were performed Figure 9 presents the performance degradation of the
and in each simulation a new random set of defects was MPSoC when the number of cores is reduced due to the
generated. Moreover, the defects generated had the same size presence of defects. To obtain the speedup it was used the
(granularity) to both MPSoC and reconfigurable architecture. Amdahl’s law represented in equation 1; the MPSoC with ten
In the first analysis we normalized RA and MPSoC by area. cores (without Fault Tolerance); and the one with 5 cores and 2
In the second analysis we increased the number of cores of the MPSoC to evaluate the tradeoff between area and fault tolerance capability.

1) MPSoC and RA with same area: Figure 8 illustrates the number of cores affected by defects as a function of the defect rate in three different studies. The first analysis was performed on a homogeneous MPSoC with 10 MIPS R3000 processors without any fault tolerance approach. According to the results, when the defect rate is 15% or higher, more than 9 cores are affected. Therefore, the whole MPSoC system fails under a 15% defect rate.

The second and third analyses were performed considering that the MPSoC has some kind of fault tolerance solution implemented. In the second analysis, the fault tolerance solution consists of replicating the processor in each core. In this case, instead of having 10 cores with 10 processors, the MPSoC has 5 cores with 2 processors in each core. The second processor works as a spare that is used only when the first processor fails. This solution was proposed for two main reasons. First, there is no increase in area: the MPSoC still has the same area as the RA. Second, even with half the number of cores, the MPSoC still presents a higher speedup than the array when the application is 100% parallel.

The solution proposed in the third analysis is also based on hardware redundancy. However, in this case, instead of replicating the whole processor, only critical components of the processor are replicated, e.g. the arithmetic and logic unit. This [...] processors per core. Again, we considered no communication costs, and hence real results tend to be worse. The chart also presents the mean performance degradation of the reconfigurable system in the execution of the MiBench applications.

The analysis was performed using f=0.70 (the portion of the application that can be parallelized). This number was used because, according to Table I, the speedup achieved by the MPSoC when 70% of the application can be parallelized approaches the speedup achieved by the reconfigurable system. The numbers next to the dots in the chart represent the number of cores that are still working in the MPSoC under each defect rate.

Figure 9. Performance degradation of the MPSoC
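The f = 0.70 figure above plugs into Amdahl's law [1]: with a fraction f of an application parallelizable across n working cores, the achievable speedup is 1/((1-f) + f/n). A minimal sketch of how the MPSoC speedup shrinks as defects remove cores (the function name and printed values are ours, for illustration only; this is not the paper's simulator):

```python
def amdahl_speedup(f, n):
    """Amdahl's law: speedup when a fraction f of the application
    is parallelized across n working cores."""
    if n < 1:
        return 0.0  # no working core left: the MPSoC fails entirely
    return 1.0 / ((1.0 - f) + f / n)

# With f = 0.70, a defect-free 10-core MPSoC gains only ~2.7x;
# as defects disable cores, the speedup collapses toward 1x.
for cores in (10, 5, 1):
    print(cores, round(amdahl_speedup(0.70, cores), 2))
# prints: 10 2.7 / 5 2.27 / 1 1.0
```

This is consistent with the choice of f = 0.70 for the comparison: at that fraction, the ideal speedup of the defect-free 10-core MPSoC is close to the one measured for the reconfigurable system.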

84 May 17-19, 2010, Karlsruhe, Germany ReCoSoC 2010


As can be observed in Figure 9, the performance of the MPSoC degrades faster than that of the reconfigurable system, even when the MPSoC presents a higher speedup in a defect-free situation (10-core MPSoC). This happens because when a defect lands in any part of a processor, the affected processor can no longer execute properly. On the other hand, when a defect lands in any functional unit or interconnection element of the reconfigurable array, the run-time mechanism selects another unit (or element) to replace the defective one. According to Figure 9, the execution of the MiBench applications by the reconfigurable system presented less than 4% speedup degradation even under a 20% defect rate.

It is important to highlight that these analyses do not consider the area and performance overhead that implementing the fault tolerance strategies would introduce; again, the presented results somewhat favor the MPSoC. Moreover, these fault tolerance solutions were chosen to cause minimal impact on system area, so that the RA and the MPSoC remain equivalent in area.

Figure 10 presents the graceful degradation of the applications sha and jpegE. These applications were selected because the former achieved the highest speedup on the reconfigurable system among all the applications from the MiBench suite and the latter presented the highest speedup degradation.

Figure 10. Graceful degradation of sha and jpegE applications

According to Figure 10, the execution of sha by the reconfigurable system presented less than 1% speedup degradation even under a 20% defect rate. On the other hand, even considering the application 100% parallelized (f=1), with an initial acceleration higher than the one achieved by the reconfigurable system, the 10-core MPSoC stopped working under a 20% defect rate. The same behavior was observed when the amount of available parallelism was reduced (f=0.85 and f=0.70). In these cases, not only did the whole system stop working under a 20% defect rate, but the initial acceleration was, respectively, equal to and lower than the one achieved by the reconfigurable system.

Moreover, as can be observed in the figure, the same behavior appears in the jpegE results. However, for this application the MPSoC presents a higher speedup with f=0.85 than the RA, but it rapidly decreases to 0 when the defect rate exceeds 1%. On the other hand, the RA sustains acceleration even under a 20% defect rate, with a speedup degradation of only 6.5%.

2) Increase MPSoC core number: Since the RA consists of a large number of identical functional units that can be easily replaced, the same idea was applied to the MPSoC: increase the number of cores to increase reliability. Thus, this solution consists of adding more cores to the MPSoC to allow software execution under higher defect rates.

As one can observe in Figure 11, the MPSoC with 32 cores still executes under a 15% defect rate. However, the execution is completely sequential (one core left under a 15% defect rate). Moreover, under a 20% defect rate the 32-core MPSoC completely fails. The speedup results presented in Figure 12 also demonstrate the rapid decrease in the MPSoC speedup even with 32 cores, while the RA sustains acceleration in both sha and jpegE even under a 20% defect rate.

Figure 11. Defects simulation in the 32-cores MPSoC

Figure 12. Graceful degradation of 32-cores MPSoC

Based on this result, one can conclude that simply replicating cores is not enough to raise the defect tolerance of the system to the high defect rates that new technologies are expected to introduce. Moreover, adding a fault tolerance approach can be costly in area and performance.

The analyses presented in this section demonstrate that, for future technologies with high defect rates, homogeneous MPSoCs may not be the most appropriate solution. The main reason is that a defect in any part of a processor invalidates that processor. Thus, the higher the defect rate,



the more aggressive the performance degradation, leading to a complete failure of the system under defect rates already predicted for future technologies, such as nanowires [3]. Furthermore, solutions that provide fault tolerance in homogeneous MPSoCs under high defect rates can be costly, both in area and in performance.

Another disadvantage of homogeneous MPSoCs is that they can only exploit task-level parallelism, and thus depend on the parallelism available in each application. Therefore, only a specific, highly parallelized application set can benefit from the high integration density and, consequently, from the integration of several cores in one chip [12].

There are two main solutions to cope with this. The first is to use heterogeneous MPSoCs, where each core can accelerate a specific application set [2]. The main problem of this solution is that, like the homogeneous multicore, the heterogeneous one must also have some fault tolerance approach to cope with high fault rates, which can increase area and performance costs.

The other possible solution is to increase the speedup of each core individually. By improving each core it is possible to accelerate sequential portions of code and consequently increase the overall performance of the system. An example of this approach is to exchange the MIPS R3000 cores for superscalar MIPS R10000 cores [13]. However, this strategy can result in a significant area increase: according to [14], a MIPS R10000 is almost 29 times larger than a MIPS R3000.

The analyses also demonstrate that the proposed reconfigurable architecture ensures software execution and accelerates the execution of several applications even under a 20% defect rate. Moreover, the reconfigurable system is a heterogeneous solution that accelerates both parallel and sequential code. Thanks to this approach, the proposed architecture can still accelerate code even when exposed to the high defect rates predicted for future technologies, since parallelism exploitation is not the only way to accelerate execution.

V. CONCLUSIONS

Advances in CMOS technology scaling have increased integration density and consequently enabled the inclusion of several cores in one single chip.

MPSoC solutions allow the acceleration of application execution through task-level parallelism exploitation. However, the main problem of these solutions is that they are limited by the amount of parallelism available in each application, as demonstrated by Amdahl's law.

One of the solutions to overcome this limit is using heterogeneous MPSoCs, where each core is specialized in a different application set, or even increasing the speedup of each core individually. These approaches can improve performance but cannot handle the high defect rates expected in future technologies.

This paper presented a run-time reconfigurable architecture that can sustain performance even under high defect rates to replace homogeneous MPSoC solutions. The system consists of a reconfigurable array and an on-line mechanism that performs defective functional unit replacement at run-time without the need for extra tools or hardware.

The reconfigurable array design allows the acceleration of parallel and sequential portions of applications, and can be used as a heterogeneous solution to replace homogeneous MPSoCs and ensure reliability in a highly defective environment.

To validate the proposed approach, several simulations were performed to compare the performance degradation of the reconfigurable system and the MPSoC under the same defect rates, normalizing the architectures by area and speedup. According to the results, the reconfigurable system sustains execution even under a 20% defect rate, while the MPSoC with equivalent area has all its cores affected under a 15% defect rate.

Future work includes analyzing the power and energy of these systems and coping with transient faults in the reconfigurable system.

REFERENCES
[1] M. D. Hill and M. R. Marty, "Amdahl's Law in the Multicore Era," Computer, vol. 41, no. 7, July 2008, pp. 33-38.
[2] Samsung Electronics Co., Ltd., Samsung S5PC100 ARM Cortex A8 based Mobile Application Processor, 2009.
[3] A. DeHon and H. Naeimi, "Seven strategies for tolerating highly defective fabrication," IEEE Design & Test, vol. 22, IEEE Press, July-Aug. 2005, pp. 306-315.
[4] S. Borkar, "Microarchitecture and Design Challenges for Gigascale Integration," keynote address, 37th Annual IEEE/ACM International Symposium on Microarchitecture, 2004.
[5] I. Koren and C. M. Krishna, Fault-Tolerant Systems, Morgan Kaufmann, 2007.
[6] S. K. Shukla and R. I. Bahar, Nano, Quantum and Molecular Computing: Implications to High Level Design and Validation, Kluwer Academic Publishers, 2004.
[7] A. C. S. Beck, M. B. Rutzig, G. Gaydadjiev and L. Carro, "Transparent Reconfigurable Acceleration for Heterogeneous Embedded Applications," Proc. of Design, Automation and Test in Europe (DATE), 2008, pp. 1208-1213.
[8] A. C. S. Beck and L. Carro, "Transparent Acceleration of Data Dependent Instructions for General Purpose Processors," Proc. of the International Conference on Very Large Scale Integration, 2007, pp. 66-71.
[9] E. Scott, P. Sedcole and P. Y. K. Cheung, "Fault Tolerant Methods for Reliability in FPGAs," Proc. of the International Conference on Field Programmable Logic and Applications, Sept. 2008, pp. 415-420.
[10] M. M. Pereira and L. Carro, "A Dynamic Reconfiguration Approach for Accelerating Highly Defective Processors," Proc. of the 17th IFIP/IEEE International Conference on Very Large Scale Integration, Oct. 2009.
[11] M. R. Guthaus et al., "MiBench: A free, commercially representative embedded benchmark suite," Proc. of the 4th IEEE International Workshop on Workload Characterization, IEEE Press, Dec. 2001, pp. 3-14.
[12] K. Olukotun, L. Hammond and J. Laudon, Chip Multiprocessor Architecture, M. D. Hill (ed.), 2006.
[13] K. C. Yeager, "The MIPS R10000 Superscalar Microprocessor," IEEE Micro, April 1996, pp. 28-40.
[14] M. B. Rutzig et al., "TLP and ILP exploitation through a Reconfigurable Multiprocessor System," Proc. of the 17th Reconfigurable Architectures Workshop, Atlanta, USA, 2010.
that can sustain performance even under high defect rates to



An NoC Traffic Compiler for Efficient FPGA
Implementation of Parallel Graph Applications
Nachiket Kapre
California Institute of Technology
Pasadena, CA 91125
nachiket@caltech.edu

André DeHon
University of Pennsylvania
Philadelphia, PA 19104
andre@acm.org




Abstract—Parallel graph algorithms expressed in a Bulk-Synchronous Parallel (BSP) compute model generate highly-structured communication workloads from messages propagating along graph edges. We can expose this structure to traffic compilers and optimization tools before runtime to reshape and reduce traffic for higher performance (or lower area, lower energy, lower cost). Such offline traffic optimization eliminates the need for complex, runtime NoC hardware and enables lightweight, scalable FPGA NoCs. In this paper, we perform load balancing, placement, fanout routing and fine-grained synchronization to optimize our workloads for large networks of up to 2025 parallel elements. This allows us to demonstrate speedups between 1.2× and 22× (3.5× mean), area reductions (number of Processing Elements) between 3× and 15× (9× mean) and dynamic energy savings between 2× and 3.5× (2.7× mean) over a range of real-world graph applications. We expect such traffic optimization tools and techniques to become an essential part of the NoC application-mapping flow.

Fig. 1: NoC Traffic Compilation Flow (annotated with cnet-default workload at 2025 PEs)

I. INTRODUCTION

Real-world communication workloads exhibit structure in the form of locality, sparsity, fanout distribution, and other properties. If this structure can be exposed to automation tools, we can reshape and optimize the workload to improve performance, lower area and reduce energy. In this paper, we develop a traffic compiler that exploits structural properties of Bulk-Synchronous Parallel communication workloads. This compiler provides insight into performance tuning of communication-intensive parallel applications. The performance and energy improvements made possible by the compiler allow us to build the NoC from simple hardware elements that consume less area and eliminate the need for complex, area-hungry, adaptive hardware. We now introduce the key structural properties exploited by our traffic compiler.
• When the natural communicating components of the traffic do not match the granularity of the NoC architecture, applications may end up poorly load balanced. We discuss Decomposition and Clustering as techniques to improve load balance.
• Most applications exhibit sparsity and locality; an object often interacts regularly with only a few other objects in its neighborhood. We exploit these properties by Placing communicating objects close to each other.
• Data updates from an object must often be seen by multiple neighbors, meaning the network must route the same message to multiple destinations. We consider Fanout Routing to avoid redundantly routing data.
• Finally, applications that use barrier synchronization can minimize node idle time induced by global synchronization between the parallel regions of the program by using Fine-Grained Synchronization.

While these optimizations have been discussed independently and extensively in the literature (e.g. [1], [2], [3], [4], [5]), we develop a toolflow that auto-tunes the control parameters of these optimizations per workload for maximum benefit and quantify the cumulative benefit of applying these optimizations to various applications in on-chip network settings. This quantification further illustrates how the performance impact of each optimization changes with NoC size. The key contributions of this paper include:
• Development of a traffic compiler for applications described using the BSP compute model.
• Use of communication workloads extracted from ConceptNet, Sparse Matrix-Multiply and Bellman-Ford running on a range of real-world circuits and graphs.
• Quantification of the cumulative benefits of each stage of the compilation flow (performance, area, energy).
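As a concrete reading of the locality bullet above, the benefit of placement can be counted as the number of graph edges crossing the chip bisection: fewer crossings mean less traffic competing for mid-chip bandwidth. A toy example (our own sketch, not the paper's partitioner):

```python
def bisection_cut(edges, half):
    """Count messages that must cross the chip bisection.
    `half` maps each node to 0 (left half) or 1 (right half)."""
    return sum(1 for u, v in edges if half[u] != half[v])

# A 6-node ring: every node exchanges messages with its two neighbors.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 0)]

scattered = {0: 0, 1: 1, 2: 0, 3: 1, 4: 0, 5: 1}  # locality ignored
local     = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}  # neighbors kept together

print(bisection_cut(edges, scattered))  # 6
print(bisection_cut(edges, local))      # 2
```

A placement-aware partitioner plays the same game on real application graphs, which is why keeping communicating nodes together reduces both the number of messages at the bisection and the average distance each message travels.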



Fig. 2: Architecture of the NoC

II. BACKGROUND

A. Application

Parallel graph algorithms are well-suited for concurrent processing on FPGAs. We describe graph algorithms in a Bulk-Synchronous Parallel (BSP) compute model [6] and develop an FPGA system architecture [7] for accelerating such algorithms. The compute model defines the intended semantics of the algorithm, so we know which optimizations preserve the desired meaning while reducing NoC traffic. The graph algorithms are a sequence of steps, where each step is separated by a global BSP barrier. In each step, we perform parallel, concurrent operations on the nodes of a graph data structure, where all nodes send messages to their neighbors while also receiving messages. The graphs in these algorithms are known when the algorithm starts and do not change during the algorithm. Our communication workload consists of routing a set of messages between graph nodes. We route the same set of messages, corresponding to the graph edges, in each epoch. Applications in the BSP compute model generate traffic with many communication characteristics (e.g. locality, sparsity, multicast) which also occur in other applications and compute models. Our traffic compiler exploits this a priori knowledge of structure-rich communication workloads (see Section IV-A) to provide performance benefits. Our approach differs from some recent NoC studies that use statistical traffic models (e.g. [9], [10], [11], [12]) and random workloads (e.g. [13], [14], [15]) for analysis and experiments. Statistical and random workloads may exaggerate traffic requirements and ignore application structure, leading to overprovisioned NoC resources and missed opportunities for workload optimization.

In [9], the authors demonstrate a 60% area reduction along with an 18% performance improvement for well-behaved workloads. In [11], we observe a 20% reduction in buffer sizes and a 20% frequency reduction for an MPEG-2 workload. In [13], the authors deliver a 23.1% reduction in time, a 23% reduction in area as well as a 38% reduction in energy for their design. We demonstrate better performance, lower area requirements and lower energy consumption (Section V).

B. Architecture

We organize our FPGA NoC as a bidirectional 2D-mesh [16] with a packet-switched routing network, as shown in Figure 2. The application graph is distributed across the Processing Elements (PEs), which are specialized to process graph nodes. Each PE stores a portion of the graph in its local on-chip memory and performs accumulate and update computations on each node as defined by the graph algorithm. The PE is internally pipelined and capable of injecting and receiving a new packet in each cycle. The switches implement a simple Dimension-Ordered Routing algorithm [21] and also support fully-pipelined operation using composable Split and Merge units. We discuss additional implementation parameters in Section IV-B. Prior to execution, the traffic compiler is responsible for allocating graph nodes to PEs. During execution, the PE iterates through all local nodes and generates outbound traffic that is routed over the packet-switched network. Inbound traffic is stored in the incoming message buffers of each PE. The PE can simultaneously handle incoming and outgoing messages. Once all messages have been received, a barrier is detected using a global reduce tree (a bit-level AND-reduce tree). The graph application proceeds through multiple global barriers until the algorithm terminates. We measure network performance as the number of cycles required for one epoch between barriers, including both computation and all message routing.

III. OPTIMIZATIONS

In this section, we describe the set of optimizations performed by our traffic compiler.

1) Decomposition: Ideally, for a given application, as the PE count increases, each PE holds smaller and smaller portions of the workload. For graph-oriented workloads, unusually large nodes with a large number of edges (i.e. nodes that






send and receive many messages) can prevent the smooth distribution of the workload across the PEs. As a result, performance is limited by the time spent sending and receiving messages at the largest node (streamlined message processing in the PEs implies work ∝ number of messages per node). Decomposition is a strategy where we break large nodes into smaller nodes (either inputs, outputs or both can be decomposed) and distribute the work of sending and receiving messages at the large node over multiple PEs. The idea is similar to that used in synthesis and technology mapping of logic circuits [1]. Fig. 3 illustrates the effect of decomposing a node. Node 5 with 3 inputs gets fanin-decomposed into Nodes 5a and 5b with 2 inputs each, thereby reducing the serialization at the node from 3 cycles to 2. Similarly, Node 1 with 4 outputs is fanout-decomposed into Nodes 1a and 1b with 3 outputs and 2 outputs, respectively. Greater benefits can be achieved with higher-fanin/fanout nodes (see Table I).

Fig. 3: Decomposition

In general, when the output from a graph node is a result which must be multicast to multiple outputs, we can easily build an output fanout tree to decompose output routing. However, input edges to a graph node can only be decomposed when the operation combining the inputs is associative. ConceptNet and Bellman-Ford (discussed later in Section IV-A) permit input decomposition since their nodes perform simple integer sum and max operations, which are associative and can be decomposed. However, Matrix-Multiply nodes perform non-associative floating-point accumulation over incoming values, which cannot be broken up and distributed.

2) Clustering: While Decomposition is necessary to break up large nodes, we may still have an imbalanced system if we randomly place nodes on PEs. Random placement fails to account for the varying amount of work performed per node. Lightweight Clustering is a common technique used to quickly distribute nodes over PEs to achieve better load balance (e.g. [2]). We use a greedy, linear-time Clustering algorithm similar to the Cluster Growth technique from [2]. We start by creating as many "clusters" as PEs and randomly assign a seed node to each cluster. We then pick nodes from the graph and greedily assign them to the PE that least increases cost. The cost function ("Closeness metric" in [2]) is chosen to capture the amount of work done in each PE, including sending and receiving messages.

3) Placement: Object communication typically exhibits locality. A random placement ignores this locality, resulting in more traffic on the network. Consequently, random placement imposes a greater traffic requirement which can lead to poor performance, higher energy consumption and inefficient use of network resources. We can Place nodes close to each other to minimize traffic requirements and get better performance than random placement. The benefit of performing placement for NoCs has been discussed in [3]. Good placement reduces both the number of messages that must be routed on the network and the distance each message must travel. This decreases competition for network bandwidth and lowers the average latency required by the messages. Fig. 4 shows a simple example of good Placement. A random partitioning of the application graph may bisect the graph with a cut size of 6 edges (i.e. 6 messages must cross the chip bisection). Instead, a high-quality partitioning of the graph will find a better cut of size 4. The load on the network is reduced since 2 fewer messages must cross the bisection. In general, Placement is an NP-complete problem, and finding an optimal solution is computationally intensive. We use a fast multi-level partitioning heuristic [17] that iteratively clusters nodes and moves the clustered nodes between partitions to search for a better-quality solution.

Fig. 4: Placement (Random Placement vs. Good Placement)

4) Fanout Routing: Some applications may require multicast messages (i.e. single source, multiple destinations). Our application graphs contain nodes that send the exact same message to all their destinations. Routing redundant messages is a waste of network resources. We can use the network more efficiently with Fanout Routing, which avoids routing redundant messages. This has been studied extensively by Duato et al. [4]. If many destination nodes reside in the same physical PE, it is possible to send only one message, instead of many duplicate messages, to that PE. For this to work, there need to be at least two sink nodes in the destination PE. The PE then internally distributes the message to the intended recipients. This is shown in Fig. 5. The fanout edges from Node 3 to Node 5a and Node 4 can be replaced with a shared edge as shown. This reduces the number of messages crossing the bisection by 1. This optimization works best at reducing traffic and message-injection costs at low PE counts. As PE counts increase, we have more possible destinations for the outputs and fewer shareable nodes in the PEs, resulting in decreasing benefits.

5) Fine-Grained Synchronization: In parallel programs with multiple threads, synchronization between the threads is sometimes implemented with a global barrier for simplicity. However, the global barrier may artificially serialize computation. Alternately, the global barrier can be replaced



with local synchronization conditions that avoid unnecessary sequentialization. Techniques for eliminating such barriers have been previously studied [18], [5]. In the BSP compute model discussed in Section II, execution is organized as a series of parallel operations separated by barriers. We use one barrier to signify the end of the communicate phase and another to signify the end of the compute phase. If it is known prior to execution that the entire graph will be processed, the first barrier can be eliminated by using local synchronization operations. A node can be permitted to start the compute phase as soon as it receives all its incoming messages, without waiting for the rest of the nodes to have received their messages. This prevents performance from being limited by the sum of worst-case compute and communicate latencies when they are not necessarily coupled. We show the potential benefit of Fine-Grained Synchronization in Fig. 6. Node 2 and Node 3 can start their Compute phases after they have received all their input messages. They do not need to wait for all other nodes to receive their messages. This optimization enables the Communicate phase and the Compute phase to be overlapped.

Fig. 5: Fanout-Routing

Fig. 6: Fine-Grained Synchronization

TABLE I: Application Graphs

Graph            Nodes    Edges   Max Fanin  Max Fanout
ConceptNet
 cnet-small      14556    27275        226        2538
 cnet-default   224876   553837      16176       36562
Matrix-Multiply
 add20            2395    17319        124         124
 bcsstk11         1473    17857         27          30
 fidap035        19716   218308         18          18
 fidapm37         9152   765944        255         255
 gemat11          4929    33185         27          28
 memplus         17758   126150        574         574
 rdb3200l         3200    18880          6           6
 utm5940          5940    83842         30          20
Bellman-Ford
 ibm01           12752    36455         33          93
 ibm05           29347    97862          9         109
 ibm10           69429   222371        137         170
 ibm15          161570   529215        267         196
 ibm16          183484   588775        163         257
 ibm18          210613   617777         85         209

IV. EXPERIMENTAL SETUP

A. Workloads

We generate workloads from a range of applications mapped to the BSP compute model. We choose applications that cover different domains, including AI, Scientific Computing and CAD optimization, and that exhibit important structural properties.

1) ConceptNet: ConceptNet [19] is a common-sense reasoning knowledge base described as a graph, where nodes represent concepts and edges represent semantic relationships. Queries to this knowledge base start a spreading-activation algorithm from an initial set of nodes. The computation spreads over larger portions of the graph through a sequence of steps by passing messages from activated nodes to their neighbors. In the case of complex queries or multiple simultaneous queries, the entire graph may become activated after a small number of steps. We route all the edges in the graph, representing this worst-case step. In [7], we show a per-FPGA speedup of 20× compared to a sequential implementation.

2) Matrix-Multiply: Iterative Sparse Matrix-Vector Multiply (SMVM) is the dominant computational kernel in several numerical routines (e.g. Conjugate Gradient, GMRES). In each iteration, a set of dot products between the vector and the matrix rows is performed to calculate new values for the vector to be used in the next iteration. We can represent this computation as a graph where nodes represent matrix rows and edges represent the communication of the new vector values. In each iteration messages must be sent along all edges; these edges are multicast, as each vector entry must be sent to each row graph node with a non-zero coefficient associated with the vector position. We use sample matrices from the Matrix Market benchmark [20]. In [8], we show a speedup of 2-3× over an optimized sequential implementation using an older-generation FPGA and a performance-limited ring topology.

3) Bellman-Ford: The Bellman-Ford algorithm solves the single-source shortest-path problem, identifying any negative edge-weight cycles, if they exist. It finds application in CAD optimizations like Retiming, Static Timing Analysis and FPGA Routing. Nodes represent gates in the circuit while edges represent wires between the gates. The algorithm simply relaxes all edges in each step until quiescence. A relaxation consists of computing the minimum at each node over all weighted incoming message values. Each node then communicates the result of the minimum to all its neighbors to prepare for the next relaxation.

B. NoC Timing and Power Model

All our experiments use a single-lane, bidirectional-mesh topology that implements a Dimension-Ordered Routing function. The Matrix-Multiply network is 84-bits wide while Con-



TABLE II: NoC Timing Model Cycles vs. PEs

Mesh Switch Latency 106


undecomposed
Tthrough (X-X, Y-Y) 2 decomposed
Tturn (X-Y, X-Y) 4
Tintef ace (PE-NoC, NoC-PE) 6
105

Cycles
Twire 2
Processing Element Latency
Tsend 1
Treceive (ConceptNet, Bellman-Ford) 1
104
Treceive (Matrix-Multiply) 9

TABLE III: NoC Dynamic Power Model 4 16 100 400 900 2025
PEs
Datawidth Block Dynamic Power at diff. activity (mW)
(Application) 0% 25% 50% 75% 100% Fig. 7: Decomposition
52 (ConceptNet, Split 0.26 1.07 1.45 1.65 1.84 (cnet-default)
Bellman-Ford) Merge 0.72 1.58 2.1 2.49 2.82
Split 0.32 1.35 1.78 2.02 2.26
84 (Matrix-Multiply)
Merge 0.9 1.87 2.45 2.88 3.25 how the workload gets distributed across PEs using Clustering
or Placement. Finally, we perform Fanout Routing and Fine-
ceptNet and Bellman-Ford networks are 52-bits wide (with 20-
Grained Synchronization optimizations. We illustrate scaling
bits of header in each case). The switch is internally pipelined
to accept a new packet on each cycle (see Figure 2). Different routing paths take different latencies inside the switch (see Table II). We pipeline the wires between the switches for high performance (counted in terms of cycles required as Twire). The PEs are also pipelined to start processing a new edge every cycle. ConceptNet and Bellman-Ford compute simple sum and max operations while Matrix-Multiply performs floating-point accumulation on the incoming messages. Each computation on the edge then takes 1 or 9 cycles of latency to complete (see Table II). We estimate dynamic power consumption in the switches using XPower [22]. Dynamic power consumption at different switching activity factors is shown in Table III. We extract the switching activity factor in each Split and Merge unit from our packet-switched simulator. When comparing dynamic energy, we multiply dynamic power with simulated cycles to get energy. We generate bitstreams for the switch and PE on a modern Xilinx Virtex-5 LX110T FPGA [22] to derive the timing and power models shown in Table II and Table III.

C. Packet-Switched Simulator

We use a Java-based cycle-accurate simulator that implements the timing model described in Section IV-B for our evaluation. The simulator models both computation and communication delays, simultaneously routing messages on the NoC and performing computation in the PEs. Our results in Section V report performance observed on cycle-accurate simulations of different circuits and graphs. The application graph is first transformed by a, possibly empty, set of optimizations from Section III before being presented to the simulator.

V. EVALUATION

We now examine the impact of the different optimizations on various workloads to quantify the cumulative benefit of our traffic compiler. We order the optimizations appropriately to analyze their additive impacts. First we load balance our workloads by performing Decomposition. We then determine trends of individual optimizations using a single illustrative workload for greater clarity. At the end, we show cumulative data for all benchmarks together.

A. Impact of Individual Optimizations

1) Decomposition: In Fig. 7, we show how the ConceptNet cnet-default workload scales with increasing PE counts under Decomposition. We observe that Decomposition allows the application to continue to scale up to 2025 PEs and possibly beyond. Without Decomposition, performance quickly runs into a serialization bottleneck due to large nodes as early as 100 PEs. The decomposed NoC workload manages to outperform the undecomposed case by 6.8× in performance. However, the benefit is lower at low PE counts, since the maximum logical node size becomes small compared to the average work per PE. Additionally, decomposition is only useful for graphs with high degree (see Table I). In Figure 8 we show how the decomposition limit control parameter impacts the scaling of the workload. As expected, without decomposition, performance of the workload saturates beyond 32 PEs. Decomposition with a limit of 16 or 32 allows the workload to scale up to 400 PEs and provides a speedup of 3.2× at these system sizes. However, if we attempt an aggressive decomposition with a limit of 2 (all decomposed nodes allowed to have a fanin and fanout of 2), performance is actually worse than the undecomposed case between 16 and 100 PEs and barely better at larger system sizes. At such small decomposition limits, performance gets worse due to an excessive increase in the workload size (i.e. the number of edges in the graph). Our traffic compiler sweeps the design space and automatically selects the best decomposition limit.

2) Clustering: In Fig. 9, we show the effect of Clustering on performance with increasing PE counts. Clustering provides an improvement over Decomposition since it accounts for compute and message injection costs accurately, but that improvement is small (1%–18%). Remember from Section III that Clustering is a lightweight, inexpensive optimization that attempts to improve load balance; as a result, we expect limited benefits.
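The decomposition-limit sweep described above can be sketched as follows. The cost model here is a deliberately simplified stand-in (serialization of the largest node versus average edge work per PE), not the paper's cycle-accurate simulator, and all names are illustrative.

```python
import math

def decompose_cost(degrees, limit, num_pes):
    """Toy cycle estimate for one graph step under a fanin/fanout limit."""
    edges, max_node = 0, 0
    for d in degrees:
        if limit is None or d <= limit:
            edges += d
            max_node = max(max_node, d)
        else:
            # splitting a degree-d node into a tree of degree-<=limit nodes
            # adds internal nodes and therefore extra edges
            internal = math.ceil((d - 1) / (limit - 1)) - 1
            edges += d + internal * limit
            max_node = max(max_node, limit)
    # either the largest node serializes the step, or average edge work per PE does
    return max(max_node, math.ceil(edges / num_pes))

def best_limit(degrees, num_pes, limits=(2, 4, 8, 16, 32, 256, None)):
    """Sweep candidate decomposition limits and keep the cheapest one."""
    return min(limits, key=lambda k: decompose_cost(degrees, k, num_pes))
```

Even this toy model reproduces the qualitative behavior reported above: a moderate limit beats no decomposition at high PE counts, while an overly aggressive limit inflates the edge count.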

ReCoSoC 2010 May 17-19, 2010, Karlsruhe, Germany 91


Fig. 8: Decomposition Limits (cnet-small)
Fig. 9: Decomposition and Clustering (cnet-default)
Fig. 10: Decomposition, Clustering and Placement (cnet-default)
Fig. 11: Clustering, Placement and Fanout-Routing (ibm01)

3) Placement: In Fig. 10, we observe that Placement provides as much as 2.5× performance improvement over a randomly placed workload as PE counts increase. At high PE counts, localized traffic reduces bisection bottlenecks and communication latencies. However, Placement is less effective at low PE counts, since the NoC is primarily busy injecting and receiving traffic and NoC latencies are small and insignificant. Moreover, good load-balancing is crucial for harnessing the benefits of a high-quality placement (see Figure 15 with other benchmarks).

4) Fanout-Routing: We show performance scaling with increasing PEs for the Bellman-Ford ibm01 workload using Fanout Routing in Fig. 11. The greatest performance benefit (1.5×) from Fanout Routing comes when redundant messages distributed over few PEs can be eliminated effectively. The absence of benefit at larger PE counts is due to negligible shareable edges, as we suggested in Section III.

5) Fine-Grained Synchronization: In Fig. 12, we find that the benefit of Fine-Grained Synchronization is greatest (1.6×) at large PE counts, when latency dominates performance. At low PE counts, although NoC latency is small, elimination of the global barrier enables greater freedom in scheduling PE operations, and consequently we observe a non-negligible improvement (1.2×) in performance. Workloads with a good balance between communication time and compute time will achieve a significant improvement from fine-grained synchronization due to greater opportunity for overlapped execution.

Fig. 12: Clustering, Placement, Fanout-Routing and Fine-Grained Synchronization (ibm01)
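The saving Fanout Routing exploits can be illustrated with a small sketch (an assumed model of the idea, not the authors' actual router): a source injects one message per distinct destination PE rather than one per sink, so the benefit vanishes when sinks rarely share a PE.

```python
def messages_per_sink(sink_pes):
    """Baseline: the source injects one message for every sink edge."""
    return len(sink_pes)

def messages_per_pe(sink_pes):
    """Fanout Routing: one message per distinct destination PE; the
    destination then fans the value out to its local sinks."""
    return len(set(sink_pes))

def fanout_saving(sink_pes):
    """Ratio of baseline to fanout-routed message count for one source."""
    return messages_per_sink(sink_pes) / messages_per_pe(sink_pes)
```

For example, five sinks spread over only two PEs need just two network messages instead of five; five sinks on five different PEs (the large-PE-count regime) gain nothing.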


Fig. 13: Performance Ratio at 25 PEs
Fig. 14: Performance Ratio at 256 PEs
Fig. 15: Performance Ratio at 2025 PEs
Fig. 16: How we compute area savings (AreaRatio = PEunopt / PEopt)

B. Cumulative Performance Impact

We look at cumulative speedup contributions and relative scaling trends of all optimizations for all workloads at 25 PEs, 256 PEs and 2025 PEs.

At 25 PEs (Fig. 13), we observe modest speedups in the range 1.5× to 3.4× (2× mean), which are primarily due to Fanout Routing. Placement and Clustering are unable to contribute significantly since performance is dominated by computation. Fine-Grained Synchronization also provides some improvement, but as we will see, its relative contribution increases with PE count.

At 256 PEs (Fig. 14), we observe larger speedups in the range 1.2× to 8.3× (3.5× mean), due to Placement. At these PE sizes, the performance bottleneck begins to shift to the network, so reducing traffic on the network has a larger impact on overall performance. We continue to see performance improvements from Fanout Routing and Fine-Grained Synchronization.

At 2025 PEs (Fig. 15), we observe an increase in speedups in the range 1.2× to 22× (3.5× mean). While there is an improvement in performance from Fine-Grained Synchronization compared to the smaller PE cases, the modest increase suggests that the contributions from the other optimizations are saturating or diminishing.

Overall, we find that ConceptNet workloads show impressive speedups of up to 22×. These workloads have decomposable nodes that allow better load-balancing and have high locality. They are also the workloads with the greatest need for Decomposition. Bellman-Ford workloads also show good overall speedups, as high as 8×. These workloads are circuit graphs and naturally have high locality and fanout. Matrix-Multiply workloads are mostly unaffected by these optimizations and yield speedups not exceeding 4× at any PE count. This is because the compute phase dominates the communicate phase; compute requires high-latency (9 cycles/edge, from Table II) floating-point operations for each edge. It is also not possible to decompose inputs, due to the non-associativity of floating-point accumulation. As an experiment, we decomposed both inputs and outputs of the fidapm37 workload at 2025 PEs and observed an almost 2× improvement in performance.

C. Cumulative Area and Energy Impact

For some low-cost applications (e.g. embedded) it is important to minimize NoC implementation area and energy. The optimizations we discuss are equally relevant when cost is the dominant design criterion.

To compute the area savings, we pick the smallest unoptimized PE count that requires at most 1.1× the cycles of the best unoptimized case (the 10% slack accounts for diminishing returns at larger PE counts; see Figure 16). For the fully optimized workload, we identify the PE count that yields performance equivalent to the best unoptimized case.
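The area and energy bookkeeping just described can be written down directly. The dictionaries mapping PE count to simulated cycles are hypothetical data shapes, but the selection rule (10% slack) and the activity formula follow the text.

```python
def area_ratio(unopt_cycles, opt_cycles, slack=1.1):
    """PEunopt / PEopt as in Fig. 16: smallest unoptimized PE count within
    the 10% slack of its own best performance, versus the smallest optimized
    PE count matching the best unoptimized performance."""
    best = min(unopt_cycles.values())
    pe_unopt = min(pe for pe, c in unopt_cycles.items() if c <= slack * best)
    pe_opt = min(pe for pe, c in opt_cycles.items() if c <= best)
    return pe_unopt / pe_opt

def switch_activity(packets, cycles, ports):
    """Activity = (2 / Ports) x (Packets / Cycles) per Split/Merge unit,
    the factor fed into the XPower-based dynamic power model."""
    return (2 / ports) * (packets / cycles)
```

With `unopt = {25: 500, 256: 120, 2025: 100}` and `opt = {25: 90, 256: 40}`, only the 2025-PE unoptimized point falls within the 10% slack, while 25 optimized PEs already match the best unoptimized cycles, giving an area ratio of 81.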


We report these area savings in Figure 17. The ratio of these two PE counts is 3–15 (mean of 9), suggesting these optimizations allow much smaller designs.

To compute energy savings, we use the switching activity factor and network cycles to derive the dynamic energy reduction in the network. The switching activity factor is extracted from the number of packets traversing the Split and Merge units of a Mesh Switch over the duration of the simulation: Activity = (2/Ports) × (Packets/Cycles). In Figure 18 we see a mean 2.7× reduction in dynamic energy at 25 PEs due to the reduced switching activity of the optimized workload. While we only show dynamic energy savings at 25 PEs, we observed even higher savings at larger system sizes.

Fig. 17: Area Ratio to Baseline
Fig. 18: Dynamic Energy Savings at 25 PEs

VI. CONCLUSIONS AND FUTURE WORK

We demonstrate the effectiveness of our traffic compiler over a range of real-world workloads, with performance improvements between 1.2× and 22× (3.5× mean), PE count reductions between 3× and 15× (9× mean) and dynamic energy savings between 2× and 3.5× (2.7× mean). For large workloads like cnet-default, our compiler optimizations were able to extend scalability to 2025 PEs. We observe that the relative impact of our optimizations changes with system size (PE count), and our automated approach can easily adapt to different system sizes. We find that most workloads benefit from Placement and Fine-Grained Synchronization at large PE counts and from Clustering and Fanout Routing at small PE counts. The optimizations we describe in this paper have been used for SPICE simulator compute graphs, which differ from the BSP compute model. Similarly, we can extend this compiler to support an even larger space of automated traffic optimization algorithms for different compute models.

REFERENCES

[1] R. K. Brayton and C. McMullen, "The decomposition and factorization of boolean expressions," in Proc. Intl. Symp. on Circuits and Systems, 1982.
[2] D. Gajski, N. Dutt, A. Wu, and S. Lin, High-Level Synthesis: Introduction to Chip and System Design. Kluwer Academic Publishers, 1992.
[3] D. Greenfield, A. Banerjee, J. G. Lee, and S. Moore, "Implications of Rent's rule for NoC design and its fault tolerance," in NOCS: First International Symposium on Networks-on-Chip, 2007.
[4] J. Duato, S. Yalamanchili, and L. Ni, Interconnection Networks: An Engineering Approach. Elsevier, 2003.
[5] D. Yeung and A. Agarwal, "Experience with fine-grain synchronization in MIMD machines for preconditioned conjugate gradient," SIGPLAN Notices, vol. 28, no. 7, pp. 187–192, 1993.
[6] L. G. Valiant, "A bridging model for parallel computation," CACM, vol. 33, no. 8, pp. 103–111, August 1990.
[7] M. deLorimier, N. Kapre, A. DeHon, et al., "GraphStep: a system architecture for Sparse-Graph algorithms," in Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2006, pp. 143–151.
[8] M. deLorimier and A. DeHon, "Floating-point sparse matrix-vector multiply for FPGAs," in Proceedings of the International Symposium on Field-Programmable Gate Arrays, 2005.
[9] W. Ho and T. Pinkston, "A methodology for designing efficient on-chip interconnects on well-behaved communication patterns," in Proc. Intl. Symp. on High-Perf. Comp. Arch., 2006.
[10] V. Soteriou, H. Wang, and L.-S. Peh, "A statistical traffic model for on-chip interconnection networks," in Proc. Intl. Symp. on Modeling, Analysis, and Sim. of Comp. and Telecom. Sys., 2006.
[11] Y. Liu, S. Chakraborty, and W. T. Ooi, "Approximate VCCs: a new characterization of multimedia workloads for system-level MPSoC design," in DAC, pp. 248–253, June 2005.
[12] G. Varatkar and R. Marculescu, "On-chip traffic modeling and synthesis for MPEG-2 video applications," IEEE Trans. VLSI Syst., vol. 12, no. 1, pp. 108–119, January 2004.
[13] J. Balfour and W. J. Dally, "Design tradeoffs for tiled CMP on-chip networks," in Proc. Intl. Conf. Supercomput., 2006.
[14] R. Mullins, A. West, and S. Moore, "Low-latency virtual-channel routers for on-chip networks," in ISCA, 2004.
[15] P. P. Pande, C. Grecu, M. Jones, A. Ivanov, and R. Saleh, "Performance evaluation and design trade-offs for network-on-chip interconnect architectures," IEEE Trans. Comput., vol. 54, no. 8, pp. 1025–1040, August 2005.
[16] N. Kapre, N. Mehta, R. Rubin, H. Barnor, M. J. Wilson, M. Wrighton, and A. DeHon, "Packet switched vs. time multiplexed FPGA overlay networks," in Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines, 2006, pp. 205–216.
[17] A. Caldwell, A. Kahng, and I. Markov, "Improved algorithms for hypergraph bipartitioning," in Proceedings of the Asia and South Pacific Design Automation Conference, January 2000, pp. 661–666.
[18] C.-W. Tseng, "Compiler optimizations for eliminating barrier synchronization," SIGPLAN Not., vol. 30, no. 8, pp. 144–155, 1995.
[19] H. Liu and P. Singh, "ConceptNet – a practical commonsense reasoning tool-kit," BT Technology Journal, vol. 22, no. 4, p. 211, October 2004.
[20] NIST, "Matrix Market," http://math.nist.gov/MatrixMarket/, June 2004. Maintained by the National Institute of Standards and Technology.
[21] L. M. Ni and P. K. McKinley, "A survey of wormhole routing techniques in direct networks," IEEE Computer, 1993.
[22] The Programmable Logic Data Book-CD, Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124, 2005.

Investigation of Digital Sensors for Variability
Characterization on FPGAs
Florent Bruguier, Pascal Benoit and Lionel Torres
LIRMM, CNRS - University of Montpellier 2
161 rue Ada, 34392 Montpellier, France
Email: {firstname.lastname@lirmm.fr}

Abstract—In this paper, we address variability problems in FPGA devices both at design-time and at run-time. We consider a twofold approach to compensate variations, based on measurements issued from digital sensors implemented with FPGA building blocks. We compare two digital sensor structures: the Ring Oscillator and the Path Delay Sensor. The Ring Oscillator allows a fine-grain variability characterization thanks to a tiny replicable hard-macro structure (2 slices). Although less accurate, we show that the Path Delay Sensor has a lower total area overhead as well as a smaller latency compared to the ring oscillator implementation. In our experiments conducted on Spartan-3 FPGAs, the ring oscillator is used to perform intra-chip cartographies (1980 positions), while both sensors are then compared for characterizing inter-chip performance variations. We conclude that the two structures are efficient for fast variability characterization in FPGA devices. The Ring Oscillator is the best structure for design-time measurements, whereas the Path Delay Sensor is the preferred structure for rapid performance estimations at run-time with a minimal area overhead.

I. INTRODUCTION

Variability has become a major issue with recent technologies in the semiconductor industry [1]. While process variations impact process, supply voltage and internal temperature [2], chip performances are also dependent on environmental and applicative changes that may further influence a chip's behavior [3].

Field-Programmable Gate Array (FPGA) devices are not spared by these unpredictable disparities. As underlined in [4], both inter-die and within-die variations affect FPGAs. It is therefore necessary to implement solutions in order to compensate these variations.

Because of their inherent reconfigurability, it is possible to place a component of a design at a specific place on the FPGA floorplan, and to relocate it when required. In the literature, it has been suggested either to model FPGA variability [5] [6], or to measure it [7]. Both methods then suggest constraining or adapting the design so that it takes these variations into account and improves performance yield.

In this paper, we investigate the problem of variability characterization on FPGAs. We provide a twofold method for variability compensation based on digital sensors used either at design-time or at run-time. Our study is focused on digital sensors, directly implemented in the FPGA building blocks. Their role is to measure the effective performance of the device. The novelty of this paper is that we compare two digital sensor structures and analyze their accuracy and area overhead in order to determine how they could be efficiently used to characterize the variability of any FPGA. In the scope of this paper, experiments were conducted on Xilinx Spartan-3 FPGAs.

The remainder of the paper is organized as follows. The next section presents related works in the area of variability analysis and compensation techniques on FPGAs. Section III introduces a new multi-level compensation flow. In Section IV, two digital sensors are presented to measure performance variations. Finally, in Section V, results are exposed and discussed: overhead comparison, experimental setup, and intra- and inter-chip comparisons are analyzed to determine the efficiency and the accuracy of the provided digital sensors.

II. RELATED WORKS

Few recent papers suggest techniques to characterize and compensate performance variability on FPGAs. A first approach is based on modeling. Three papers have suggested theoretical techniques [5] [6] [8]. They are all verified through timing modeling. First, a multi-cycle Statistical Static Timing Analysis (SSTA) placement algorithm is exposed in [5]. In simulation, it is possible to improve performance yield by 68.51% compared to a standard SSTA. A second approach proposes a variability-aware design technique to reduce the impact of process variations on the timing yield [8]. Timing variability is reduced thanks to an increased use of shorter routing segments. On another side, the author of [6] confronts different strategies to compensate within-die stochastic delay variability. Worst-case design, SSTA, entire FPGA reconfiguration and sub-circuit relocation within an FPGA are considered. SSTA provides better results than worst-case design, although both reconfiguration methods allow significant improvements. However, in these papers, only theoretical techniques and simulated results are exposed.

A second approach is based on delay measurements [7]. In this paper, a ring oscillator is placed on the FPGA. The frequency of each oscillator is measured and a cartography of the chip is done. Nevertheless, the study only characterizes groups of 8 LUTs together, and the impact of external variations is not considered.

These two sorts of approaches suggest that two stages of characterization with digital sensors are required: one off-line and one on-line.


III. MULTI-LEVEL COMPENSATION FLOW

In order to tackle FPGA variability issues, a multi-level compensation flow is proposed. Basically, it uses FPGA digital resources (LUTs and interconnects) to implement Hard Macro sensors. An overview of our methodology is depicted in Figure 1. It is divided into two parts. In the first phase, the FPGA performance is deeply analyzed off-line at a fine granularity by our dedicated monitors. At run-time, a reduced monitoring setup is implemented within the system itself, in order to check the run-time performances. Each level is further explained in the two following sections.

Fig. 1. Overall flow
Fig. 2. Off-line and On-line Monitoring

A. Off-line monitoring and module placement strategy

The first step of the flow is depicted in Figure 2 and is divided into two global parts: the FPGA characterization and the module placement strategy. The monitoring system is directly applied at the technological level, i.e. it is intended to check the intrinsic performances of the FPGA device at a fine granularity.

In order to realize an accurate characterization of performances, an array of sensors covering the whole area is used (Fig. 2(a)). Sensor data are collected and analyzed to build a cartography of the floorplan (Fig. 2(b)).

Once the cartography of the FPGA is built, a placement strategy is performed, considering both the system and its run-time monitoring service (Fig. 2(c)). Basically, it consists in placing critical modules on "best performance" areas. A subset of on-line sensors is also implemented within the system in order to check the evolution of the performances at run-time. The sensor placement strategy takes into account requirements of run-time modules as well as the result of the off-line cartography.

B. On-line monitoring and dynamic compensation

The second stage of the compensation flow is based on hardware run-time monitoring. The run-time system implemented in the FPGA is composed of a microprocessor, some peripherals, a Management Unit (MU), a set of sensors and actuators. The on-line monitoring process is illustrated in Figure 2(c).

Our objective is to perform a dynamic compensation of system variations. For this purpose, a subset of digital sensors using the FPGA resources is implemented. Digital sensors measure performances; monitoring data are then collected and analyzed by a management unit. Based on the information available, this unit can adapt the system to the actual performances; for instance, it is possible to adjust the frequency with a DFS actuator (Dynamic Frequency Scaler).

As depicted in Figure 1, a deeper FPGA performance analysis can be triggered when the management unit identifies suspicious system behavior. A partial or total analysis can then be performed in order to build a new cartography and to update the module placement strategy.

IV. DIGITAL HARDWARE SENSORS IN FPGA

The objective of the previously described compensation flow is to adapt the system in relation to the effective performances of the FPGA device. Next, we study how to measure this performance locally and globally. The idea developed in this paper is to use digital hardware sensors designed with internal resources of the FPGA, namely CLBs and switch matrices. In this section, we present the principle and the implementation of two structures: a Ring Oscillator and a Path Delay Sensor.
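The "critical modules on best-performance areas" step above can be sketched as a simple greedy assignment. The data shapes and names are illustrative assumptions; the real flow works on a full cartography together with module constraints.

```python
def place_critical(cartography, modules):
    """cartography: {(x, y): measured frequency at that CLB location};
    modules: names ordered from most to least timing-critical.
    Greedily give the most critical module the fastest measured location."""
    fastest_first = sorted(cartography, key=cartography.get, reverse=True)
    return dict(zip(modules, fastest_first))
```

For example, with three measured locations and two modules, the most critical module lands on the highest-frequency CLB and the next module on the second-fastest one.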

A. Ring Oscillator Sensor

The Ring Oscillator Sensor is based on the measurement of the oscillator frequency. In [9], the author exposes an internal temperature sensor for FPGAs: the frequency of a ring oscillator is measured and converted into a temperature. However, that ring oscillator is implemented in an old technology where process variability is very low.

An update of this sensor is exposed here. Its structure is depicted in Figure 3. The main part of the sensor is a 2p+1 inverter chain. The oscillation frequency directly depends on the FPGA performance capabilities. The first logic stage enables the oscillator to run for a fixed number of periods of the main clock. The flip-flop at the end of the inverter chain is used as a frequency divider and allows filtering glitches from the oscillator. The final logic stage counts the number of transitions in the oscillator and transmits the count result. Then, the count result is used to calculate the oscillator frequency as follows:

F = (count × f) / p    (1)

where F is the ring oscillator frequency, count is the 14-bit value of the counter, f is the operating frequency of the clock and p is the number of enabled clock periods for which the sensor is active.

Fig. 3. Ring Oscillator

In order to use this sensor for the FPGA characterization, a three-inverter ring oscillator was implemented. With this configuration, the core of the sensor (ring oscillator + first flip-flop) takes only 4 LUTs. A Hard Macro was designed so that the same sensor structure can be mapped at each floorplan location (Fig. 4(a)). It allows characterizing each CLB of an FPGA separately.

B. Path Delay Sensor

In ASICs, sensors for Critical Path Monitoring (CPM) are used to estimate the speed of a process. A. Drake presents a survey of CPM [10]. Many techniques exist to manage critical paths, but very few are used in FPGAs. The Path Delay Sensor proposed here is directly inspired by CPM; the idea is to adapt CPM to FPGAs. Indeed, the regularity of the FPGA structure makes it easier to create a critical path replica in an FPGA than in an ASIC. The structure of the Path Delay Sensor is depicted in Figure 5. It is composed of n LUTs and n flip-flops (FFs). The LUTs are chained together and an FF is set at the output of each LUT. A clock signal is applied to the chain and is propagated through the LUTs. At each rising edge, an n-bit thermometer code is available at the output of the FFs. This thermometer code is representative of LUT and interconnect performances.

Fig. 4. Hard Macro implementation for Spartan-3: (a) Ring Oscillator, (b) Path Delay
Fig. 5. Path delay

For example, while the sensor is running, the thermometer code is stored. This code looks like "11111111111111000000000000001111". It is then analyzed: the position Nz of the last 0 is identified. Two different utilizations are then feasible:
• A fast one, where Nz is directly compared to the length of a critical path.
• A slow one, where the time T required to cross one LUT and the associated interconnect is approximated:

T = (Nz + 2) / f    (2)

where f is the frequency of the clock signal applied to the sensor and Nz + 2 represents the number of crossed LUTs plus one for the crossing of the sample-rate FF. The time T measured here enables a fast estimate of the maximum frequency of one critical path.

In order to have relevant information, the size of this sensor must take into account the FPGA family in which it operates. For example, for a Spartan-3 device, this sensor is composed of 32 stages. It allows propagating a complete period of the Spartan-3 reference clock frequency (50 MHz). Figure 4(b) shows the Hard Macro integration of this sensor.

V. EXPERIMENTAL RESULTS

The Ring Oscillator and Path Delay Sensor described in the previous section are studied and compared. They were
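Equations (1) and (2) and the thermometer-code analysis amount to the following transcription sketch; `last_zero_position` is an assumed helper name, not part of the sensor design.

```python
def ring_osc_freq(count, f_clk, p):
    """Eq. (1): F = (count * f) / p, with count the 14-bit counter value,
    f_clk the driving clock and p the number of enabled clock periods."""
    return count * f_clk / p

def last_zero_position(code):
    """Position Nz of the last '0' in the stored thermometer code."""
    return code.rfind('0')

def path_delay_estimate(nz, f_clk):
    """Eq. (2): T = (Nz + 2) / f, the slow utilization of the sensor."""
    return (nz + 2) / f_clk
```

With the example code from the text, Nz is 27 (the 28th bit), and a 50 MHz drive clock with a half-full 14-bit counter over 2046 enabled periods corresponds to a 25 MHz oscillator.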

both implemented into a Xilinx Spartan-3 Starter Kit board with an XC3S1000-4FT256 (Fig. 6(b)). This FPGA has a nominal operating point of 1.2 V @ 25 °C. In order to ensure reproducible results, the temperature is kept constant in a thermal chamber during all measurements; since this instrument only allows heating, experiments are done at 40 °C (this point will be discussed further) (Fig. 6(a)). We first analyze the resource overhead required by both structures. Then, the measurement errors are exposed and their impact is discussed. Finally, we provide both intra- and inter-chip variability characterizations of Spartan-3 FPGA devices.

Fig. 6. Experimental setup: (a) experimental setup in the thermal chamber, (b) Spartan-3 board

A. Overhead comparison

This section introduces an overhead comparison of the two sensors (Table I). The area impact and the computing time for each sensor are presented.

TABLE I
OVERHEAD COMPARISON

                 Hard Macro Size  Total Size  Sensor Latency
                 (# Slices)       (# Slices)  (# Cycles)
Ring Oscillator  2                80          2046
Path Delay       16               16          2

The Hard Macro corresponds to the "probe" of the sensor. A smaller probe allows characterizing a smaller area during the off-line monitoring phase; that is why the Hard Macro size is directly connected to the minimal probing granularity. Regarding on-line monitoring, a small Hard Macro size is preferable so that it can be put close to the critical modules (e.g. a critical path). In this case, the Ring Oscillator Sensor is more interesting.

The total size of the sensor represents the space needed for one full implementation of the sensor. This is to be compared to the total space available in the FPGA. Indeed, the space used for sensor implementation is no longer available for the monitored design. The Path Delay Sensor is better in this case.

The computing latency represents the number of clock periods between two successive measurements. Since the Ring Oscillator Sensor requires 2046 cycles, the Path Delay Sensor allows updating the performance measurements more rapidly. However, this potential advantage is to be weighed against the processing capabilities of the management unit.

It is possible to take benefit from each sensor in different contexts. The Ring Oscillator Sensor will preferably be used for off-line monitoring characterization and for an accurate management of performance. The Path Delay Sensor will be preferred for a dynamic and direct critical path delay management.

In the next sections, the results of our experiments are presented. Our objective is to analyze the effectiveness of each sensor in its chosen field.

B. Impact of sources of error

This part proposes a study of the impact of each error source. We successively address voltage error, temperature error and toggle count error.

The Xilinx Spartan-3 Starter Kit does not provide a feature to manage the power supply voltage directly. Hence, the voltage regulator of our boards was substituted by an external voltage regulator. Figure 7 depicts the normalized output frequency of the Hard Macro depending on the power supply voltage. A voltage variation of 0.4 V around the default value implies a 22.5% variation of the sensor frequency. In our experiments, the supply voltage variation was only about 0.01 V between boards. It involves a variation of about 0.65 MHz in the frequency (Tab. II). The supply voltage variation due to the fluctuation effects of the board regulator is less than 0.001 V. This has an impact of less than 65 kHz on the frequency.

Fig. 7. Normalized output frequency versus power supply voltage

A representation of the normalized frequency depending on the temperature variation is illustrated in Figure 8. The frequency of the ring oscillator Hard Macro decreases linearly as the temperature increases. During our experiments, the FPGA device is placed into a thermal chamber with a constant temperature (the variation is more or less 0.5 °C). This change causes a fluctuation of about 0.1 MHz of the oscillator frequency.
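The intra-chip error budget tabulated in Table II, and the averaging argument that follows it, can be reproduced numerically. The values are the paper's own estimates; the 6σ spread convention and N = 500 samples come from the text.

```python
import math

# intra-chip error contributions from Table II, in Hz
ERROR_SOURCES_HZ = {
    "intra-board voltage (0.001 V)": 65e3,
    "temperature (+/- 0.5 C)": 100e3,
    "toggle count (+/- 3)": 73e3,
}

def total_error_hz():
    """Worst-case sum of the external error sources (~238 kHz)."""
    return sum(ERROR_SOURCES_HZ.values())

def averaged_spread_hz(n=500):
    """Total variation ~ 6 sigma; averaging N samples shrinks sigma by
    sqrt(N) (Eq. 3), leaving roughly a 10 kHz residual spread."""
    sigma = total_error_hz() / 6
    return 6 * sigma / math.sqrt(n)
```

Running this reproduces the ~238 kHz raw budget and the ~10 kHz post-averaging figure quoted in the text.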

N ormalizedF requency = −0.0010T + 1.0264

Fig. 8. Normalized Output frequency versus external temperature

Fig. 9. Oscillator frequency cartography at T = 40◦ C for Board 3


Regarding the ring oscillator, the fluctuation of the toggle
count is also to take into account. For a best efficiency, this
sensor was dimensioned for an error of a maximum of 3 in
the toggle count. It results an error on the measurement about
73kHz.
TABLE II
E STIMATE OF SOURCES OF ERROR

Error Frequency impact

Intra-board Voltage 0.001V 0.065M Hz


Inter-board Voltage 0.01V 0.65M Hz
Temperature 0.5◦ C 0.1M Hz
Toggle count 3 0.073M Hz

The total error due to external sources for intra-chip measurement is around 238 kHz. Nevertheless, all measurements are averaged 500 times, so that the error due to these variations is decreased. In fact, the total variation is around 6 times the standard deviation σ. The averaged standard deviation σ̄ is then:

σ̄ = σ / √N    (3)

where N is the number of samples. Consequently, the total error attributable to external sources of variation is around 0.005%, which corresponds to 10 kHz. This variation will be compared to the results obtained in the next sections.

C. Intra-Chip Characterization using Ring Oscillator Sensor

The Ring Oscillator Sensor was used to achieve full cartographies of the Spartan-3 S1000 chip. During the measurements, the external temperature of the chip is kept constant at 40 °C. The sensor is alternately placed at each CLB location, arranged into an array of 40 × 48 positions. Each measurement provided by the sensor is sent to an external computer via a UART (Universal Asynchronous Receiver Transmitter), which performs the cartography of the chip (Fig. 9). The resulting frequency presented here corresponds to the mean oscillation frequency at the output of the Hard Macro. It takes around 5 hours for a complete cartography with 500-fold averaging for each point.

Fig. 10. Oscillator frequency cartography at T = 40 °C for Board 4

Figures 9 and 10 depict the results of two cartographies. The cartography of Board 4 is relatively constant, with a maximum frequency variation of about 458 kHz, while the second cartography presents a relatively large variation, with a maximum frequency variation of about 832 kHz. The cartography of Board 3 shows two features. First, there is a gradient of frequency all over the chip. Second, some slices in the middle of the chip are much less efficient than others. This type of result reinforces the necessity of a fine-grain cartography in order to achieve a placement strategy for optimal performance on FPGAs.
Note that the variations measured within a single board (> 100 kHz) are large compared to the error calculated previously (≈ 10 kHz). Since we can assume that the supply voltage and the external temperature are fixed, we can infer that the variations measured on the frequency are effectively due to internal



performance variations.
In future work, cartographies will be conducted with other temperature settings in order to confirm the trend on a same chip.

D. Inter-Chip Characterization

A similar characterization was conducted on several boards. Sample results are summarized in Table III. Figures 9 and 10 illustrate the difference between two boards. We can see that there are significant discrepancies between boards. Indeed, we observe here a difference of about 16.2% in the average frequency.
In this section, the Path Delay Sensor is used to perform a comparison between multiple boards. In Table IV, we compare the five boards from the experiment. This table introduces a time T, which corresponds to the delay required by a signal edge to cross one LUT and the associated interconnect of the sensor. The Path Delay Sensor corroborates the results obtained with the Ring Oscillator Sensor (Fig. 11). However, with the Path Delay Sensor, we cannot distinguish the better FPGA between boards 3 and 4. This sensor is less accurate, but nevertheless allows fast estimations. For this reason, it will be used for on-line monitoring services.

TABLE III
COMPARISON OF RING OSCILLATOR FREQUENCY BETWEEN MULTIPLE BOARDS

Board            1      2      3      4      5
Frequency (MHz)  215.0  204.0  198.9  196.6  185.0

TABLE IV
COMPARISON OF PATH DELAY SPEED BETWEEN MULTIPLE BOARDS

Board   1     2     3     4     5
T (ns)  2.86  3.07  3.2   3.2   3.33

Fig. 11. Comparison between multiple boards

VI. CONCLUSION AND FUTURE WORKS

Managing variability is one of the major issues in recent silicon technologies. As previously mentioned in the literature, FPGA devices are also subject to process, voltage, and temperature variations. In this paper, we have considered a twofold compensation flow for FPGAs, based on the use of digital sensors directly implemented in the reconfigurable resources. For this purpose, this article brought new results on the comparison of a Ring Oscillator structure and a Path Delay Sensor. In our experiments on Spartan-3 FPGAs, the ring oscillator was successfully used to perform intra-chip cartographies (1980 positions). Both sensors were evaluated for characterizing inter-chip performance variations.
We conclude that both structures are efficient for fast variability characterization in FPGA devices. The ring oscillator is the best structure for design-time measurements, whereas the Path Delay Sensor will be the preferred structure to allow rapid performance estimation at run-time with a minimal area overhead.
In future work, both sensors will be studied to perform run-time performance measurements. We will particularly focus on strategies to efficiently manage sensors (number, placement) and collect monitored information in order to adapt the system.

REFERENCES

[1] "International Technology Roadmap for Semiconductors," 2009. [Online]. Available: http://www.itrs.net/Links/2009ITRS/Home2009.htm
[2] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De, "Parameter variations and impact on circuits and microarchitecture," Design Automation Conference, 2003. Proceedings, pp. 338–342. [Online]. Available: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1219020
[3] O. Unsal, J. Tschanz, K. Bowman, V. De, X. Vera, A. Gonzalez, and O. Ergin, "Impact of Parameter Variations on Circuits and Microarchitecture," IEEE Micro, vol. 26, no. 6, pp. 30–39, 2006. [Online]. Available: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4042630
[4] P. Sedcole and P. Y. K. Cheung, "Parametric yield in FPGAs due to within-die delay variations: a quantitative analysis," International Symposium on Field Programmable Gate Arrays, 2007. [Online]. Available: http://portal.acm.org/citation.cfm?id=1216949
[5] G. Lucas, C. Dong, and D. Chen, "Variation-aware placement for FPGAs with multi-cycle statistical timing analysis," in FPGA, P. Y. K. Cheung and J. Wawrzynek, Eds. ACM, 2010, pp. 177–180. [Online]. Available: http://dblp.uni-trier.de/db/conf/fpga/fpga2010.html#LucasDC10
[6] P. Sedcole and P. Y. K. Cheung, "Parametric Yield Modeling and Simulations of FPGA Circuits Considering Within-Die Delay Variations," ACM Transactions on Reconfigurable Technology and Systems (TRETS), vol. 1, no. 2, 2008. [Online]. Available: http://portal.acm.org/citation.cfm?id=1371582
[7] P. Sedcole and P. Y. K. Cheung, "Within-die delay variability in 90nm FPGAs and beyond," in IEEE International Conference on Field Programmable Technology. IEEE, 2006, pp. 97–104. [Online]. Available: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4042421
[8] A. Kumar and M. Anis, "FPGA Design for Timing Yield Under Process Variations," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 18, no. 3, pp. 423–435, 2010. [Online]. Available: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4799224
[9] S. Lopez-Buedo, J. Garrido, and E. Boemo, "Dynamically inserting, operating, and eliminating thermal sensors of FPGA-based systems," IEEE Transactions on Components and Packaging Technologies, vol. 25, no. 4, pp. 561–566, 2002. [Online]. Available: http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=1178745
[10] A. Drake, Adaptive Techniques for Dynamic Processor Optimization, ser. Series on Integrated Circuits and Systems. Boston, MA: Springer US, 2008. [Online]. Available: http://www.springerlink.com/content/r61t506740v74220



Investigating Self-Timed Circuits for the
Time-Triggered Protocol
Markus Ferringer
Department of Computer Engineering
Embedded Computing Systems Group
Vienna University of Technology
1040 Vienna, Treitlstr. 3
Email: ferringer@ecs.tuwien.ac.at

Abstract—While asynchronous logic has many potential advantages compared to traditional synchronous designs, one of the major drawbacks is its unpredictability with respect to temporal behavior. Without having a high-precision oscillator, a self-timed circuit's execution speed is heavily dependent on temperature and supply voltage. Small fluctuations of these parameters already result in noticeable changes of the design's throughput and performance. This indeterminism or jitter makes the use of asynchronous logic hardly feasible for real-time applications. Based on our previous work we investigate the temporal characteristics of self-timed circuits regarding their usage in the Time-Triggered Protocol (TTP). We propose a self-adapting circuit which shall derive a suitable notion of time for both bit transmission and protocol execution. We further introduce and analyze our jitter compensation concept, which is a three-fold mechanism to keep the asynchronous circuit's notion of time tightly synchronized to the remaining communication participants. To demonstrate the robustness of our solution, we will perform temperature and voltage tests, and investigate their impact on jitter and frequency stability.

I. INTRODUCTION

Asynchronous circuits elegantly overcome some of the limiting issues of their synchronous counterparts. The often-cited potential advantages of asynchronous designs are – among others – reduced power consumption and inherent robustness against changing operating conditions [1], [2]. Recent silicon technology additionally suffers from high parameter variations and high susceptibility to transient faults [3]. Asynchronous (delay-insensitive) design offers a solution due to its inherent robustness. A substantial part of this robustness originates in the ability to adapt the speed of operation to the actual propagation delays of the underlying hardware structures, due to the feedback formed by completion detection and handshaking. While asynchronous circuits' adaptive speed is hence a desirable feature with respect to robustness, it becomes a problem in real-time applications that are based on a stable clock and a fixed (worst-case) execution time. Therefore, asynchronous logic is commonly considered inappropriate for such real-time applications, which excludes its use in an important share of fault-tolerant applications that would highly benefit from its robustness. Consequently, it is reasonable to take a closer look at the actual stability and predictability of asynchronous logic's temporal behavior. After all, synchronous designs operate on the same technology, but hide their imperfections with respect to timing behind a strictly time driven control flow that is based on worst-case timing analysis. This masking provides a convenient, stable abstraction for higher layers. In contrast, asynchronous designs simply allow the variations to happen and propagate them to higher layers. Therefore, the interesting questions are: Which character and magnitude do these temporal variations have? Can these variations be tolerated or compensated to allow the usage of self-timed circuits in real-time applications?
In our research project ARTS¹ (Asynchronous Logic in Real-Time Systems) we are aiming to find answers to these questions. Our project goal is to design an asynchronous TTP (Time-Triggered Protocol) controller prototype which is able to reliably communicate with a set of synchronous equivalents even under changing operating conditions. TTP was chosen for this reference implementation because it can be considered as an outstanding example for hard real-time applications. In this paper we present new results based on our previous work. We will investigate the capabilities of self-timed designs to adapt themselves to changing operating conditions. With respect to our envisioned asynchronous TTP controller we will also study the characteristics of jitter (and the associated frequency instabilities of the circuit's execution speed) and our corresponding compensation mechanisms. We implement and investigate a fully functional transceiver unit, as required for the TTP controller, to demonstrate the capabilities of the proposed solution with respect to TTP's stringent requirements.

¹ The ARTS project receives funding from the FIT-IT program of the Austrian Federal Ministry of Transport, Innovation and Technology (bm:vit, http://www.bmvit.gv.at/), project no. 813578.

The paper is structured as follows: In Section II we give some important background information on TTP, the research project ARTS, and the used asynchronous design style. Section III presents related work and describes our previous work and the respective results. We show and discuss experimental results in Section IV, before concluding in Section V.

II. BACKGROUND

A. Time-Triggered Protocol

The Time-Triggered Protocol (TTP) has been developed for the demanding requirements of distributed (hard) real-time



Figure 1. TTP system structure.

Figure 2. ARTS system setup.

systems. It provides several sophisticated means to incorporate fault-tolerance and at the same time keep the communication overhead low. TTP uses extensive knowledge of the distributed system to implement its services in a very efficient and flexible way. Real-time systems in general and TTP in particular are described in detail in [4], [5].
A TTP system generally consists of a set of Fail-Silent Units (FSUs), all of which have access to two replicated broadcast communication channels. Usually two FSUs are grouped together to form a Fault-Tolerant Unit (FTU), as illustrated in Figure 1. In order to access the communication channel, a TDMA (Time Division Multiple Access) scheme is implemented: Communication is organized in periodic TDMA rounds, which are further subdivided into various sending slots. Each node has statically assigned sending slots, thus the entire schedule (called Message Descriptor List, MEDL) is known at design-time already. Since each node a priori knows when other nodes are expected to access the bus, message collision avoidance, membership service, clock synchronization, and fault detection can be handled without considerable communication overhead. Explicit Bus Guardian (BG) units are used to limit bus access to the node's respective time slots, thereby solving the babbling-idiot problem. Global time is calculated by a fault-tolerant, distributed algorithm which analyzes the deviations in the expected and actual arrival times of messages and derives a correction term at each node.
The Time-Triggered Protocol provides very powerful means for developing demanding real-time applications. The highly deterministic and static nature makes it seemingly unsuited for an implementation based on asynchronous logic. Hence, these properties also make TTP an interesting and challenging topic for our exploration of the predictability of self-timed logic.

B. ARTS Project

The aim of the research project ARTS (Asynchronous Logic in Real-Time Systems) is to integrate asynchronous logic into real-time systems. For this purpose, an asynchronous TTP controller is developed and integrated into a cluster of (synchronous) TTP chips, as illustrated in Figure 2. The asynchronous device should be capable of successfully taking part in time-triggered communication, thereby using solely the system-inherent determinism and a priori knowledge of TTP to derive a suitable and precise time reference for both bit-timing as well as high-level services.
The central concern for the project is the predictability of asynchronous logic with respect to its temporal properties. We therefore investigate jitter sources (e.g., data dependencies, voltage fluctuations, temperature drift) and classify their impact on the execution time. Using an adequate model allows us to identify critical parts in the circuit and implement measures for compensation. Another issue concerns timeliness itself, as without a reference clock we do not have an absolute notion of time. Instead, we will use the strict periodicity of TTP to continuously re-synchronize to the system and derive a time-base for message transfer.
In this perspective, the main challenge lies in the method of resynchronization, as the controller will use the data stream provided by the other communication participants to dynamically adapt its internal time reference. The chosen solution is to use a free-running, self-timed counter for measuring the duration of external events of known length (i.e., single bits in the communication stream). The so-gained reference measurement can in turn be used to generate ticks with the period of the observed event. This local time base should enable the asynchronous node to derive a sufficiently accurate time reference for both low-level communication (bit-timing, data transfer) as well as high-level services (e.g. macrotick generation). The disturbing impact of environmental fluctuations is automatically compensated over time, because periodic resynchronization will lead to different reference measurements, depending on the current speed of the counter circuit.

C. Asynchronous Design Style

Our focus is on delay-insensitive (DI) circuits², as they exhibit more pronounced "asynchronous" properties than bounded-delay circuits. More specifically, we use the level-encoded dual-rail approach (LEDR [6], [7], used in Phased Logic [8] and Code Alternation Logic [9]), which encodes a logic signal on two physical wires. We prefer the more complex 2-phase implementation over the popular 4-phase protocol [2], [10], as it is more elegant and we already gained some practical experience with it. LEDR periodically alternates between two disjoint code sets for representing logic "HI" and "LO" (two phases ϕ0 and ϕ1, see Figure 3), thus avoiding the need to insert NULL tokens as spacers between subsequent data items. On the structural level, LEDR designs are based on Sutherland's micropipelines [11]. In its strongly indicating mode of operation the performance is always determined by the slowest stage.

² Typically the mandatory delay constraints are hidden inside the basic building blocks, while the interconnect between these modules is considered unconstrained.



Figure 3. LEDR coding.

Using this approach, completion detection comes down to a check whether the phases of all associated signals have changed and match. As usual, handshaking between register/pipeline stages is performed by virtue of a capture done (or simply cDone) signal [11]. The two-rail encoding as well as the adaptive timing make LEDR circuits very robust against faults and changing operating conditions, unfortunately at the cost of increased area consumption and reduced performance. Interfacing single-rail inputs/outputs requires special "interface gates" [8], which must properly align ("synchronize") these to the LEDR operation phases. Concrete implementations and variations of such interface gates are described in [12].

III. STATE OF THE ART

A. Related Work

Various options for building adequate time references are commonly used in today's logic designs, e.g., crystal oscillators provide high precision at high frequencies. RC oscillators [13] or simple inverter-loops are alternatives if operating frequency and precision are no major concerns. It is also possible to generate clock signals in a fault-tolerant, distributed way [14].
Another important alternative for generating precise time references are self-timed oscillator rings, which seem to be perfectly suited for the chosen asynchronous design methodology. A lot of research has been conducted on self-timed oscillator rings (which are also based on micropipelines). For example, in [15] a methodology for using self-timed circuitry for global clocking has been proposed. The same authors also used basic asynchronous FIFO stages to generate multiple phase-shifted clock signals for high precision timing in [16]. Furthermore, it has been found that event spacing in self-timed oscillator rings can be controlled [17], [18]. The Charlie- and the drafting-effects have thereby been identified as major forces controlling event spacing in self-timed rings [16], [19].
One of the major requirements of our time base is to be stable in order to allow for reliable bus-communication. In the synchronous world, the term "jitter" is commonly used to classify the deviations of a (clock-) signal from its ideal behavior [20], [21]. It is also important for our investigations to fully understand the sources and effects of jitter to implement adequate countermeasures. In order to classify the frequency stability of single execution steps and other timed signals, we use Allan variance plots [22], [23], which provide the appropriate means for a detailed analysis. Instead of a single number, Allan deviation is usually displayed as a graph over gradually increasing durations τ of the averaging window. It therefore combines measures for both short-term and long-term stability in a single plot. Thorough circuit analysis, jitter estimation, classification, and interpretation, in combination with a corresponding model, have already been elaborated in a previous paper. The next section provides a short summary of the performed work and the respective results we found.

B. Previous Work

1) Jitter in asynchronous circuits: In our previous work we investigated the temporal behavior of self-timed (delay insensitive) circuits not only on an experimental basis, but also from a theoretical point of view. To fully understand the complex characteristics of logic circuits it is necessary to separate and classify the sources and manifestations of jitter. While jitter terminology and measurement techniques are well established for synchronous designs, measuring jitter effects in asynchronous circuits significantly differs in that no reference values are available. After all, it is a desired property of asynchronous logic to adapt its speed of operation to the given conditions. We therefore define execution period jitter, or just execution jitter, to be the variation in the durations of a specific LEDR register. The inherent handshaking guarantees the average rate of phase changes for all coupled registers to be the same. However, due to the fact that LEDR circuits are "elastic", there may be substantial short-term differences in the execution speeds of different pipeline stages.
From an abstract point of view, we can categorize jitter in two major groups. On the one hand, systematic jitter describes all effects that can be reproduced by our system setup. On the other hand, random jitter is observed if the timing variations are not controllable by means of system setup. The following classification can be made for systematic effects:
• Data-Dependent Execution Jitter (DDEJ) deals with cases where the actual data values induce (systematic) jitter on a signal (circuit state, Simultaneous Switching Noise [24], . . . ).
• Consequently, Data-Independent Execution Jitter (DIEJ) subsumes all non-data-dependent systematic jitter effects (e.g., global changes of temperature and voltage).
2) Timing Model: Keeping the above classification of jitter sources in mind, we can examine sources of data-dependent and random jitter from a logic designer's point of view. With the resulting model we can track the sources of data-dependent jitter to their roots. Figure 4 illustrates the remarkable effects of data-dependent execution jitter. It shows the jitter histogram of a free-running, self-timed, 4-bit counter. The solid black line represents a histogram obtained by simulation of the proposed timing model. As some of the humps are very close together, their superpositions often appear as a single peak only. If we had turned random jitter off entirely in the simulation, we could see 16 sharp peaks in the graph (one for each counter value). On the other hand, the filled area shows the jitter histogram taken from an FPGA measurement. Again, the peaks are superpositions of different delays caused by the different counter values. The differences between simulation and actual measurement



can be explained by the fact that placement and routing information of logic gates is not considered for simulation (i.e., we perform pre-layout simulations only). However, the results are still accurate enough to allow for identification of the main sources of jitter, possible bottlenecks, and a quantitative estimation of the expected jitter characteristics.

Figure 4. Jitter histogram simulated/measured, 4-bit counter.

3) Time Base Generation: In order to allow for reliable TTP communication, the resulting asynchronous controller must have a precise notion of time. As there is no reliable reference time available in the asynchronous case, we design a circuit that uses the TTP communication stream to derive a suitable, stable time-base. We construct an adjustable tick-generator and periodically synchronize it to incoming message-bits. In our configuration, the bit-stream of TTP uses Manchester coding, thus there is at least one signal transition for each bit which we can potentially use for recalibration. The Manchester encoding is a line code which represents the logical values 0 and 1 as falling and rising transitions, respectively. Consequently, each bit is transmitted in two successive symbols, thus the needed communication bandwidth is double the data rate. The top part of Figure 5 shows three bits of an exemplary Manchester coded signal, whereby the transitions at 50% of the bit-time define the respective logical values. This encoding scheme has the advantage of being self-clocking, which means that the clock signal can be recovered from the bit stream. From an electrical point of view, Manchester encoding allows for DC-free physical interfaces.

Figure 5. Manchester code with sampling points.

Figure 5 further illustrates the properties that our design needs to fulfill. As already mentioned, Manchester coding uses two symbols to transmit a single bit, thus the "feature-size" τref of the communication stream is half the actual bit-time τbit. It can also be seen that the sampling points need to be located at 25% and 75% of τbit, respectively. We intend to achieve this quarter-bit alignment by doubling the generated tick-frequency (τgen = τref/2). Consequently, each rising edge of signal ref-time defines an optimal sampling point. As our circuit is implemented asynchronously, the generated reference signal will be subject to jitter. Furthermore, temperature and voltage fluctuations will also change the reference's signal period. It is therefore necessary to make the circuit self-adaptive to changing operating conditions.
The basic structure of the circuit is shown in Figure 6. As one can see, the interface of our design is quite simple. There is only one input (bus-line, the receive-line of the TTP bus), as well as one output (ref-time, an asynchronously generated signal with known period). The dashed components and the MUX have been added to the circuit to allow rate correction on a by-bit basis in combination with the absolute measurement of τref during the first bit of a message (SOF, Start-of-Frame). If the control block detects the SOF signature, it resets the free-running counter unit. The asynchronous counter periodically increments its own value at a certain (variable) rate, which mainly depends on the circuit structure, placement and routing, and environmental conditions. After time τref, the corresponding end of SOF will eventually be detected by the control-block. As a consequence, the current counter value is preserved in register ref-val (reference value) and the counter is restarted. The controller is now able to reproduce the measured low-period τref by periodically counting from zero to ref-val, and generating a transition on ref-time for each compare-match. In order to achieve the 25%/75% alignment, we double the output frequency by simply halving ref-val.

Figure 6. Basic structure of the timer-reference generation circuit.

IV. EXPERIMENTAL RESULTS

In this section we will present a detailed analysis of the experiments we performed with the proposed circuit from Figure 6. We will vary temperature and operating voltage and monitor the generated time reference under these changing conditions. Simultaneously, we will also evaluate the robustness and effectiveness of the three compensation mechanisms implemented in the final design:
1) Low-Level State Correction: Measuring period τref of the SOF sequence retrieves an absolute measure of the reference time. However, as only one measurement is performed per message, quantization errors and other systematic, data-dependent delay variations significantly restrict the achievable precision. The possible resolution depends on the speed of the free-running counter, and



is at about 25 ns for our current implementation³.
2) Low-Level Rate Correction: As the Manchester code always provides a signal transition at 50% of τbit, we can continuously adapt the measured reference value ref-val. We only allow small changes to ref-val: it is either incremented or decremented by one, depending on whether the expected signal transitions are late or early, respectively. The advantage of this additional correction mechanism is that quantization errors and data-dependent effects are averaged over time, thus increasing precision.
3) High-Level Rate Correction: The software stack controlling the message transmission unit can add another level of rate correction. As it knows the expected (from the MEDL) and actual (from the transceiver unit) arrival times of messages, the difference of both can be used to calculate an error term. High-level services and message transmission can in turn be corrected by this term to achieve even better precision. The maximum resolution which can be achieved by this technique depends on the baud-rate, and is half a bit-time.

Figure 7. Allan-Variance.

Figure 8. ref-val vs. timer reference period for temperature-tests.
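The low-level rate correction (mechanism 2 above) can be sketched behaviorally. This is a minimal illustration of the nudge-by-one policy described in the text; the function `rate_correct` and its tick-count interface are illustrative assumptions, not the paper's actual hardware.

```python
# Behavioral sketch of low-level rate correction: after each observed
# mid-bit transition, ref-val moves by at most one count toward the
# true bit timing, so quantization errors average out over many bits.

def rate_correct(ref_val, expected_ticks, observed_ticks):
    """Nudge ref_val one step toward the observed timing."""
    if observed_ticks > expected_ticks:
        return ref_val + 1   # transitions arrive late -> lengthen the period
    if observed_ticks < expected_ticks:
        return ref_val - 1   # transitions arrive early -> shorten the period
    return ref_val

# Drift scenario: the true bit time corresponds to 265 counter ticks,
# but the initial SOF measurement yielded 260. Repeated corrections
# converge on the true value and then hold it.
ref_val = 260
for _ in range(10):
    ref_val = rate_correct(ref_val, ref_val, 265)
```

Because each message bit contributes one bounded correction step, a single noisy measurement cannot derail the time base, while persistent drift (e.g., from temperature) is tracked within a few bits.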
Remark: We are well aware that the presented results can only be seen as a snapshot for our specific setup and technology. Changing the execution platform will certainly change the outcomes of our measurements, as jitter and the corresponding frequency instabilities mainly depend on the circuit structure and the used technology. However, from a qualitative point of view, our results are valid for other platforms and technologies as well, even if concrete measurements must be taken for a quantitative evaluation.

A. Time Reference Generator

Before we start with the message transmission unit, which implements all of the above compensation mechanisms, we want to take a closer look at the basic building block (cf. Figure 6). Clearly, compensation method (3) is not present, as we just investigate the time reference generation unit. This unit does not actually receive or transmit messages, it just generates signal ref-time out of the incoming signal transitions on the TTP bus. The measurement setup is fairly simple: There is a (synchronous) sender, which periodically sends Manchester coded messages. The asynchronous design uses these messages to generate its internal time-reference. All measurements have been taken while the bus was idle. This way, we can observe the circuit's capability of reproducing the measured duration without any disturbing state- or rate-correction effects. If not stated otherwise, the measurements are taken at ambient temperature and nominal supply voltage.

First we take a look at the frequency stability of ref-time and cDone. The first part of the Allan-plot in Figure 7, ranging from approximately 2 ∗ 10^−8 s to 10^−4 s on the x-axis, is obtained by monitoring the handshaking signal capture-done from a register cell. The second part, which starts at 3 ∗ 10^−6 s and thus slightly overlaps with capture-done, has been obtained by measuring ref-time. Notice that it is no coincidence that both parts in the figure almost match in the overlapping section: Signal ref-time is based upon the execution of the low-level hardware and is therefore directly coupled to the respective jitter and stability characteristics. It is obvious from the graph that the stability increases to about 10^−10 Hz for τ ≈ 10^−2 s. Furthermore, the reference signal is far more stable than the underlying generation logic (cDone), as periodically executing the same operations compensates data-dependent jitter and averages random jitter. Although the underlying low-level signals jitter considerably due to data-dependent jitter, the circuit's output is orders of magnitude more stable, as these variations are canceled out during the periodic executions.

3 Notice that FPGAs are not in any way optimized for LEDR circuits. Dual-rail encoding introduces not only considerable interconnect delays, but also significant area overhead compared to ordinary synchronous logic.

One of the major benefits of the proposed solution is its robustness to changing operating conditions, thus we additionally vary the environment temperature and observe the changes in the period of ref-time. We heat the system from room temperature to about 83◦C, and let it cool down again. Figure 8 compares ref-val to the signal period of ref-time. While the ambient temperature increases, ref-val steadily decreases from 265 down to 256. The period of ref-time makes an approximately 19ns step (the duration of a single execution step) each time ref-val changes. During the periods where the changes in execution speed cannot be compensated (because they are too small), ref-time slowly drifts away from the optimum at 5μs. Without any compensation measures the duration of ref-time would be about 5180ns at the maximum temperature, instead of being in the range of approximately 5μs ± 38ns (i.e. the duration of ± two execution steps), no matter what temperature. Notice that the performance of the self-timed circuit decreases by 3.5% at the maximum

ReCoSoC 2010 May 17-19, 2010, Karlsruhe, Germany 105


Figure 9. Cycle-to-cycle jitter (left), Jitter histogram (right).

Figure 10. Relative deviation from optimal sending slot and operating temperature.

Figure 11. Mean relative deviation from optimal sending point vs. baudrate.

temperature, which seems to be relatively low, but it certainly is a showstopper for reliable TTP communication.

Far more pronounced delay variations can be obtained by changing the core supply voltage. We applied 0.8V to 1.68V in steps of 20mV core voltage to our FPGA-board. This time the execution speed of our self-timed circuit increased, from about 80ns per step down to approximately 15ns per step, as shown in Figure 9(right). This plot illustrates the jitter histogram on the
y-axis versus the FPGA's core supply voltage on the x-axis. Thereby, the densities of the histogram are coded in gray-scale (the darker, the denser the distribution). It is evident from the figure that performance increases exponentially with the supply voltage. This illustration also shows other interesting facts: For one, almost all voltages have at least two separate humps in their histograms. These are caused by data-dependencies that originate in the different phases ϕ0,1. Furthermore, for low voltages, additional peaks appear in the histograms and the separations between the phases increase as well. This can be explained by data-dependent effects: the different delays through the logic stages are magnified while the circuit slows down. This property is better illustrated in Figure 9(left), where the cycle-to-cycle execution jitter is plotted over the supply voltage. The graph appears almost symmetrical about the x-axis, which is caused by the continuous alternation of phases.

We conclude that varying operating conditions not only affect the speed of asynchronous circuits, but also the respective jitter characteristics. In this perspective, slower circuits tend to have higher jitter, which is further magnified by increased quantization errors due to the low sampling rate.

B. Transceiver Unit

The message transmission unit will implement all three compensation methods mentioned at the beginning of Section IV. We intend to use this unit directly in the envisioned asynchronous TTP controller, thus we need to examine the gained precision of this design with respect to timeliness. The interface from the controlling (asynchronous) host to the sub-design of Section IV-A is realized as dual-ported RAM: Whenever the bus transceiver receives a message, it stores the payload in combination with the receive-timestamp4 in RAM and issues a receive-interrupt. Likewise, the host can request to transfer messages by writing the payload and the estimated sending time into RAM and asserting the transmit request.

4 We define the internal time to be the number of ticks of signal ref-time, i.e., the number of execution steps performed by the bus transceiver unit.

During reception of messages, the circuit can continuously recalibrate itself to the respective baudrate, as Manchester code provides at least one signal transition per bit. However, between messages and during the asynchronous node's sending slot, resynchronization is not possible. In these phases we need to rely on the correctness of ref-time. The Start-Of-Frame sequence of each message must be initiated during a relatively tight starting window, which is slightly different for all nodes and is continuously adapted by the TTP's distributed clock synchronization algorithm. Failing to hit this starting window is an indication that the node is out-of-sync.

As we are interested in the accuracy of hitting the starting window, we configured the controlling host in a way that it triggers a message-transmission 25 bittimes after the last bit of an incoming message. We simultaneously heated the system from room temperature to about 68◦C to check on the expected robustness against the respective delay variations. The results are shown in Figure 10, where the deviation from the optimal sending-point (in units of bittimes) and the operating temperature are plotted against time. Similar to Figure 8, one can see that while the circuit gets warmer (and thus slower), the deviation steadily increases. As soon as the accumulated changes in delay can be compensated by the low-level measurement circuitry (i.e., ref-val decreases), the mean deviation immediately jumps back to about zero. We can see in the figure that the timing error is in the range from approximately −0.1 to +0.2 bittimes, which will surely satisfy the needs of TTP.

The next property we are interested in is the circuit's behavior with respect to different baudrates. Although low bitrates have the advantage of minimizing the quantization error, jitter has much more time to accumulate compared to high data rates. It is thus not necessarily true that lower baudrates result in a more stable and precise time reference. On the other hand, if the data rate is too high, it is not possible to reproduce τref correctly, and even small changes of the reference value ref-val lead to large relative errors in the resulting signal period. The optimum baudrate will therefore be located somewhere between these extremes. Figure 11 illustrates this by plotting the mean deviations from the optimum sending points versus the bitrate (the "corridor" additionally shows the respective standard deviations). Notice that the y-axis shows the relative deviation in units of bit-times. Therefore, for example, the absolute deviation at the 1kHz bit-rate is more than 50 times larger than that at 50kHz. Clearly, TTP does not support baudrates as low as 1kHz. Reasonable data rates are at least 100kHz and above (up to 4Mbit/s for Manchester coding). Our current setup allows us to use 100kbit/s for communication with acceptable results. However, we hope to be able to achieve 500kbit/s in our final system setup (with a more sophisticated development platform and a further optimized design).

Finally, we take a look at the accuracy of the generated time reference for different baudrates. Figure 12 therefore shows the mean relative (again in units of bit-times) and the absolute (in seconds) deviations of the actual reference periods from their nominal values. For all baudrates, the relative deviations are within a range of approximately ±0.01 bit-times, or ±1%, while the absolute timing errors are significantly larger for baudrates below 50kbit/s.

Figure 12. Mean relative/absolute deviation from optimum bit-time.

V. CONCLUSION

In this paper we introduced the research project ARTS, provided information on the project goals and explained the concept of TTP. We proposed a method of using TTP's bit stream to generate an internal time reference (which is needed for message transfer and most high-level TTP services). With this transceiver unit for Manchester coded messages we performed measurements under changing operating temperatures and voltages. The results clearly show that the proposed architecture works properly. The results further indicate that the achievable precision is in the range of about 1%. This is not a problem while other (synchronous) nodes are transmitting messages, as resynchronization can be performed continuously. However, during message transmission, the design depends on the quality of the generated reference time. Our measurements show that we are able to hit the optimum sending point with a precision of approximately ±0.3 bit-times (assuming an interframe gap of 25 bits), which should be enough for the remaining nodes to accept the messages.

However, there still is much work to be done. The presented temperature and voltage tests are only a relatively small subset of the tests that can be performed. One of the most interesting questions concerns the dynamics of changing operating conditions: How rapidly and aggressively can the environment change for the asynchronous TTP controller to still maintain synchrony with the remaining system? It should be clear from our approach that an answer to this question can only be given with respect to the concrete TTP schedule, as message lengths, interframe gaps, baudrate, etc. directly influence the achievable precision of our solution. The next steps of the project plan include the integration of the presented transceiver unit into an asynchronous microprocessor, the implementation of the corresponding software stack, and the interface to the (external) application host controller. Once the practical challenges are finished, thorough investigations of the precision, reliability and robustness of our asynchronous controller will be performed.

REFERENCES

[1] C. J. Myers, Asynchronous Circuit Design. New York, NY, USA: Wiley-Interscience, John Wiley & Sons, Inc., 2001.
[2] J. Sparso and S. Furber, Principles of Asynchronous Circuit Design – A Systems Perspective. MA, USA: Kluwer Academic Publishers, 2001.
[3] N. Miskov-Zivanov and D. Marculescu, "A systematic approach to modeling and analysis of transient faults in logic circuits," in Quality of Electronic Design, 2009 (ISQED 2009), March 2009, pp. 408–413.
[4] H. Kopetz and G. Grundsteidl, "TTP – A Time-Triggered Protocol for Fault-Tolerant Real-Time Systems," in Symposium on Fault-Tolerant Computing, FTCS-23.
[5] H. Kopetz, Real-Time Systems: Design Principles for Distributed Embedded Applications. MA, USA: Kluwer Academic Publishers, 1997.
[6] M. E. Dean, T. E. Williams, and D. L. Dill, "Efficient Self-Timing with Level-Encoded 2-Phase Dual-Rail (LEDR)," in Proceedings of the 1991 University of California/Santa Cruz Conference on Advanced Research in VLSI. Cambridge, MA, USA: MIT Press, 1991, pp. 55–70.
[7] A. McAuley, "Four state asynchronous architectures," IEEE Transactions on Computers, vol. 41, no. 2, pp. 129–142, Feb. 1992.
[8] D. Linder and J. Harden, "Phased logic: supporting the synchronous design paradigm with delay-insensitive circuitry," IEEE Transactions on Computers, vol. 45, no. 9, pp. 1031–1044, Sep. 1996.
[9] M. Delvai, "Design of an Asynchronous Processor Based on Code Alternation Logic – Treatment of Non-Linear Data Paths," Ph.D. dissertation, Technische Universität Wien, Institut für Technische Informatik, Dec. 2004.
[10] K. Fant and S. Brandt, "NULL Convention Logic™: a complete and consistent logic for asynchronous digital circuit synthesis," in ASAP 96, Proceedings of the International Conference on Application Specific Systems, Architectures and Processors, Aug. 1996, pp. 261–273.
[11] I. E. Sutherland, "Micropipelines," Communications of the ACM, vol. 32, no. 6, pp. 720–738, Jun. 1989 (Turing Award lecture).
[12] M. Ferringer, "Coupling asynchronous signals into asynchronous logic," Austrochip 2009, Graz, Austria. http://www.vmars.tuwien.ac.at/php/pserver/extern/download.php?fileid=1735
[13] F. Bala and T. Nandy, "Programmable high frequency RC oscillator," in 18th International Conference on VLSI Design, Jan. 2005, pp. 511–515.



[14] M. Ferringer, G. Fuchs, A. Steininger, and G. Kempf, "VLSI Implementation of a Fault-Tolerant Distributed Clock Generation," in 21st IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems (DFT '06), Oct. 2006, pp. 563–571.
[15] S. Fairbanks and S. Moore, "Self-timed circuitry for global clocking," 2005, pp. 86–96.
[16] ——, "Analog micropipeline rings for high precision timing," in 10th International Symposium on Asynchronous Circuits and Systems, April 2004, pp. 41–50.
[17] V. Zebilis and C. Sotiriou, "Controlling event spacing in self-timed rings," in 11th IEEE International Symposium on Asynchronous Circuits and Systems (ASYNC 2005), March 2005, pp. 109–115.
[18] A. Winstanley, A. Garivier, and M. Greenstreet, "An event spacing experiment," in Eighth International Symposium on Asynchronous Circuits and Systems, 2002, Proceedings, April 2002, pp. 47–56.
[19] J. Ebergen, S. Fairbanks, and I. Sutherland, "Predicting performance of micropipelines using Charlie diagrams," Mar.–Apr. 1998, pp. 238–246.
[20] M. Shimanouchi, "An approach to consistent jitter modeling for various jitter aspects and measurement methods," in Proceedings of the IEEE International Test Conference, 2001, pp. 848–857.
[21] I. Zamek and S. Zamek, "Definitions of jitter measurement terms and relationships," in Proceedings of the IEEE International Test Conference (ITC 2005), Nov. 2005, 10 pp.
[22] D. W. Allan, N. Ashby, and C. C. Hodge, "The Science of Timekeeping," Application Note 1289, 1997. http://www.allanstime.com/Publications/DWA/Science Timekeeping/TheScienceOfTimekeeping.pdf
[23] D. Howe, "Interpreting oscillatory frequency stability plots," in IEEE International Frequency Control Symposium and PDA Exhibition, 2002, pp. 725–732.
[24] B. Butka and R. Morley, "Simultaneous switching noise and safety critical airborne hardware," in IEEE Southeastcon 2009 (SOUTHEASTCON '09), March 2009, pp. 439–442.



First Evaluation of FPGA Reconfiguration for
3D Ultrasound Computer Tomography

M. Birk, C. Hagner, M. Balzer, N.V. Ruiter M. Huebner, J. Becker


Institute for Data Processing and Electronics Institute for Information Processing Technology
Karlsruhe Institute of Technology Karlsruhe Institute of Technology
Karlsruhe, Germany Karlsruhe, Germany
{birk, balzer, nicole.ruiter}@kit.edu {michael.huebner, becker}@kit.edu

Abstract—Three-dimensional ultrasound computer tomography is a new imaging method for early breast cancer diagnosis. It promises reproducible images of the female breast in a high quality. However, it requires a time-consuming image reconstruction, which is currently executed on one PC. Parallel processing in reconfigurable hardware could accelerate signal and image processing. This paper evaluates the applicability of the FPGA-based data acquisition (DAQ) system for computing tasks by exploiting reconfiguration features of the FPGAs. The obtained results show that the studied DAQ system can be applied for data processing. The system had to be adapted for bidirectional data transfer and process control.

Keywords—Altera FPGAs, Reconfigurable Computing, 3D Ultrasound Computer Tomography

I. INTRODUCTION

Breast cancer is the most common type of cancer among women in Europe and North America. Unfortunately, an early breast cancer diagnosis is still a major challenge. In today's standard screening methods, breast cancer is often initially diagnosed after metastases have already developed [1]. The presence of metastases decreases the survival probability of the patient significantly. A more sensitive imaging method could enable detection in an earlier state and thus enhance survival probability.

At the Institute for Data Processing and Electronics (IPE) a three-dimensional ultrasound computer tomography (3D USCT) system for early breast cancer diagnosis is being developed [2]. This method promises reproducible volume images of the female breast in 3D.

Initial measurements of clinical breast phantoms with the first 3D prototype showed very promising results [3, 4] and led to a new optimized aperture setup [5], which is currently being built and is shown in Figure 1. It will be equipped with over 2000 ultrasound transducers, in particular 628 emitters and 1413 receivers. Further virtual positions of the ultrasound transducers will be created by rotational and translational movement of the complete sensor aperture.

In USCT, the interaction of unfocused ultrasonic waves with an imaged object is recorded from many different angles and afterwards computationally focused in 3D. During a measurement, the emitters sequentially send an ultrasonic wave front, which interacts with the breast tissue and is recorded by the surrounding receivers as pressure variations over time. These data sets, also called A-Scans, are sampled and stored for all possible sender-receiver combinations, resulting in over 3.5 million data sets and 20 GByte of raw data.

For acquisition of these A-Scans, a massively parallel, FPGA-based data acquisition (DAQ) system is utilized. After DAQ, the recorded data sets are transferred to an attached computer workstation for time-consuming image reconstruction steps. The reconstruction algorithms need a significant acceleration by a factor of 100 to be clinically relevant.

A promising approach to accelerate image reconstruction is parallel processing in reconfigurable hardware. This preliminary work investigates the applicability of the above mentioned DAQ system for further data processing tasks by a reconfiguration of the embedded FPGAs.

Figure 1. Technical drawing of the new semi-ellipsoidal aperture. It will be equipped with 628 ultrasound senders and 1413 receivers. During measurement, emitters sequentially send an ultrasonic wave front, which interacts with the breast and is recorded by the surrounding receivers.
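The quoted data volume can be cross-checked with simple arithmetic. In the Python sketch below, the emitter and receiver counts are from the text, while the number of aperture positions and the per-A-Scan size are illustrative assumptions chosen only to show how totals of this magnitude arise:

```python
# Back-of-the-envelope check of the quoted data volume. Emitter and
# receiver counts are from the paper; the number of aperture positions
# and the per-A-Scan size are illustrative assumptions.
EMITTERS = 628
RECEIVERS = 1413
APERTURE_POSITIONS = 4          # assumed rotational/translational moves

a_scans = EMITTERS * RECEIVERS * APERTURE_POSITIONS
bytes_per_a_scan = 6 * 1024     # assumed samples-per-A-Scan * sample width

raw_data_gbyte = a_scans * bytes_per_a_scan / 1024**3
```

With four assumed aperture positions this yields roughly 3.55 million A-Scans and on the order of 20 GByte of raw data, matching the figures stated above.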



The remainder of this paper is structured as follows: Section II describes the investigated FPGA-based DAQ system in detail. In Section III, the examined reconfiguration methodology is discussed. This includes derived design considerations and necessary system adaptations. Section IV illustrates an experimental procedure, which was used as proof of functionality of the used reconfiguration methodology and as a performance test of the DAQ system architecture. The paper is concluded in Section V. Therein, the attained performance results and limiting factors are discussed. Section VI gives an outlook into future work in this field of research.

II. DATA ACQUISITION SYSTEM

The investigated data acquisition (DAQ) system has been developed at IPE as a common platform for multi-project usage, e.g. in the Pierre Auger Observatory [6] and the Karlsruhe Tritium Neutrino Project [7], and has also been adapted to the needs of 3D USCT. The DAQ system is described in detail in the following subsections.

A. Setup & Functionality

In the USCT configuration, the DAQ system consists of 21 expansion boards: one second level card (SLC) and 20 identical first level cards (FLC). Up to 480 receiver signals can be processed in parallel by processing 24 channels on each FLC, resulting in a receiver multiplex-factor of three. The complete system fits into one 19" crate, which is depicted in Figure 2. The SLC is positioned in the middle, between 10 FLCs to the right and left, respectively.

The SLC controls the overall measurement procedure. It triggers the emission of ultrasound pulses and handles data transfers to the attached reconstruction PC. It is equipped with one Altera Cyclone II FPGA and a processor module (Intel CPU, 1 GHz, 256 MB RAM) running a Linux operating system.

Figure 2. Image of the DAQ system in the USCT configuration. It is composed of one Second Level Card (SLC) for measurement control and communication management (middle slot) and 20 First Level Cards (FLC) for parallel sensor signal acquisition and data storage.

Communication with the attached PC is either possible via Fast Ethernet or a USB interface. For communication between the SLC and the FLCs within the DAQ system a custom backplane bus is used.

B. First Level Card

A FLC consists of an analogue and a digital part. Only the digital part will be considered throughout this paper. A block diagram of this part is given in Figure 3. Besides three 8fold ADCs for digitization of the 24 assigned receiver channels, one FLC is equipped with four Altera Cyclone II FPGAs, which are used for different tasks:

• Control FPGA (Cntrl FPGA): One FPGA is used as local control instance. It handles communication and data transfer to the other FPGAs and to the SLC via the backplane bus.

• Computing FPGA (Comp FPGA): The three other FPGAs are used for actual signal acquisition. Each of these is fed by one ADC and thus processes 8 receiver channels in parallel.

Figure 3. Block diagram of the digital part of an FLC in the 3D USCT DAQ system. It is equipped with four Altera Cyclone II FPGAs. One is used for local control (control FPGA, Cntrl FPGA) and three for signal acquisition (computing FPGAs, Comp FPGA). Each Comp FPGA is fed by an 8fold ADC and is attached to a 2 MB QDR static RAM. The Cntrl FPGA is attached to a 2 GB DDRII dynamic RAM. There are two separate means of communication between the FPGAs: the slow local bus (Local Bus, 80 MB/s) and a fast data link (Fast Link, 240 MB/s).

As intermediate storage for the acquired A-Scans, there are two different types of memory modules: Each computing FPGA is connected to a distinct static RAM module (QDRII, 2 MB each) and the control FPGA is attached to a dynamic RAM module (DDRII, 2 GB), summing up to a system capacity of 40 GB.
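The channel and memory figures above follow from simple arithmetic; a minimal sketch, with all constants taken from the system description:

```python
# System-level arithmetic for the USCT DAQ configuration described above.
FLC_COUNT = 20                  # first level cards in the crate
CHANNELS_PER_FLC = 24           # receiver channels processed per FLC
RECEIVERS = 1413                # physical ultrasound receivers

parallel_channels = FLC_COUNT * CHANNELS_PER_FLC        # 480 channels
multiplex_factor = -(-RECEIVERS // parallel_channels)   # ceiling division

DDR_PER_FLC_GB = 2              # one DDRII module per control FPGA
system_capacity_gb = FLC_COUNT * DDR_PER_FLC_GB
```

This reproduces the 480 parallel channels, the receiver multiplex-factor of three, and the 40 GB aggregate DDRII capacity quoted in the text.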



Figure 4. Data-flow of the DAQ system in the current configuration: the acquired signals from the ultrasound receivers are conditioned and afterwards digitized. They are digitally pre-filtered and decimated in the FPGAs and stored in on-board memory. After a complete measurement, they are transferred to an attached PC for further processing.

There are two separate means of communication between the control FPGA and the computing FPGAs, see also Figure 3: a slow local bus with a width of 32 bit (Local Bus, 80 MB/s) and 8 bit wide direct data links (Fast Links, 240 MB/s per computing FPGA). Additionally, there are several connections for synchronization on board.

III. METHODOLOGY

As outlined in Section I, 3D USCT promises high-quality volumetric images of the female breast and has therefore a high potential in cancer diagnosis. However, it includes a set of time-consuming image reconstruction steps, limiting the method's general applicability.

To achieve clinical relevance of 3D USCT, i.e. application in clinical routine, image reconstruction has to be accelerated by at least a factor of 100. A promising approach to reduce overall computation time is parallel processing of reconstruction algorithms in reconfigurable hardware.

In the current design, the DAQ system is only used for controlling the measurement procedure and acquisition of the ultrasound receiver signals. The overall data-flow of the DAQ system is shown in Figure 4. The acquired signals are conditioned and subsequently digitized, digitally pre-filtered and decimated in the FPGAs, and afterwards stored in on-board memory. After a complete measurement cycle, the resulting data is transferred to an attached PC for signal processing and image reconstruction.

In this preliminary work, the utilization of the FPGAs in the DAQ system for further processing tasks has been investigated. Due to resource limitations, the full set of the processing algorithms cannot be configured statically onto the FPGAs in an efficient manner, either alone or even less in combination with the abovementioned DAQ functionality.

Therefore, a reconfiguration of the FPGAs is necessary to switch between different configurations, enabling signal acquisition and further processing on the same hardware system.

As the DAQ system has not been designed for further processing purposes, the scope of this work was to identify its capabilities as well as architectural limitations in this regard.

Only a reconfiguration of the FPGAs on the FLCs has been investigated, since these hold the huge majority of FPGAs within the complete system. Therefore, only these cards and their data-flow are considered in the following sections. Furthermore, an interaction of different FLCs has not been considered in this preliminary study.

The hardware setup of a FLC was given in Section II and shown in Figure 3. The detailed data-flow on a FLC in conventional operation mode is shown in Figure 5. During DAQ, 24 receiver channels are processed per FLC. The signals are split into groups of 8. Every group is digitized in one ADC and fed into one computing FPGA. Within a FPGA, the signals are digitally filtered and averaged by means of the attached QDR memory. Finally, the measurement data is transmitted via the fast data links to the control FPGA, where it is stored in DDRII memory and afterwards transmitted to the SLC via the backplane bus.

Figure 5. Detailed data-flow on one FLC during the conventional acquisition mode: Every FLC processes 24 receiver channels in parallel, whereas a group of 8 signals is digitized in a single ADC. The resulting digital signals are digitally filtered and averaged in the computing FPGAs. Finally, the signals are transmitted to the control FPGA and stored in DDRII memory.

In this work, after completion of a measurement cycle, i.e. when the data is stored in DDRII memory, the FPGAs were reconfigured to switch from conventional acquisition to data processing mode. As depicted in Figure 6, instead of transmitting the data sets via the SLC to the attached PC, they were loaded back to QDRII memory and subsequently processed in the computing FPGAs. After completion, the resulting data was transmitted back to the control FPGA and again stored in DDRII memory.

Figure 6. Detailed data-flow on one FLC during the newly created processing mode: As the data sets were previously stored in DDRII memory, they are transferred back to QDRII memory and processed in the computing FPGAs. Finally, the resulting data is stored again in DDRII memory.

For providing this reconfiguration methodology, the following tasks had to be performed:

• Preventing data loss during reconfiguration
• Establishing communication and synchronization
• Implementing bidirectional communication interfaces
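The acquisition-to-processing round trip described above can be summarized as a toy data-flow model. The following simplified Python sketch uses buffer and function names of our own choosing, not taken from the actual firmware:

```python
# Simplified model of the FLC data path in the two operating modes.
# All names are illustrative; the real system moves data between the
# QDRII memories (computing FPGAs) and the DDRII memory (control FPGA).

def acquisition_mode(adc_samples, qdr, ddr):
    """DAQ mode: filter/average in a computing FPGA (via QDRII),
    then forward the result to DDRII on the control FPGA."""
    qdr[:] = [s // 2 for s in adc_samples]   # stand-in for filtering
    ddr.extend(qdr)                          # Fast Link -> control FPGA

def processing_mode(ddr, qdr, process):
    """Processing mode: load data back from DDRII to QDRII, process it
    in the computing FPGAs, and store the results in DDRII again."""
    qdr[:] = ddr[:]                          # DDRII -> QDRII (Local Bus)
    results = [process(x) for x in qdr]      # computing-FPGA work
    ddr[:] = results                         # QDRII -> DDRII

qdr, ddr = [], []
acquisition_mode([10, 20, 30], qdr, ddr)
processing_mode(ddr, qdr, process=lambda x: x + 1)
```

The point of the model is only the ordering of transfers: data survives the FPGA reconfiguration because it sits in DDRII between the two calls, exactly as required by the first bullet above.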



shown in Figure 8. In JTAG mode, each FPGA within the
chain has to be configured sequentially.

B. Communication and Synchronization


Another important task in establishing the described recon-
figuration methodology was to organize communication and
control on the FLC as well as synchronization of parallel
processing on the computing FPGAs.
As described in Section II, there are two means of commu-
Figure 7. Passive serial configuration of a the FPGAs on a FLC at start-up
time: first the control FPGA and then the computing FPGAs are configured nication between the computing FPGAs and the control FPGA,
in parall with data from an embedded configuration ROM. see also Figure 3: the slow local bus (Local Bus) and fast direct
data links (Fast Links).
In conventional DAQ operation mode, measurement data is
transmitted only in the direction from computing FPGAs to the
control FPGA. Unfortunately, due to operational constraints in
the FPGA pins, which are assigned for the fast links, this con-
nection can only be used in the abovementioned sense, i.e. un-
idirectional.
Thus, in processing mode, the slower Local Bus has to be used for data transfer, since only this connection allows bidirectional communication. The complete communication infrastructure is shown in Figure 9.

As the control FPGA is not reconfigured during operation, it must be statically configured to handle data transfer in each system state, i.e. in DAQ as well as in processing mode. Furthermore, it must be able to determine the current state in order to act appropriately. As also depicted in Figure 9, a single on-board spare connection (conf_state) is used for that purpose, which is connected to all four FPGAs.

In addition, each computing FPGA can be addressed and selected directly by the control FPGA via further point-to-point links to establish process control and synchronization. The respective chip_select signal triggers processing in a computing FPGA, and completion of processing is indicated to the control FPGA by the busy signal.

Figure 8. JTAG chain for reconfiguration of the FPGAs on a FLC. In JTAG reconfiguration mode, each FPGA can be selected separately for reconfiguration, whereas deselected FPGAs remain in normal operation mode. Reconfiguration of the FPGAs in the chain takes place sequentially.

A. Preventing data loss during reconfiguration

The DAQ system is built up of Altera Cyclone II FPGAs, which do not allow partial reconfiguration [8]. Therefore, the complete FPGA chip has to be reconfigured. To prevent a loss of measurement data during the reconfiguration cycle, all data has to be stored outside the FPGAs in on-board memory, i.e. QDRII or DDRII memory.

The QDRII is static memory, so stored data is not corrupted during reconfiguration of the FPGAs on the FLC. However, only the larger memory (DDRII) is capable of holding all the data sets recorded on one FLC. This dynamic memory module needs a periodic refresh cycle to keep stored data. On the FLC, the control FPGA is responsible for triggering these refresh cycles.
During a reconfiguration, this FPGA is not able to perform this task. Since the refresh interval of the dynamic memory module is in the order of a few microseconds, while a reconfiguration of the control FPGA takes about 100 ms even in the fastest mode [8], the control FPGA must not be reconfigured; otherwise the data in DDRII memory is lost.

Due to this requirement, only the three computing FPGAs are allowed to be reconfigured during operation. At a normal start-up of the DAQ system, all FPGAs on a FLC are configured via passive serial mode [8] with data from an embedded configuration ROM. As depicted in Figure 7, first the control FPGA and then all three computing FPGAs are configured in parallel in this mode.
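The timing mismatch behind this constraint can be made explicit with a quick calculation. The refresh interval below is an assumed illustrative value; the text only states "a few microseconds":

```python
# Why the control FPGA must keep running during reconfiguration:
# DDRII refresh deadlines arrive every few microseconds, while even the
# fastest full reconfiguration takes about 100 ms [8].
# REFRESH_INTERVAL_US is an assumption for illustration, not from the paper.

REFRESH_INTERVAL_US = 7.8        # assumed refresh spacing (illustrative)
RECONFIG_TIME_US = 100_000       # ~100 ms, fastest reconfiguration mode [8]

missed_refreshes = RECONFIG_TIME_US / REFRESH_INTERVAL_US
print(f"refresh deadlines missed during one reconfiguration: {missed_refreshes:.0f}")
# Thousands of missed refreshes mean the DDRII contents are lost, so only
# the three computing FPGAs may be reconfigured at runtime.
```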
In the current hardware setup it is not possible to exclude the control FPGA from a configuration in passive serial mode. Thus, each FPGA on the FLC has to be addressed and reconfigured separately, which is only possible in JTAG configuration mode [8] by using a JTAG chain through all four FPGAs as depicted in Figure 8.

Figure 9. Communication structure during processing mode on a FLC: bidirectional data transfer is only possible via the slower Local Bus (80 MB/s). Separate point-to-point links (chip_select & busy) are used for control and synchronization of parallel processing. A further single point-to-point link is connected to all four FPGAs and indicates the current system state, i.e. DAQ or processing mode.
112 May 17-19, 2010, Karlsruhe, Germany ReCoSoC 2010
Figure 10. Block diagram of a computing FPGA: defined interfaces for communication with the control FPGA (communication I/F) via Local Bus or Fast Links and access to QDRII memory (memory I/F). The variable algorithmic part is also indicated.

C. Communication Interfaces

A further task was structuring the communication and memory interfaces in the computing FPGAs. As a result, modular interfaces for transmitting data over the Local Bus (communication I/F) and storing data in QDRII memory (memory I/F) were created. Figure 10 shows a block diagram of these modules on the computing FPGAs. The modular design allows a simple exchange of algorithmic modules without the need to change further elements.

The communication interface is controlled by the control FPGA. It performs data transfers via the Local Bus during processing or via the fast data link during DAQ mode. The memory interface handles accesses to the QDRII memory. It can be accessed either by the control FPGA via the Local Bus or by the algorithmic modules. In the current configuration, an algorithmic module only interacts with the memory interface and thus only processes data which has already been stored in QDRII memory.

In order to guarantee a seamless data transfer over the Local Bus, the respective memory interface in the control FPGA had to be supplemented by a buffered access mode to the DDRII memory. When a data transfer is initialized, enough data words are pre-loaded into a buffer so that the transmission is not interrupted during a refresh cycle.

IV. EXPERIMENTAL RESULTS

The reconfigurable computing system was tested by acquisition of a test pulse. The pulse used was in the same frequency range as regular measurement data and was handled as a normal data set (A-Scan). This was followed by the reconfiguration of the computing FPGAs and an exemplary data processing. The main goals were to determine the required transfer time per data set over the Local Bus as well as the reconfiguration times.

A. Test setup

For functional validation and performance measurements, a reduced setup of the complete DAQ system, containing a SLC but only one FLC, was used. However, as in the current configuration only a single FLC without interactions with other FLCs has been considered, this setup allows a projection to a fully equipped system.

Figure 11. Detailed test procedure after DAQ and manual reconfiguration of the computing FPGAs via JTAG. First, computing FPGA A is supplied with a set of A-Scans and processing on this FPGA is started. While processing is underway, data transfer and processing on the computing FPGAs B and C are also initiated. After completion of processing on FPGA A, the resulting data is transferred back to DDRII memory and further unprocessed A-Scans are loaded. This scheme is applied repeatedly until all data sets are processed.

B. Detailed Test Procedure

The system has been tested as follows: at system start-up, the initial DAQ configuration was loaded into the FPGAs as outlined in Section IIa. Afterwards, the test pulse was applied at the inputs of the ADCs on the FLC. This pulse was digitized and finally stored in DDRII memory.

The further detailed procedure is indicated in Figure 11. After a manual reconfiguration of the computing FPGAs via JTAG, the first set of A-Scans was transferred to the first computing FPGA (FPGA A) via the Local Bus, stored in its attached QDRII memory and subsequently processed. While data in this FPGA is being processed, the other two computing FPGAs (FPGA B and FPGA C) are supplied with their initial data sets and processing on these FPGAs is started.

After completion of processing in FPGA A, the resulting A-Scan data was transmitted back to DDRII memory and subsequently further unprocessed A-Scans were sent to this FPGA. This scheme is applied repeatedly until all A-Scans have been processed.

C. Results

A JTAG configuration of a single computing FPGA requires 1.8 s, resulting in a reconfiguration time of 5.4 s for one FLC when only the three computing FPGAs are configured in a JTAG chain. A reconfiguration of all 60 computing FPGAs, distributed over the 20 FLCs in the complete DAQ system, would take up to 2 minutes by building up a JTAG chain through all FPGAs. The determined JTAG reconfiguration times are listed in Table I.

TABLE I. JTAG RECONFIGURATION TIMES

  Procedure                                 Required time
  Reconf. of one computing FPGA             1.8 s
  Reconf. of one FLC                        5.4 s
  Extrapolated reconf. of the DAQ system    ~2 min
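The figures in Table I follow directly from the per-device configuration time; a quick sketch of the arithmetic:

```python
# Reconfiguration-time arithmetic behind Table I.

T_FPGA = 1.8                    # JTAG configuration of one computing FPGA [s]
FPGAS_PER_FLC = 3               # computing FPGAs chained per FLC
FLCS = 20                       # FLCs in the complete DAQ system

t_flc = FPGAS_PER_FLC * T_FPGA          # 5.4 s per FLC
t_system = FLCS * t_flc                 # 108 s (~2 min) with one long chain

# The conclusions suggest one JTAG chain per FLC, driven concurrently,
# which would bring the whole system back down to the per-FLC time:
t_concurrent = t_flc

print(f"one FLC: {t_flc:.1f} s, full system (serial chain): {t_system:.0f} s")
```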
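The overlap of transfer and processing in the test procedure of Figure 11 can be captured in a one-line condition. This is a sketch: the 75 us figure is the Local Bus transfer time measured in the results, and the processing times passed in are illustrative:

```python
# When is the shared Local Bus hidden behind computation?
# Each data set needs one inbound and one outbound transfer, and the single
# bus serves all three computing FPGAs, so the parallelized processing time
# per set must exceed 2 * 3 = 6 transfer times (450 us for a 75 us transfer).

TRANSFER_US = 75                # one data set, one direction, Local Bus

def bus_hidden(parallelized_processing_us, n_fpgas=3):
    """True if transfers to/from all FPGAs fit inside one processing slot."""
    return parallelized_processing_us > 2 * n_fpgas * TRANSFER_US

print(bus_hidden(500))   # processing long enough -> transfers hidden
print(bus_hidden(300))   # too short -> FPGAs stall on the bus
```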
TABLE II. COMPUTING FPGA OCCUPATION

                           Configuration
  Components               Data Acquisition   Data Processing   Comm & Mem I/F only
  Logic Elements           5766 (17%)         9487 (29%)        1231 (4%)
  Embedded Multipliers     68 (97%)           64 (91%)          0 (0%)
  Memory Bits              4236 (<1%)         393268 (81%)      0 (0%)
The transfer of one data set via the Local Bus in either direction, i.e. from control FPGA to computing FPGA or vice versa, takes 75 us. Usage of the Local Bus is limited to one computing FPGA at a time, and the same bus is used for data transfer to and from all three computing FPGAs.

Assuming the applied data-parallel processing strategy, i.e. each computing FPGA performs the same computation on a different data set, a high efficiency can only be reached if the following condition holds: the parallelized processing time per A-Scan on a computing FPGA has to be longer than 450 us, which is 6 times the transmission time of a single data set. In this context, the parallelized time is the processing time per A-Scan on one computing FPGA divided by the number of concurrently processed A-Scans on this FPGA. In this case, the transfer time to and from all three FPGAs could be hidden. As observable in Figure 11, this requirement was not completely fulfilled by our exemplary data processing.

Table II outlines the occupation of the computing FPGA during the test procedure. The extensive use of embedded multipliers in DAQ mode, which are required due to the hard real-time constraints, states a clear demand for the established reconfiguration methodology. Furthermore, the implemented communication and memory interfaces are lightweight, occupying only 4% of the device's logic elements.

As the main result of this preliminary work, the possibility of reusing the existing DAQ system for data processing has been shown. By the reconfiguration of the FPGAs, the functionality of the complete system has been increased.

V. CONCLUSIONS & DISCUSSION

In this paper, the concept of a reconfigurable computing system based on an existing DAQ system for 3D USCT has been presented.

The main drawback of the existing system is the slow data transfer over the Local Bus, which limits the achievable performance during the processing phase. This issue could partly be resolved by using a modified communication scheme, where data transfers from the computing FPGAs to the control FPGA are still done via the Fast Links as in DAQ mode and the Local Bus is only used for the opposite direction.

Likewise, due to the long reconfiguration time, the number of reconfiguration cycles contributes essentially to the total processing time. To which extent this constraint will restrict the applicability of the presented methodology cannot be assessed at this point and needs further investigation. However, the reconfiguration time could be significantly reduced by separate JTAG chains for each FLC and concurrent reconfiguration.

VI. OUTLOOK

For future work, two obvious aspects have already been derived in the last section, namely reducing the data transfer time by a modified communication scheme in the processing phase and reducing the reconfiguration time by parallel JTAG chains.

Further tasks will also be porting processing algorithms to the DAQ system and thus evaluating the established reconfiguration ability in a real application.

What has not been considered so far is a direct communication between the computing FPGAs on a FLC and an interaction of different FLCs in general. This would open up manifold implementation strategies for algorithmic modules besides the applied data-parallel scheme.

In the long term, the next redesign of the DAQ system will put special focus on processing capabilities, e.g. high-speed data transfer.

VII. REFERENCES

[1] D. van Fournier, H.J.H.-W. Anton, G. Bastert, "Breast cancer screening", in P. Bannasch (Ed.), Cancer Diagnosis: Early Detection, Springer, Berlin, 1992, pp. 78-87.
[2] H. Gemmeke and N. V. Ruiter, "3D ultrasound computer tomograph for medical imaging," Nucl. Instr. Meth., 2007.
[3] N.V. Ruiter, G.F. Schwarzenberg, M. Zapf and H. Gemmeke, "Conclusions from an experimental 3D Ultrasound Computer Tomograph," Nuclear Science Symposium, 2008.
[4] N.V. Ruiter, G.F. Schwarzenberg, M. Zapf, A. Menshikov and H. Gemmeke, "Results of an experimental study for 3D ultrasound CT," NAG/DAGA Int. Conf. on Acoustics, 2009.
[5] G.F. Schwarzenberg, M. Zapf and N.V. Ruiter, "Aperture optimization for 3D ultrasound computer tomography," IEEE Ultrasonics Symposium, 2007.
[6] H. Gemmeke, et al., "First measurements with the Auger fluorescence detector data acquisition system," 27th International Cosmic Ray Conference, 2001.
[7] A. Kopmann, et al., "FPGA-based DAQ system for multi-channel detectors," Nuclear Science Symposium Conference Record, 2008.
[8] Altera Corporation, "Cyclone II Device Handbook", 2007.
ECDSA Signature Processing over Prime Fields for
Reconfigurable Embedded Systems
Benjamin Glas, Oliver Sander, Vitali Stuckert, Klaus D. Müller-Glaser, and Jürgen Becker
Institute for Information Processing Technology (ITIV)
Karlsruhe Institute of Technology (KIT)
Karlsruhe, Germany
Email: {glas,sander,becker,kmg}@itiv.uni-karlsruhe.de
Abstract—Growing ubiquity and safety relevance of embedded systems strengthen the need to protect their functionality against malicious attacks. Communication and system authentication by digital signature schemes is a major issue in securing such systems. This contribution presents a complete ECDSA signature processing system over prime fields on reconfigurable hardware. The flexible system is tailored to serve as an autarchic subsystem providing authentication transparently for any application. Integration into a vehicle-to-vehicle communication system is shown as an application example.

I. INTRODUCTION

With the emerging ubiquity of embedded electronic systems and a growing share of distributed systems and functions even in safety-relevant areas, the security of embedded systems and their communication quickly gains importance. One major concern of security is the authenticity of communication peers and of the information exchange. Especially if many different remote participants have to communicate or not all participants are known in advance, asymmetric signature schemes are beneficial for authentication purposes. In contrast to symmetric schemes like the Keyed-Hash Message Authentication Code HMAC [1], asymmetric signature schemes like RSA [2], DSA [3] and the ECDSA scheme [3] considered in this contribution get along without key exchange or predistributed keys, usually relying on a certification authority as trusted third party instead.

This benefit comes at the cost of a much greater computational complexity of these schemes compared to authentication techniques based on symmetric ciphers or solely on hashing. This imposes major problems especially for embedded systems where resources are scarce.

This contribution presents a hardware-implemented system for complete prime field ECDSA signature processing on FPGAs. It can be integrated as an autarchic subsystem for signature processing in embedded systems. As an application example, the integration into a vehicle-to-vehicle communication system is presented.

The remainder of this paper is organized as follows. In the following section II some related work is given, section III presents basics of the implemented signature scheme ECDSA, and section IV outlines the assumed situation and requirements for the system. The structure and implementation of the signature system itself is presented in section V, and section VI shows an application example and the integration in a wireless communication system. Section VII details performance and resource usage. The contribution is concluded in section VIII.

II. RELATED WORK

Since elliptic curves were proposed as a basis for public key cryptography in 1985 by Koblitz [4] and Miller [5] independently, many implementations of prime field ECDSA and elliptic curve cryptography (ECC) in general have been published. Software implementations on general purpose processors need a lot of computation power. The eBACS ECRYPT benchmark [6] gives values for 256 bit ECDSA of e.g. 1.88 ms for generation and 2.2 ms for verification on an Intel Core 2 Duo at 1.4 GHz, and 2.9 ms resp. 3.4 ms on an Intel Atom 330 at 1.6 GHz. Values for a crypto system based on an ARM7 32-bit microcontroller are given in [7] for a key bitlength of 233 bit. Using a comb table precomputation (w = 4), 742 ms are needed for a generation and 1240 ms for a verification of an ECDSA signature.

To achieve usable throughputs and latencies on embedded systems, various specialized hardware has been proposed, e.g. many approaches for implementation of Fp arithmetic and the ECC primitives point add and point double on reconfigurable hardware. A survey of hardware implementations can be found in [8]. Güneysu et al. present in [9] a very fast approach based on special DSP FPGA slices. The implementation presented here is based on an Fp ALU presented by Ghosh et al. [10].

Nevertheless, open implementations of complete signature processing units performing full ECDSA are scarce. Järvinen et al. [11] present a Nios II based ECDSA system on an Altera Cyclone II FPGA for a key length of 163 bit, performing signature generation in 0.94 ms and verification in 1.61 ms.

This contribution presents an FPGA-based autarchic ECDSA system for longer key lengths of 256 bit containing all necessary subsystems for application in embedded systems on reconfigurable hardware.

III. ECDSA FUNDAMENTALS

The Elliptic Curve Digital Signature Algorithm ECDSA is based on group operations on an elliptic curve E over a finite field Fq. Mostly two types of finite fields are technically used:
fields F2n of characteristic two and prime fields Fp with large primes p.

Algorithm 1 ECDSA signature generation
Input: Domain parameters D = (q, a, b, G, n, h), secret key d, message m
Output: Signature (r, s)
1: Choose random k ∈ [1, n − 1], k ∈ N
2: Compute kG = (x1, y1)
3: Compute r = x1 mod n. If r = 0 goto step 1.
4: Compute e = H(m)
5: Compute s = k^−1 (e + dr) mod n. If s = 0 goto step 1.
6: return (r, s).

For the use of ECDSA, a set of common domain parameters needs to be known to all participants. These are the modulus q identifying the underlying field, the parameters a, b defining the elliptic curve E used, a base point G ∈ E, the order n of G and the cofactor h = order(E)/n. The signature generation and verification for a key pair (Q, d) can then be performed using the secret key d or the public key Q respectively. The procedures needed are shown in algorithms 1 and 2.

Algorithm 2 ECDSA signature verification
Input: Domain parameters D = (q, a, b, G, n, h), public key Q, message m, signature (r, s).
Output: Acceptance or rejection of the signature
1: if ¬(r, s ∈ [1, n − 1] ∩ N) then
2:   return "reject"
3: end if
4: Compute e = H(m)
5: Compute w = s^−1 mod n.
6: Compute u1 = ew mod n and u2 = rw mod n.
7: Compute X = (xX, yX) = u1 G + u2 Q.
8: if X = ∞ then
9:   return "reject"
10: end if
11: Compute v = xX mod n.
12: if v = r then
13:   return "accept"
14: else
15:   return "reject"
16: end if

IV. SETUP AND SITUATION

We assume an embedded system communicating with several peers which are not entirely known in advance. Therefore the exchanged signed messages are sent with an attached certificate that is issued by a commonly trusted certification authority.

This contribution focuses on prime field ECDSA as it is proposed also for the application example. Implemented are especially two elliptic curves recommended by the U.S. National Institute of Standards and Technology (NIST) in [12], namely the curves p224 and p256 with bitlengths 224 and 256 respectively, and the corresponding domain parameters also given in the standard.

The proposed system works as a security subsystem exclusively performing signature processing, passing and receiving messages m to and from the external system.

V. SIGNATURE PROCESSING SYSTEM

Processing of ECDSA consists of several layers of computation. On the top level, the signature generation and verification algorithms as well as the certificate validation are performed. This signature-scheme-dependent layer is based on the group operations point add (PA) and point double (PD) in the underlying elliptic curve. These are in turn based on the underlying finite prime field (Fp) arithmetic, that is, modular arithmetic modulo a prime p. On an even higher layer, there is also the communication protocol to consider, at least partially, as needed for the signature system.

Fig. 1. Overview of the signature system

The architecture and presentation of the system reflect this layering. The two upper layers are implemented as finite state machines (FSM) and make use of a basic Fp arithmetic logic unit (ALU) and some additional auxiliary modules. Figure 1 outlines the structure of the system. The different building blocks are detailed in the following paragraphs.

A. Fp modular ALU

The central processing is done by a specialized Fp-ALU for primes of at most 256 bit length. It is based on the ALU proposed by Ghosh et al. in [10]. Figure 2 depicts the structure. The ALU contains one Fp adder, subtractor, multiplier and divider/inverter each. All registers and datapaths between the modules are 256 bit wide so that complete operands up to 256 bit width (as in the p256 case) can be stored and transmitted within a single clock cycle. Four inputs, two outputs and four combined operand/result registers as well as a flexible interconnect allow two operations to be started at the same time as long as they do not use the same basic arithmetic units. The units perform operations independently, so that, using different starting points, parallel execution in all four subunits is possible. This allows for parallelisation in the scalar multiplication (see paragraph V-B1).

The Fp-adder and -subtractor perform each operation in a single clock cycle as a general addition/subtraction with subsequent reduction. The Fp multiplying module computes the modular multiplication iteratively as shift-and-add with reduction mod p in every step. It therefore needs |p| clock cycles for one modular multiplication, |p| being the bitlength
of the modulus and thereby also the maximum bitlength of the operands.

Fig. 2. Schematic overview of the Fp ALU

Modular inversion and division is the most complex task of the ALU. It is based on a binary division algorithm on Fp; see [10] for details. The runtime depends on the input values, the maximum runtime being 2|p| clock cycles, in the p256 case therefore up to 512 cycles. Statistical analysis showed an average runtime of 1.5 · |p| clock cycles.

ALU control is performed over multiplexer and module control wires and is implemented as a finite state machine presented in the following paragraph. The complete ALU allocates 14256 LUT/FF pairs in a Xilinx Virtex-5 FPGA and allows for a maximum clock frequency of 41.2 MHz (after synthesis).

In addition to the 256 bit arithmetic based on the modulus p256, the ECDSA unit also implements the arithmetic for modulus p224. This can be done using the same hardware and is also implemented in the overlaying FSM. Theoretically, all moduli up to 256 bit width are supported by the ALU. Details on resource consumption and performance values are given in section VII.

B. Elliptic curve processing

On the elliptic curve E, addition of points is defined as the group operation. Doubling of a point is implemented specially, as it requires a different computation because general point addition is not defined with equal operands. A comprehensive introduction to elliptic curve arithmetic including algorithms can be found in [13].

The operation schedules for point addition and point doubling for execution on the ALU are given in tables I and II. The execution schedules map the operations to the executing units using three auxiliary registers t1, t2, t3 for storing intermediate results.

TABLE I. HARDWARE EXECUTION OF POINT ADDITION

  Step                                  Fp unit    # cycles
  1. t1 = y2 − y1                       sub        1
  2. t2 = x2 − x1                       sub        1
  3. t2 = t1/t2 (= λ); t3 = x1 + x2     div; add   max. 2|p|
  4. t1 = t2 · t2                       mult       |p|
  5. t1 = t1 − t3 (= x3)                sub        1
  6. t1 = x1 − t1                       sub        1
  7. t1 = t2 · t1                       mult       |p|
  8. t1 = t1 − y1 (= y3)                sub        1
                                        total      max. 4|p| + 5

TABLE II. HARDWARE EXECUTION OF POINT DOUBLING

  Step                                  Fp unit    # cycles
  01. t1 = x1 · x1                      mult       |p|
  02. t2 = t1 + t1                      add        1
  03. t1 = t1 + t2                      add        1
  04. t1 = t1 + a                       add        1
  05. t2 = y1 + y1                      add        1
  06. t2 = t1/t2 (= λ); t3 = x1 + x1    div; add   max. 2|p|
  07. t1 = t2 · t2                      mult       |p|
  08. t1 = t1 − t3 (= x3)               sub        1
  09. t1 = x1 − t1                      sub        1
  10. t1 = t2 · t1                      mult       |p|
  11. t1 = t1 − y1 (= y3)               sub        1
                                        total      max. 5|p| + 7

1) Scalar multiplication on E: Scalar multiplication is the central step 2 of the signature generation algorithm 1. Computation is done iteratively using the so-called Montgomery ladder [14], [15] shown in algorithm 3.

Algorithm 3 Scalar multiplication in E
Input: Point P ∈ E; integer k = Σ_{i=0}^{l−1} k_i 2^i with k_i ∈ {0, 1} and k_{l−1} = 1.
Output: Point Q = kP ∈ E.
1: P1 = P
2: P2 = 2P
3: for i = l − 2 downto 0 do
4:   if k_i = 0 then
5:     P1_new = 2 P1_old
6:     P2_new = P1_old + P2_old
7:   else
8:     P1_new = P1_old + P2_old
9:     P2_new = 2 P2_old
10:  end if
11: end for
12: return Q = P1

The operations in the branches inside the FOR-loop, meaning steps 5 and 6 in the IF-branch resp. 8 and 9 in the ELSE-branch, can be executed in parallel. Since they are a point addition and a point doubling each, a real parallel execution on the ALU is possible using a tailored scheduling. Figure 3 depicts the implemented schedule. Execution time is therefore at maximum ((|p| − 1) · (6|p| + 7) + (5|p| + 7)) = 6|p|^2 + 6|p| clock cycles for the combination of point add and point double.

2) Double scalar multiplication: For verification of ECDSA signatures, two independent scalar multiplications have to be executed (see algorithm 2, step 7). Instead of computing them independently in sequence, it is faster to compute them together using an approach published originally by Shamir [16], also known as "Shamir's trick", shown in algorithm 4.
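Algorithm 3 can be sketched in software. The snippet below uses a small toy curve with hypothetical parameters chosen for illustration (not one of the NIST curves) and plain affine point arithmetic:

```python
# Montgomery ladder (algorithm 3) over a toy curve y^2 = x^3 + a*x + b (mod p).
# The parameters below are illustrative only, not from the paper.
p, a = 97, 2                     # toy prime field and curve coefficient a
O = None                         # point at infinity

def point_add(P, Q):
    """Affine point addition; handles doubling and the neutral element."""
    if P is None: return Q
    if Q is None: return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None                                        # P + (-P) = O
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p   # point double (PD)
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p          # point add (PA)
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def montgomery_ladder(k, P):
    """Algorithm 3: one PA and one PD for every scanned bit of k."""
    P1, P2 = P, point_add(P, P)
    for bit in bin(k)[3:]:       # bits of k below the leading one
        if bit == '0':
            P1, P2 = point_add(P1, P1), point_add(P1, P2)
        else:
            P1, P2 = point_add(P1, P2), point_add(P2, P2)
    return P1
```

The fixed PA+PD pattern per bit is what permits the parallel schedule of Figure 3 and also makes the operation sequence independent of the bits of k.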
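The control flow of algorithm 4 can be sketched independently of the curve arithmetic. To keep the snippet self-contained and testable, plain integer addition stands in for the group operation, and the loop starts at the top bit, a functionally equivalent variant of the algorithm:

```python
# Shamir's trick (algorithm 4): compute kP + mQ with one shared
# double-and-add pass instead of two separate scalar multiplications.
# Integer addition stands in for point addition so the sketch is testable;
# on the ECDSA unit, add/double are the PA/PD operations of the Fp-ALU.

def shamir_trick(k, m, P, Q, add=lambda A, B: A + B, zero=0):
    PQ = add(P, Q)                                  # step 1: precompute P + Q
    X = zero                                        # step 2: neutral element
    for i in range(max(k.bit_length(), m.bit_length()) - 1, -1, -1):
        X = add(X, X)                               # step 4: X = 2X
        ki, mi = (k >> i) & 1, (m >> i) & 1
        if ki and mi:                               # step 5: add ki*P + mi*Q
            X = add(X, PQ)
        elif ki:
            X = add(X, P)
        elif mi:
            X = add(X, Q)
    return X

print(shamir_trick(13, 7, 5, 11))   # 13*5 + 7*11 = 142
```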
Fig. 3. Parallel Scheduling of PA and PD

Algorithm 4 Simultaneous multiple point multiplication
Input: Points P, Q ∈ E; integers k = Σ_{i=0}^{l−1} k_i 2^i and m = Σ_{i=0}^{l−1} m_i 2^i with k_i, m_i ∈ {0, 1} and k_{l−1} ∨ m_{l−1} = 1.
Output: Point X = kP + mQ ∈ E.
1: Precomputation: P + Q
2: X = O (point at infinity)
3: for i = l − 1 downto 0 do
4:   X = 2X
5:   X = X + (k_i P + m_i Q)
6: end for
7: return X

In contrast to algorithm 3, the central operations in steps 4 and 5 cannot be parallelized as they depend on each other. The maximum time consumption of the algorithm is therefore ((4|p| + 5) + |p| · ((5|p| + 7) + (4|p| + 5))) = 9|p|^2 + 16|p| + 5 clock cycles. This is nevertheless less than the 2 · (6|p|^2 + 6|p|) + (4|p| + 5) = 12|p|^2 + 16|p| + 5 cycles that two independent scalar multiplications would consume. Assuming uniform distribution, step 5 is omitted in 25% of the cases, leaving an estimated runtime of ((4|p| + 5) + |p| · ((5|p| + 7) + 0.75 · (4|p| + 5))) = 8|p|^2 + 14.75|p| + 5 clock cycles.

C. Signature and certificate control system

On top of the elliptic curve (EC) operations and the control FSM performing them, the actual signature algorithms and the certificate verification are implemented. This is done in a separate FSM (see figure 1), controlling the EC arithmetic FSM, some registers and the auxiliary hashing and random number generation. Figure 4 shows the sequence of operations of the signature verification. See algorithms 1 and 2 for the implemented procedures.

This FSM is the topmost layer of the signature module and provides a register interface for operands like messages, signatures, certificates and keys. For integration in an embedded system, it has to be wrapped to support the message format and to create the inputs selecting the function needed. An example of an integration is given in section VI.

Fig. 4. Procedure for signature verification

D. SHA2 Hashing Module

The SHA2 hashing unit provides the functions SHA-224 and SHA-256 according to the Secure Hash Algorithm (SHA) standard [17]. It is based on a freely available Verilog SHA-256 IP core (available as SHA IP Core at http://www.opencores.com), adapted with a wrapper performing precomputation of the input data and providing a simple register interface accepting data in 32 bit chunks. In addition, the core has been enhanced to support SHA-224. Figure 5 shows an overview.

Fig. 5. Schematic overview of hashing submodule

The unit processes input data in blocks of 512 bit, needing 68 clock cycles each at a maximum clock frequency of 120 MHz (after synthesis) and a resource usage of 2277 LUT/FF pairs. After finishing the operation, the result is available in a 256 bit register.

E. Pseudo-Random Number Generation

For ECDSA signature generation, a random value k is needed. To provide this k, the system incorporates a Pseudo-Random Number Generator (PRNG) consisting of two linear feedback shift registers (LFSR): one with 256 bit length, feedback polynomial x^255 + x^251 + x^246 + 1 and a cycle length of 2^256 − 1 [18], and a second LFSR with 224 bit length, feedback polynomial x^222 + x^217 + x^212 + 1 and a cycle length of 2^224 − 1.

The LFSRs occupy 480 LUT/FF pairs and allow for a maximum clocking of 870 MHz, although operated in the
system at the general system clock of 50 MHz. It is operated continuously to reduce the predictability of the produced numbers. The current register content is read out on demand.

F. Certificate Cache

Usually, digital signatures, or rather the respective public keys needed for their verification, are endorsed by a certificate issued by a trusted third party, a so-called certification authority (CA), to prove their authenticity. Verification of the certificate requires a signature verification itself and is therefore as complex as the main signature verification of the message. When communicating several messages with the same communication peer using the same signature key, the certificate can be stored, hence saving the effort of repetitive verification.

The system incorporates a certificate cache for up to 81 certificates stored in two BRAM blocks. It can be searched in parallel to the signature verification. Replacement of certificates is performed using a least recently used (LRU) policy.

Fig. 6. Wrapping of the signature system for V2X integration
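The cache's lookup-or-verify behaviour with LRU replacement can be sketched as follows (a software sketch: the capacity of 81 matches the text, while the key and certificate types are placeholders):

```python
from collections import OrderedDict

# Sketch of the certificate cache: a hit skips the costly certificate
# verification; a miss requires full verification before inserting, evicting
# the least recently used entry once the 81-entry capacity is exceeded.

class CertificateCache:
    def __init__(self, capacity=81):
        self.capacity = capacity
        self._certs = OrderedDict()            # key id -> verified certificate

    def lookup(self, key_id):
        cert = self._certs.get(key_id)
        if cert is not None:
            self._certs.move_to_end(key_id)    # mark as most recently used
        return cert                            # None -> full verification needed

    def insert(self, key_id, cert):
        self._certs[key_id] = cert
        self._certs.move_to_end(key_id)
        if len(self._certs) > self.capacity:
            self._certs.popitem(last=False)    # evict least recently used
```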
VI. APPLICATION EXAMPLE

The system offers complete ECDSA signature and certificate handling and can be used in a variety of embedded systems seeking authentication and security of communication. As an application example, we show the integration into a vehicle-to-X (V2X) communication system. V2X communication is an emerging topic aiming at information exchange between vehicles on the road and between vehicles and infrastructure like roadside units. This can be used to enhance safety on roads, optimize traffic flow and help to avoid traffic congestion.

To be able to base decisions and applications on information received from other vehicles, trustworthiness of this information is mandatory. To ensure the validity and authenticity of information, signature schemes are used to protect the messages broadcasted by the participating vehicles against malicious attacks. As V2X communication is at present in the process of standardization, no fixed settings are available yet, but the use of ECDSA is proposed in the IEEE 1609 Wireless Access in Vehicular Environments standard draft [19] as well as in the proposals of European consortia [20].

In the chosen realization, V2X communication is performed by a modular FPGA-based On-Board-Unit (OBU) presented in [21]. It consists of different functional modules connected by a packet-based on-chip communication system [22]. The signature verification system is integrated as a submodule and performs signature handling for incoming and outgoing messages automatically, being therefore transparent to the other modules except for the unavoidable processing latency. It is connected to two different on-chip communication systems, one transmitting unsecured messages and the other transmitting only secured messages containing signatures and certificates. A short description of the security system and its system integration is given in [23]. Figure 6 depicts the wrapped signature system with the interfacing to both communication structures.

The signature system accepts incoming messages, verifies signatures and certificates and passes only verified messages on to further processing. In case of an invalid signature, the outer system is informed. For outgoing messages, signatures are generated and the corresponding certificate is attached to the message, which is then passed on to the wireless interface.

A. Key container

In the V2X environment, privacy of participants is of major importance. As messages containing the vehicle type and values like current position, speed and heading are continuously broadcasted from twice to up to ten times a second, these messages could easily be used by an eavesdropper to trace participants. To counter such attempts, anonymity in the form of pseudonyms is used, which are changed on a regular basis. A number of pseudonyms for change are stored directly in the signature module's key container (see figure 6). It also contains the public keys of trusted certification authorities needed for verification of certificates. The change itself is triggered by a dedicated message sent to the signature processing system by the central information processing module of the C2X system. For all other modules, this privacy function is fully transparent as well.

VII. RESOURCES AND PERFORMANCE

The presented system has been realized using a Xilinx XC5VLX110T Virtex-5 FPGA on a Digilent XUP ML509 evaluation board. The following values refer to an implementation of the complete signature generation and verification unit with the interfacing for the application example given above. Table III shows an overview of the resource usage.

After integration of all submodules, the ECDSA unit allows for a maximum clock frequency of 50 MHz that has been successfully tested. Table IV shows signature verification performance values of the ECDSA unit at 50 MHz. Values for signature generation are given in table V.

In both tables, the worst-case values given are calculations based on the statistically estimated runtime of the algorithms
structures. based on the statistically estimated runtime of the algorithms

ReCoSoC 2010 May 17-19, 2010, Karlsruhe, Germany 119


                 Lut-FF Pairs   rel. res. usage   max. frequency
                 (Synthesis)    on FPGA           [MHz]
Signature unit   32,299         46.7%             50
ECDSA unit       24,637         36%               50.1
Hashing unit      2,277          3%               120.8
PRNG                482          0.7%             872.6
Fp-ALU           14,256         20%                41.2
Fp-ADD              858          1.2%              83
Fp-SUB              857          1.2%              92.8
Fp-MUL            2,320          3.4%              42.3
Fp-DIV            5,670          8.2%              73.4

TABLE III: RESOURCE USAGE ON A XC5VLX110T WITH 69,120 LUTS

Verification                   secp224r1   secp256r1
Compute time    worst-case     7.23        9.42
[ms/Sig]        simulated      7.17        9.09
Throughput      worst-case     138         106
[Sig/s]         simulated      140         110
Latency         worst-case     361,151     471,111
[cycles/Sig]    simulated      358,478     454,208

TABLE IV: PERFORMANCE OF SIGNATURE VERIFICATION AT 50 MHZ

Generation                     secp224r1   secp256r1
Compute time    worst-case     5.56        7.26
[ms/Sig]        simulated      5.45        7.15
Throughput      worst-case     180         138
[Sig/s]         simulated      184         140
Latency         worst-case     278,097     362,881
[cycles/Sig]    simulated      272,345     357,315

TABLE V: PERFORMANCE OF SIGNATURE GENERATION AT 50 MHZ

VIII. CONCLUSION AND FURTHER WORK

We presented a hardware-implemented subsystem for ECDSA signature processing for integration into embedded systems based on reconfigurable hardware. It can be integrated as a stand-alone subsystem performing transparent authentication functionality for communication systems. Applicability of the system has been shown using vehicle-to-X communication as a practical example.
The performance values presented in section VII are sufficient for applications like entry control systems or electronic payment where the number of communication peers is small. For V2X communication an even higher throughput of over 2,000 signatures per second is necessary. Further work therefore includes speeding up the computation and reducing the footprint. Also the use of low-cost FPGAs is required for the use in embedded systems.

REFERENCES

[1] FIPS, "Pub 197: Advanced Encryption Standard (AES)," Federal Information Processing Standards Publication, U.S. Department of Commerce, Information Technology Laboratory (ITL), National Institute of Standards and Technology (NIST), Gaithersburg, MD, USA, 2001.
[2] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120-126, 1978.
[3] FIPS, "Pub 186-3: Digital Signature Standard (DSS)," Federal Information Processing Standards Publication, U.S. Department of Commerce, Information Technology Laboratory, National Institute of Standards and Technology (NIST), Gaithersburg, MD, USA, 2009.
[4] N. Koblitz, "Elliptic curve cryptosystems," Mathematics of Computation, vol. 48, no. 177, pp. 203-209, 1987.
[5] V. S. Miller, "Use of elliptic curves in cryptography," in Advances in Cryptology: CRYPTO '85 Proceedings, pp. 417-426, 1986.
[6] eBACS, "ECRYPT Benchmarking of Cryptographic Systems," website, http://bench.cr.yp.to/ebats.html, 2010.
[7] M. Drutarovsky and M. Varchola, "Cryptographic system on a chip based on Actel ARM7 soft-core with embedded true random number generator," in DDECS '08: Proceedings of the 11th IEEE Workshop on Design and Diagnostics of Electronic Circuits and Systems, Washington, DC, USA, pp. 1-6, IEEE Computer Society, 2008.
[8] G. M. de Dormale and J.-J. Quisquater, "High-speed hardware implementations of elliptic curve cryptography: A survey," Journal of Systems Architecture, vol. 53, pp. 72-84, 2007.
[9] T. Güneysu and C. Paar, "Ultra high performance ECC over NIST primes on commercial FPGAs," in CHES, pp. 62-78, 2008.
[10] S. Ghosh, M. Alam, I. S. Gupta, and D. R. Chowdhury, "A robust GF(p) parallel arithmetic unit for public key cryptography," in DSD '07: Proceedings of the 10th Euromicro Conference on Digital System Design Architectures, Methods and Tools, Washington, DC, USA, pp. 109-115, IEEE Computer Society, 2007.
[11] K. Järvinen and J. Skyttä, "Cryptoprocessor for Elliptic Curve Digital Signature Algorithm (ECDSA)," tech. rep., Helsinki University of Technology, Signal Processing Laboratory, 2007.
[12] NIST, "Recommended elliptic curves for federal government use," tech. rep., National Institute of Standards and Technology, U.S. Department of Commerce, 1999.
[13] D. Hankerson, A. Menezes, and S. Vanstone, Guide to Elliptic Curve Cryptography. Springer-Verlag New York, 2004.
[14] P. L. Montgomery, "Modular multiplication without trial division," Mathematics of Computation, vol. 44, pp. 519-521, 1985.
[15] P. L. Montgomery, "Speeding the Pollard and elliptic curve methods of factorization," Mathematics of Computation, vol. 48, no. 177, pp. 243-264, January 1987.
[16] T. ElGamal, "A public key cryptosystem and a signature scheme based on discrete logarithms," IEEE Transactions on Information Theory, vol. 31, July 1985.
[17] FIPS, "Pub 180-2: Secure Hash Standard (SHS)," Federal Information Processing Standards Publication, U.S. Department of Commerce, National Institute of Standards and Technology (NIST), Gaithersburg, USA, 2002.
[18] R. Ward and T. Molteno, "Table of Linear Feedback Shift Registers," 2007. Electronically available at http://www.otagophysics.ac.nz/px/research/electronics/papers/technical-reports/lfsr_table.pdf.
[19] IEEE Vehicular Technology Society, ITS Committee, IEEE Trial-Use Standard for Wireless Access in Vehicular Environments (WAVE) - Security Services for Applications and Management Messages, 06.07.2006.
[20] COMeSafety Project, "European ITS Communication Architecture - Overall Framework," 2008. Available at www.comesafety.org.
[21] O. Sander, B. Glas, C. Roth, J. Becker, and K. D. Müller-Glaser, "Design of a vehicle-to-vehicle communication system on reconfigurable hardware," in Proceedings of the 2009 International Conference on Field-Programmable Technology (FPT09), pp. 14-21, IEEE, Dec. 2009.
[22] O. Sander, B. Glas, C. Roth, J. Becker, and K. D. Müller-Glaser, "Priority-based packet communication on a bus-shaped structure for FPGA-systems," in DATE, 2009.
[23] B. Glas, O. Sander, V. Stuckert, K. D. Müller-Glaser, and J. Becker, "Car-to-car communication security on reconfigurable hardware," in VTC2009-Spring: Proceedings of the IEEE 69th Vehicular Technology Conference, Barcelona, Spain, 2009.
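As a software companion to the unit evaluated above, the generate/verify operations it implements can be sketched with textbook ECDSA over secp256r1. This is a generic, unprotected illustration of ours (affine coordinates, no side-channel hardening), not the paper's hardware design:

```python
import hashlib
import secrets

# Textbook ECDSA over secp256r1 (NIST P-256) -- an illustrative software
# sketch of the sign/verify operations, not the paper's hardware design.
p = 0xffffffff00000001000000000000000000000000ffffffffffffffffffffffff
a = p - 3
n = 0xffffffff00000000ffffffffffffffffbce6faada7179e84f3b9cac2fc632551
G = (0x6b17d1f2e12c4247f8bce6e563a440f277037d812deb33a0f4a13945d898c296,
     0x4fe342e2fe1a7f9b8ee7eb4a7c0f9e162bce33576b315ececbb6406837bf51f5)

def point_add(P, Q):
    """Affine point addition; None is the point at infinity."""
    if P is None:
        return Q
    if Q is None:
        return P
    (x1, y1), (x2, y2) = P, Q
    if x1 == x2 and (y1 + y2) % p == 0:
        return None
    if P == Q:
        lam = (3 * x1 * x1 + a) * pow(2 * y1, -1, p) % p
    else:
        lam = (y2 - y1) * pow(x2 - x1, -1, p) % p
    x3 = (lam * lam - x1 - x2) % p
    return (x3, (lam * (x1 - x3) - y1) % p)

def scalar_mult(k, P):
    """Double-and-add scalar multiplication k*P (the dominant operation)."""
    R = None
    while k:
        if k & 1:
            R = point_add(R, P)
        P = point_add(P, P)
        k >>= 1
    return R

def hash_to_int(msg):
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % n

def sign(d, msg):
    """ECDSA signature generation with secret key d."""
    z = hash_to_int(msg)
    while True:
        k = secrets.randbelow(n - 1) + 1          # fresh per-signature nonce
        r = scalar_mult(k, G)[0] % n
        s = pow(k, -1, n) * (z + r * d) % n
        if r and s:
            return (r, s)

def verify(Q, msg, sig):
    """ECDSA signature verification with public key Q = d*G."""
    r, s = sig
    if not (0 < r < n and 0 < s < n):
        return False
    w = pow(s, -1, n)
    z = hash_to_int(msg)
    X = point_add(scalar_mult(z * w % n, G), scalar_mult(r * w % n, Q))
    return X is not None and X[0] % n == r
```

A keypair is a random d in [1, n-1] together with Q = scalar_mult(d, G); the scalar multiplications here are the operand-dependent operations that dominate the cycle counts in tables IV and V.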



A Secure Keyflashing Framework for Access
Systems in Highly Mobile Devices
Alexander Klimm, Benjamin Glas, Matthias Wachs, Jürgen Becker, and Klaus D. Müller-Glaser
Institut für Technik der Informationsverarbeitung, Universität Karlsruhe (TH)
email: {klimm,glas,mueller-glaser,becker}@itiv.uni-karlsruhe.de

Abstract—Public Key Cryptography enables entity authentication protocols based on a platform's knowledge of other platforms' public keys. This is particularly advantageous for embedded systems, such as FPGA platforms with limited or no read-protected memory resources. For access control to mobile systems, the public keys of authorized tokens need to be stored inside the mobile platform. At some point during the platform's lifetime these might need to be updated in the field due to loss or damage of tokens. This paper proposes a secure scheme for key flashing of public keys to highly mobile systems. The main goal of the proposed scheme is the minimization of online dependencies to Trusted Third Parties, certification authorities, or the like, to enable key flashing in remote locations with only minor technical infrastructure. Introducing trusted mediator devices, new tokens can be authorized and later their public key can be flashed into a mobile system on demand.

I. INTRODUCTION

Embedded systems in various safety-critical application domains like automotive, avionics and medical care perform more and more complex tasks using distributed systems like networks of electronic control units (ECUs). The introduction of Public-Key Cryptography (PKC) to embedded systems provides essential benefits for the production of electronic units needing to meet security requirements as well as for the logistics involved. Due to the nature of PKC, the number of keys that need to be stored in the individual platforms is minimized. At the same time only the private key of the platform itself needs to be stored secretly inside each entity - in contrast to symmetric crypto systems, where a secret key needs to be stored inside several different entities at the same time. In the context of PKC, if one entity is compromised, the others remain unaffected.
Computational efforts of cryptographic functionalities are very high and time consuming if carried out on today's standard platforms (i.e. microcontrollers) for embedded applications. Integrating security algorithms into FPGA platforms provides a high speed-up of demanding PKC crypto systems such as hyperelliptic curve cryptography (HECC). By adding dedicated hardware modules for certain parts of a crypto algorithm, a substantial reduction of computation time can be achieved [10], [9].
Besides encrypting or signing messages, PKC can be employed to control user access to a device via electronic tokens. Examples for this are Remote Keyless Entry (RKE) systems [17] in the automotive domain, or Hilti's TPS technology [2]. These systems incorporate contactless electronic tokens that substitute classical mechanical keys. The owner or authenticated user identifies himself to the user device (UD) by possession of the token. The UD and token are linked. Only if a linked token is presented to the UD is it enabled or access to the UD granted. In order to present a token to a UD, information needs to be exchanged between the two over a usually insecure channel. To prevent usage of a device or access to it by an unauthorized person, this exchanged data needs to be secured.
Authentication schemes based on Public Key Cryptography such as the Needham-Schroeder protocol [11], the Okamoto protocol [12], and the Schnorr protocol [16] provide authentication procedures where no confidential data needs to be transmitted. Secret keys need only be stored in the tokens and not in the UD, thus omitting the need for costly security measures in the UD. Only public keys need to be introduced into the UD (see section II). This operation certainly does need to be secured against attacks. For real-world operation this is done in the field, where the UD is not necessarily under the control of the manufacturer (OEM) and a live online connection to the OEM is not possible.
In this paper we propose a system to introduce public keys into FPGA-based user devices to pair these with a new token. The proposed key flashing method allows for authorization of the flashing process through an OEM. At the same time it can be carried out with the UD in the field and with no active online connection while flashing the key.
Introduction or flashing of new keys to an embedded device can be seen as a special case of a software update. Here the main focus is usually on standardization, correctness, robustness, and security. Recent approaches for the automotive area have been developed e.g. in the German HIS [8], [7] or the EAST-EEA [3] project. A general approach considering security and multiple software providers is given in [1]. Nevertheless, general update approaches are focused on the protection of IP and the provider against unauthorized copying, and less on the case that the system has to be especially protected against unwanted updates, as in our keyflashing scenario.
The remainder of this paper is structured as follows. In section II we present the basic application scenario, followed by a short introduction of public key cryptography in section III. The requirements for the targeted scenario are described in section IV. In section V the protocol is shown and some implementational results are given in section VI. We conclude in section VII.
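The "no confidential data transmitted" property of such schemes can be illustrated with a minimal challenge-response round: the UD sends a fresh nonce, the token signs it, and the UD verifies with the stored public key. The toy RSA parameters below are our own illustration (hopelessly undersized for real use), not any of the cited protocols:

```python
import hashlib
import secrets

# Minimal challenge-response sketch: UD authenticates a TRK using only the
# token's stored public key. Toy RSA parameters, illustration only.
P, Q = 65521, 65537                       # toy primes, far too small
N = P * Q
E = 17                                    # VK_TRK = (E, N), stored in the UD
D = pow(E, -1, (P - 1) * (Q - 1))         # SK_TRK, never leaves the TRK

def trk_sign(nonce: bytes) -> int:
    """Runs inside the token: sign the challenge with SK_TRK."""
    h = int.from_bytes(hashlib.sha1(nonce).digest(), "big") % N
    return pow(h, D, N)

def ud_authenticate() -> bool:
    """Runs inside the user device: a fresh nonce defeats replay attacks."""
    nonce = secrets.token_bytes(16)
    response = trk_sign(nonce)            # crosses the open channel
    h = int.from_bytes(hashlib.sha1(nonce).digest(), "big") % N
    return pow(response, E, N) == h       # verify with the stored VK_TRK
```

Only the nonce and the signature cross the channel; an eavesdropper learns nothing that helps impersonate the token later.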



II. APPLICATION SCENARIO

A mobile user device (UD) such as a vehicle, construction machine or special tool with restricted access is fitted with an FPGA-based access control system. This allows only the owner or certified users access to the device's functionalities or even the device itself. This is achieved with a transponder (TRK) serving as an electronic version of a mechanical key. The transponder is able to communicate with the UD via a wireless communication channel. The user device accepts a number of transponders. If one of these is presented to the user device, it authenticates the transponder and the device is unlocked, thus granting access.
Authentication is done using Public Key Cryptography (PKC). The public keys of the transponders are stored securely inside the user device, thus establishing a "guest list" of legal users of the device. During production two initial public keys are introduced into the user device.
In case of loss of a transponder it is desirable to replace it, particularly if the user device itself is very costly or actually irreplaceable. Since the user device is mobile, replacement of the transponder's public key usually needs to be done in the field. This might include very remote areas with little to no communication infrastructure.

III. BASIC PKC FUNCTIONALITIES

In 1976, Whitfield Diffie and Martin Hellman introduced PKC crypto systems [6]. Two different keys are used, one public (VK) and the other secret (SK). SK and VK are a fixed and unique keypair. It is computationally infeasible to deduce the private or secret key (SK) from the public key¹ (VK). With VK a message Mp can be encrypted into Mc but not decrypted with the same key. This can only be done with knowledge of SK. If an entity Alice wants to transmit a message M_Alice,plain to an entity Bob, she encrypts it with Bob's public key VK_Bob. Only Bob can retrieve the plain text from the encrypted message, by applying the appropriate decryption algorithm using his own secret key SK.
PKC can also be used to digitally sign a message. For this a signature scheme is used that is usually different from the encryption scheme. When signing a message the secret key is used, and the signature can be verified by using the according public key. In other words, if Bob wants to sign a message, he uses his own private key that is unique to him. This key is used to encrypt the hash value of the message M_Bob,plain. The resulting value {HASH(M_Bob,plain)}sig is transmitted together with M_Bob,plain. A receiver can validate the signature by using Bob's public key and retrieving HASH(M_Bob,plain). From M_Bob,plain the receiver can reconstruct the received hash value and compare it with the decrypted value. If both match, the signature has been validated.

¹In the case of signature schemes the public key is often called verification key.

IV. SITUATION AND REQUIREMENTS ANALYSIS

In our application scenario we have the following main entities:
• A user device UD that can only be accessed or used by an authenticated user
• A human user OWN, who is authorized to access or use UD if he possesses a legit token
• A transponder key token TRKorig originally linked to UD, and a second token TRKnew that shall additionally be flashed to UD
• The manufacturer OEM that produces UD
UD accepts a number of TRKs to identify an authenticated human user OWN of the UD. At least two tokens are linked to a UD by storing the respective public keys VK_TRK inside the UD. The OEM is initially the only entity allowed to write public keys into any UD.
Solely the public keys stored inside the UD are used for any authorization check of TRKs, using any PKC authentication protocol (e.g. [11], [12], [16]). The OEM's public key VK_OEM is stored in the UD as well.
OEM, TRK, and UD can communicate over any insecure medium, through defined communication interfaces.

A. Goals

A new TRKnew should be linked to a UD to substitute for an original TRKorig that has been lost or is defective. From this point on we call the process of linking TRKnew to a UD flashing. Flashing a TRK should be possible over the complete life cycle of the UD. When flashing, the UD is probably nowhere near the OEM's location, while flashing of a TRK needs to be explicitly authorized by the OEM. Any TRK can only be flashed into a single UD. Theft or unauthorized use of the UD resulting from improper flashing of the TRK needs to be prohibited.
In addition we demand that an online connection of UD and OEM during flashing of a TRK must not be imperative.

B. Security Requirements

The protocol shall allow dependable authorized flashing under minimal requirements while reliably preventing unauthorized flashing. Therefore it has to guarantee the following properties, while assuming communication over an unsecured open channel:
• Correctness: In absence of an adversary the protocol has to deliver the desired result, i.e. after complete execution of the protocol the flashing should be accomplished.
• Authentication: The flashing should only be feasible if both OEM and OWN have been authenticated and have authorized the operation.
• No online dependency: The protocol shall not rely on any live online connection to the OEM.
• Confidentiality: No confidential data like secret keys should be retrievable by an adversary.

C. Adversary model

We assume an adversary A, polynomially bounded in processing power and memory, that has access to all inter-device communications, meaning he can eavesdrop, delete, delay, alter, replay or insert any messages. We assume further that the adversary is attacking on software level without tampering with the participating devices.
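The sign-then-verify flow of section III ("encrypt the hash with the secret key, recover it with the public key") can be sketched with textbook RSA. The 17-bit primes below are our illustration only and hopelessly insecure:

```python
import hashlib

# Textbook illustration of section III: Bob signs by "encrypting" the SHA1
# hash of his message with his secret key; any receiver verifies with Bob's
# public key. Toy primes -- for illustration only.
p, q = 65521, 65537
n = p * q
e = 17                                  # public verification key VK_Bob
d = pow(e, -1, (p - 1) * (q - 1))       # secret signing key SK_Bob

def sign(msg: bytes) -> int:
    """{HASH(msg)}_sig: the digest, 'encrypted' with SK_Bob (reduced mod n
    here because the toy modulus is smaller than the 160-bit digest)."""
    h = int.from_bytes(hashlib.sha1(msg).digest(), "big") % n
    return pow(h, d, n)

def verify(msg: bytes, sig: int) -> bool:
    """Recompute the hash from msg and compare with the decrypted value."""
    h = int.from_bytes(hashlib.sha1(msg).digest(), "big") % n
    return pow(sig, e, n) == h
```

Transmitting the pair (msg, sign(msg)) corresponds to sending M_Bob,plain together with {HASH(M_Bob,plain)}sig as described above.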



Without choosing particular instances of the cryptographic primitives, we assume that the signature scheme used is secure against existential forgery of signatures and regard the hashing function used as a random oracle.

V. KEY FLASHING CONCEPT

The focus of the proposed key flashing protocol is the introduction of a public key VK_TRK into UD. We abstract over the implementation of communication interfaces and PKC systems as well as the immediate implementation of the devices and entities themselves.
Two basic flashing scenarios are conceivable. One is that TRKs are flashed directly by the OEM, either during production or via an online connection. We concentrate on the second one, flashing of TRKs through an authorized service point (SP) with no immediate online connection to the OEM.

A. Entities

In addition to the entities introduced above (UD, OWN, TRK and OEM) we use two additional participants, namely a service point SP and an employee SPE of this service point conducting the flashing procedure.
1) OEM - Manufacturer: The OEM manufactures the UD and delivers it to OWN. OWN is issued the corresponding TRKs linked to the UD. All UDs are obviously known to the OEM. The verification keys VK_TRK are stored by the OEM together with their pairing to the UD. Therefore the OEM knows which TRK is linked to which UD. We regard the entity OEM as a trusted central server with database functionality.
The OEM can store data, sign data with SK_OEM and send data. It possesses all cryptographic abilities for PKC-based authentication schemes and can thereby authenticate communication partners.
2) UD - User Device: UD is enabled only when a linked TRK is presented, by authenticating the TRK via a PKC authentication scheme. All linked TRKs' public keys VK_TRK are stored in the UD. Additionally the public key of the OEM, VK_OEM, is stored in the UD and cannot be erased or altered in any way, and UD has an OEM-issued certificate for its own public key, certifying it to be a genuine part. UD grants read access to all stored public keys. Write access to the memory location of VK_TRK is only granted in the context of the proposed key flashing scheme.
UD possesses all cryptographic abilities for PKC-based authentication schemes and can thereby authenticate communication partners.
3) OWN - Legal User: OWN is the legal user of UD and can prove this by possession of a linked TRKorig.
4) TRK - Transponder: TRK² possesses a keypair VK_TRK/SK_TRK for PKC functionality. It is generated inside the TRK to ensure that the secret key SK_TRK is known solely to TRK. Read access to VK_TRK is granted to any entity over a communication interface.
TRK possesses cryptographic primitives for PKC-based authentication schemes on the prover's side and can thereby be authenticated by communication partners.
5) SP - Service Point: SP is a service point in the field, such as a wholesaler, certified by the OEM. Typically a SP is a computer terminal. Access to the terminal is secured by means of a password as in standard PC practice. A SP can communicate with the OEM as well as with the UD. At the same time it is able to read the VK_TRK of any TRK.
Furthermore the SP constitutes a trusted platform, meaning that it always behaves in the expected manner for the flashing procedure, and accommodates a trusted module responsible for:
• storage of authorized VK_TRK
• a secure counter
• key management of authorized VK_TRK
SP possesses cryptographic primitives for PKC-based authentication schemes on the prover's and verifier's side and can thereby be authenticated by communication partners as well as authenticate communication partners.
6) SPE - Employee of Service Point: A SPE is a physical person that is operating the SP and is regarded as a potential attacker of the flashing operation. Access control of a SPE to the SP is enforced via password or similar. SPE is responsible for the system setup for the flashing application, consisting of establishing the communication links of UD, SP, TRK, and OEM if needed.
UD, TRKnew, and SP are under control of the SPE and the communication links to UD, TRKorig, TRKnew, SP, and OEM can be eavesdropped; the trusted module, however, cannot be penetrated.

B. Steps

The following steps are necessary to introduce a new VK_TRK into a UD avoiding online dependency. All of them are included in figure 1.
1) Delegation of trust to SP
2) Authorization of SPE by SP
3) Authorization of TRKnew by OEM
4) Introducing an authorized TRKnew into a UD
Authorization of SPE can be done e.g. via a password (knowledge) or by biometrical identification (physical property). The delegation of trust and the authorization of TRKnews are very closely related and described in section V-C. These steps form the first phase of the flashing process and can be done in advance without UD and OWN, but need a communication link to OEM. The final introduction of a new VK_TRK is the second phase and is detailed in section V-D. It no longer depends on interaction with OEM.

C. Trust Delegation and TRKnew Authorization

To be able to perform a key flashing procedure without an active link to OEM, a local representative has to be empowered by the OEM to perform the flashing, assuming that UD trusts

²TRKs can be manufactured by a supplier that has been certified by the OEM.



only the OEM to flash legit keys. This is done by presenting a credential to UD accounting that flashing is authorized by OEM. The exchange of this credential is denoted in the following as trust delegation.

Fig. 1. Flashing Scheme

In our case SP is the local representative. In order to request the flashing credential from the OEM, SPE has to be authenticated first to prevent abusive SP operations. Afterwards SP can connect to OEM and request a trust credential. This is issued only after mutual authentication and only to known partner service points. It is always valid for only a limited time and a limited number of flashing operations, to minimize the negative impact of compromised SPs. This is controlled and enforced by the trusted component inside SP using the secure unforgeable counter keeping track of the number of flashing cycles.
The public key of a TRKnew needs to be authorized by OEM. SP can read out VK_TRK and send it to OEM. If SP is allowed to flash TRKs into a UD, the OEM sends the authorized VK_TRK back to SP, where it is stored in SP's trusted module. Only a limited number of authorized TRKs can be stored at any given point in time.
As soon as a TRK has been authorized by the OEM, physical access to the TRK needs to be controlled. The authorization process of TRKs is the only step that demands a data connection between SP and OEM. This does not necessarily need to be an online connection, since the data could be transported via data carriers such as CDs, memory sticks, or the like.

D. Flashing of TRK

The actual flashing of a TRKnew to a given UD demands a valid new transponder TRKnew and authorization by OEM and OWN, the former either directly or delegated to SP using the credential introduced above, the latter done by presenting a valid linked TRKorig assumed to be solely accessible by OWN. If an online connection to OEM is available, the protocol can be performed by UD and OEM directly, with SP only relaying communication.
In either case UD and SP authenticate each other mutually using their respective OEM-issued certificates. UD additionally checks authorization by OWN, testing whether a valid linked token is present or not. If all these tests have passed, SP presents the authorized and OEM-signed new TRKnew to UD, which checks the OEM signature and credential. In the case of successful verification, UD accepts the new token TRKnew and adds VK_TRK to its internal list of linked tokens.

E. Entity Requirements

Regarding the proposed flashing protocols, certain requirements for the entities' functionalities have to be satisfied. An overview is given in table I.

                                     OEM   SP   UD   TRK
Initiate Communication                     •    •
Acknowledge Communication             •         •    •
Generation of Keypairs                          •    •
Signatures Generation                 •    •    •    •
Signature Verification                •    •    •
Random Number Generation              •    •    •
Datamanagement for suppliers          •
Datamanagement for User Devices       •
Datamanagement for Service Points     •
Datamanagement for TRKs               •    •
Secure Storage for delegated Trust         •
Knowledge of OEM's public key              •    •

TABLE I: ENTITY REQUIREMENTS

Data management is one of the key requirements in the protocol, in the sense that public key data needs to be stored. Secure storage for delegated trust has some additional requirements, such as intrusion detection to protect data from being altered in any way. At the same time it is mandatory that this data is always changed correctly as demanded by the protocol. Also the OEM's public key needs to be firmly embedded into the entity and must not be altered in any case, otherwise the OEM cannot be identified correctly from the protocol's point of view.

VI. IMPLEMENTATION

The protocol has been implemented as a proof of concept in a prototypical setup based on a network of standard PCs representing OEM and SP. Furthermore, Digilent Spartan3E Starter Boards with a Xilinx XC3S500 FPGA represent TRKs and UDs.
In figure 2 all implemented instances are depicted. TRK, SP, and UD have to be connected when flashing the key. The OEM connection needs to be established anytime prior to the flashing according to the proposed protocol; the OEM is connected via TCP/IP with the SP.
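The trusted module inside SP described in section V-C (bounded key store, secure counter, limited credential validity) can be modeled in a few lines. The class layout, limits, and names below are our assumptions for illustration, not the paper's implementation:

```python
import time

class TrustedModule:
    """Hypothetical model of SP's trusted module (section V-C): it stores
    OEM-authorized token keys and enforces the credential's limited
    lifetime and flashing budget via a decrement-only counter."""

    MAX_KEYS = 4  # bounded store for authorized VK_TRK (assumed limit)

    def __init__(self, flash_budget: int, valid_seconds: float):
        self._counter = flash_budget                 # secure counter
        self._expires = time.time() + valid_seconds  # credential lifetime
        self._authorized = {}                        # token id -> VK_TRK

    def store_authorized_key(self, token_id: str, vk_trk: bytes) -> None:
        """Store a VK_TRK that the OEM has authorized for flashing."""
        if len(self._authorized) >= self.MAX_KEYS:
            raise MemoryError("authorized-key store is full")
        self._authorized[token_id] = vk_trk

    def release_key_for_flashing(self, token_id: str) -> bytes:
        """Hand out a key for one flashing cycle, spending the budget."""
        if time.time() > self._expires:
            raise PermissionError("trust credential has expired")
        if self._counter <= 0:
            raise PermissionError("flashing budget exhausted")
        self._counter -= 1
        return self._authorized.pop(token_id)
```

Once the counter reaches zero or the credential expires, the SP must contact the OEM again, which matches the limited-validity delegation described above.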



Module lines of code percentage
Main application 1234 41.77
GUI 264 8.94
Cryptography 385 13.03
Interaction 383 12.97
Communication 545 18.45
Data Management 143 4.84
Total 2954 100
TABLE III
P ROPERTIES OF OEM C OMPONENT

Slices 1.791 of 4.656 (38%)


Fig. 2. Component Interaction Slices: FlipFlops uses 1.590 of 9.312 (17%)
Slices: LUTs used 1.941 of 9.312 (20%)
Key Length 1024 Bit BlockRAMs used 16 of 20 (80%)
Exponent 216 + 1 (65537) Equivalent Logic Cells 1.135.468
Padding Scheme PKCS#1 v1.5 Minimal clock period 18,777 ns
Signature Scheme PKCS#1 v1.5 Maximum clock frequency 53,257 MHz
Hashing Scheme used for signing SHA1
TABLE IV
TABLE II FPGA R ESOURCES
PARAMETERS FOR RSA-S YSTEM

functional encapsulation. For ease of usage a graphical user


All other communication is done over RS232 interfaces interface (GUI) is included as well in both entities.
that are available both on PC and the FPGA boards as well.
These can be substituted for other communication structures C. Transponder/UserDevice - FPGA platform
if needed, i.e. wireless transmitters. The targeted user device is an FPGA. To enable for
easy reuse of functionalities the exemplary TRK has been
A. Choice of cryptographic Algorithms implemented on FPGA as well, but can also be integrated
into a smart card or RFID chip as long as the appropriate
The proposed keyflashing concept demands asymmetric encryption and a cryptographic hash function. RSA [15] is chosen for encryption and signing, SHA1 for hash functionality. Both schemes are today's standards and have not been broken yet, but can be substituted in our implementation by more secure schemes if needed. RSA as well as SHA1 implementations are freely available as software and hardware IP for numerous platforms. The chosen RSA parameters are given in table II.

All signatures in our context are SHA1 hash values of data that has been encrypted according to the signing scheme PKCS#1 v1.5 [14]. Such a signature has a length of 128 byte when using a key length of 1024 bit and hash values of 160 bit length.

B. OEM/Service Point - Software Platform

Both components OEM and SP have been implemented on a standard PC. All functionality has been implemented in software under the .NET framework version 2.0 using C#. The .NET framework provides the Berkeley socket interface for communication over the PC's serial interface. At the same time it includes the Cryptography namespace, providing all needed cryptographic primitives, including hashing functions and a random number generator, based on the FIPS 140-1 certified Windows CryptoAPI. The software is modularized to enable an easy exchange of functional blocks and a seamless substitution of algorithms. Software modules communicate only over defined interfaces to enable for full

cryptographic primitives are provided. A MicroBlaze softcore processor is incorporated that provides all functionality including the cryptographic functions. Hardware peripherals such as an LCD controller have been integrated for debugging purposes. To enable the handling of the big numbers used in the cryptographic functions of the protocol, the libraries libtommath [5] and libtomcrypt [4] are used. Only the necessary components have been extracted from those libraries and integrated into TRK and UD.

D. Resource Usage

The resource usage of the components OEM and SP is very similar, since almost identical functional software blocks are used in both. Table III gives an exemplary overview of the lines of code of the OEM implementation. The memory footprint of the compiled OEM implementation is 129 KB (139 KB for the SP implementation). At start-up, 15400 KB of main memory is used. The execution times for RSA and SHA1 operations were measured on a PC (2 GHz, 1024 MB RAM) and are all in the range of milliseconds.

Resource usage of the FPGA-based components UD and TRK is given in table IV. By implementing all functionality on a MicroBlaze softcore, the hardware usage is quite moderate. On the other hand, the software footprint is 295 KB for the UD implementation, due to the non-optimized memory usage of the crypto library used.

Table V shows the execution times of the diverse protocol instances. The duration of parts of the protocol that are
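The signature sizes stated in the implementation description follow directly from the chosen primitives: a SHA1 digest is 160 bit (20 byte), and a PKCS#1 v1.5 signature occupies exactly one modulus-sized block, i.e. 128 byte for a 1024-bit key. A minimal Python check of these figures (a sketch for illustration only; the actual implementation uses C#/.NET and libtomcrypt):

```python
import hashlib

KEY_BITS = 1024   # RSA modulus size used in the prototype
HASH_BITS = 160   # SHA1 digest size

# A PKCS#1 v1.5 signature is as long as the RSA modulus.
signature_bytes = KEY_BITS // 8

# SHA1 digest length as produced by a real implementation.
digest = hashlib.sha1(b"keyflashing payload").digest()

print(signature_bytes)   # 128 byte signature
print(len(digest) * 8)   # 160 bit hash value
```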

ReCoSoC 2010 May 17-19, 2010, Karlsruhe, Germany 125


Protocol instance                          Duration (min:sec.ms)
ReadOut of Transponder                     01:32.000
Mutual Authentication of UD and TRK        03:14.000
Direct Keyflashing:
  Keyflashing to Transponder by OEM        23:50.000
Keyflashing by ServicePoint:
  Delegation of trust OEM to SP            00:00.350
  Transponderdelegation                    00:00.250
  Keyflashing to Transponder by SP         12:43.000

TABLE V
PROTOCOL EXECUTION TIMES

based solely on OEM and SP is in the range of a few milliseconds. As soon as the mobile devices (UD, TRK) process parts of the protocol, speed declines, since all crypto operations are currently carried out on an embedded microcontroller. The main factor here is the RSA decryption operation. With appropriate hardware support and a suitable choice of parameters and cryptosystem, substantial speedups can be achieved, as shown in [9].

VII. CONCLUSIONS AND FUTURE WORK

In this paper we proposed a scheme for flashing public keys into mobile FPGA devices under the constraint that no online connection to the system manufacturer (OEM) is mandatory. It is applicable to a variety of embedded systems that need to implement and enforce access or usage restrictions in the field. The scheme was implemented as a proof of concept using a combination of PC-based and FPGA-based protocol participants.

A. Security Analysis

Looking at the security of the proposed concept, some points can be identified where security relies on policies and implementation rules, while other issues are covered by design. Using PKC primitives and trusted computing approaches, the protocol ensures confidentiality of secret keys and mutual authentication of SP and OEM, OWN and UD, SP and UD, and SP and SPE. But due to the necessity of online independence, some assumptions have to be made to guarantee security. This is mainly the trustworthiness of the SP in combination with the physical protection of authorized TRKorig.

If these assumptions are broken, e.g. by theft of an authorized TRK, the corresponding SP and the SPE password, unauthorized flashing may be possible. As a countermeasure, the usage of the protocol can be adapted to dilute the effects of such events. The number of allowed authorized TRK should thus be as low as possible, the SP should be implemented using trusted components, and, based on a trusted platform, secrets should be especially protected against misuse by a physical attacker.

B. Future Work

Flashing speed is of utmost importance in a real-world implementation. To allow for a real-world integration of the proposed flashing schemes, optimizations regarding usage and speed of the computational units involved are needed. In the current prototype the MicroBlaze processor has been used for simplicity. A speedup can be achieved with a hardware/software codesign as done in [10]. For maximal speed a full FPGA hardware implementation is desirable, as has been done in [13] for the cryptographic functionality of a HECC system.

User authentication via PKC can be a solution for dedicated function enabling. Different functionalities can be configured onto an FPGA using partial dynamic reconfiguration. By either allowing or prohibiting the configuration of a certain bitstream depending on the user employing the system, usage policies could be enforced, thus opening up new business models for suppliers of FPGA-based systems.

One crucial point is the protection of the TRK's public key stored in the UD against physical attackers. The possibility of countering attacks that might alter stored keys on a physical level needs to be investigated in the future as well.

REFERENCES

[1] André Adelsbach, Ulrich Huber, and Ahmad-Reza Sadeghi. Secure software delivery and installation in embedded systems. In Robert H. Deng, editor, ISPEC 2005, volume 3439 of LNCS, pages 255–267. Springer, 2005.
[2] Hilti Corporation. Electronic theft protection. Available electronically at www.hilti.com, 2007.
[3] Gerrit de Boer, Peter Engel, and Werner Praefcke. Generic remote software update for vehicle ECUs using a telematics device as a gateway. Advanced Microsystems for Automotive Applications, pages 371–380, 2005.
[4] Tom St. Denis. Libtomcrypt. http://libtomcrypt.com/.
[5] Tom St. Denis. Libtommath. http://math.libtomcrypt.com/.
[6] Whitfield Diffie and Martin E. Hellman. New directions in cryptography. IEEE Transactions on Information Theory, IT-22(6):644–654, 1976.
[7] Herstellerinitiative Software (HIS). HIS-Presentation 2004-05, 2005. Available electronically at www.automotive-his.de.
[8] Herstellerinitiative Software (HIS). HIS Security Module Specification v1.1, 2006. Available electronically at www.automotive-his.de.
[9] Alexander Klimm, Oliver Sander, and Jürgen Becker. A MicroBlaze specific co-processor for real-time hyperelliptic curve cryptography on Xilinx FPGAs. Parallel and Distributed Processing Symposium, International, 0:1–8, 2009.
[10] Alexander Klimm, Oliver Sander, Jürgen Becker, and Sylvain Subileau. A hardware/software codesign of a co-processor for real-time hyperelliptic curve cryptography on a Spartan3 FPGA. In Uwe Brinkschulte, Theo Ungerer, Christian Hochberger, and Rainer G. Spallek, editors, ARCS, volume 4934 of Lecture Notes in Computer Science, pages 188–201. Springer, 2008.
[11] Roger M. Needham and Michael D. Schroeder. Using encryption for authentication in large networks of computers. Commun. ACM, 21(12):993–999, December 1978.
[12] Tatsuaki Okamoto. Provably secure and practical identification schemes and corresponding signature schemes. In CRYPTO '92: Proceedings of the 12th Annual International Cryptology Conference on Advances in Cryptology, pages 31–53, London, UK, 1993. Springer-Verlag.
[13] J. Pelzl, T. Wollinger, and C. Paar. Embedded Cryptographic Hardware: Design and Security, chapter Special Hyperelliptic Curve Cryptosystems of Genus Two: Efficient Arithmetic and Fast Implementation. Nova Science Publishers, NY, USA, 2004. Editor: Nadia Nedjah.
[14] RSA Laboratories Inc. RSA Cryptography Standard PKCS No. 1. Available electronically at http://www.rsasecurity.com/rsalabs.
[15] R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public-key cryptosystems. Communications of the ACM, 21(2):120–126, 1978.
[16] Claus P. Schnorr. Efficient identification and signatures for smart cards. In CRYPTO '89: Proceedings on Advances in Cryptology, pages 239–252, New York, NY, USA, 1989. Springer-Verlag.
[17] Henning Wallentowitz and Konrad Reif, editors. Handbuch Kraftfahrzeugelektronik: Grundlagen, Komponenten, Systeme, Anwendungen. Vieweg, Wiesbaden, 2006.



Teaching Reconfigurable Processor: the Biniou
Approach
Loïc Lagadec ∗ †, Damien Picard ∗ † and Pierre-Yves Lucas
∗ Université Européenne de Bretagne, France.
† Université de Brest ; CNRS, UMR 3192 Lab-STICC, ISSTB,
20 avenue Le Gorgeu
29285 Brest, France.
loic.lagadec@univ-brest.fr

Abstract—This paper presents the Biniou approach, which we have been developing in the Lab-STICC, together with a ”software for embedded systems” master course and its related project. This project addresses building up a simple RISC processor that supports an extensible instruction set thanks to its reconfigurable functional unit. The Biniou approach covers tasks ranging from describing the RFU and synthesizing it as VHDL code to implementing applications over it.

I. INTRODUCTION

Fig. 1. Overview of the Biniou flow (inputs: ADL model, C code, memory access patterns, optimizing context; outputs: VHDL, Verilog/EDIF/BLIF netlists, bitstream, metrics).

a) The Master curriculum LSE: The master curriculum LSE (Software for Embedded Systems) opened two years ago at the University of Western Brittany. It addresses emerging trends in embedded systems, mainly from a software point of view, while warmly welcoming EE (Electronic Engineering) students. It focuses strongly on RC (Reconfigurable Computing) based embedded systems, with a set of courses teaching hardware/configware/software co-design.

b) The underlying research support: The research group behind this initiative is the Architectures & Systems team of the Lab-STICC (UMR 3192). This group has long-standing expertise in designing parallel reconfigurable processors (the Armen [1] project was initiated in 1991) but focuses on CAD environment development (the Madeo framework [2]).

c) The Madeo framework: The Madeo framework is an open and extensible modeling environment that allows representing reconfigurable architectures. It acts as a one-stop shopping point providing basic functionalities to the programmer (place&route, floorplanning, simulation, etc.). Based on Madeo, several commercial and prospective architectures have been modeled (Virtex, reconfigurable datapaths, etc.), and some algorithms have been tailored to address nano-computing architectures (WISP, etc.).

d) Biniou: The Madeo project ended in 2006 while being integrated as facilities in a new framework. This new framework, named Biniou, embeds additional capabilities such as, on the hardware side, VHDL export of the modeled architecture, and, on the software side, wider interchange formats and extended synthesis support. Biniou targets reconfigurable SoC design and offers middleware facilities to favor a modular design of reconfigurable IPs within the SoC. Hence Biniou enhances durability and modularity, and shortens the development time of programming environments.

Figure 1 provides an overview of Biniou. On the application side (right), an application is specified as C code, memory access patterns and some optimizing contexts we use to tailor the application. This side outputs post-synthesis files conforming to mainstream formats (Verilog, EDIF, BLIF). The results can be further processed by the Biniou P&R layer to produce a bitstream. Of course the bitstream matches the specification of the underlying reconfigurable target, provided the target is modeled using a specific ADL. From this model the P&R layer can operate as previously mentioned, and a behavioral VHDL description of the target is generated for simulation purposes and FPGA implementation. Also, some debugging facilities can be added, either in the architecture itself or as parts of the application [3].

e) From research to teaching: Biniou has been exercised as a teaching platform for Master 2 students of the LSE class. This happened in the one-year DOP (standing for Description and Physical Tools) course, during which students had to perform practical sessions and to lead a project covering VHDL hand-writing, reconfigurable architecture modeling and programming, code generation, and module assembly to exhibit a simple processor with a reconfigurable functional unit allowing its instruction set to be extended.

The rest of this paper is organized as follows: section II describes the DOP course, section III introduces the project, while section IV points out the results that came out of the project. Finally, some concluding remarks and perspectives are presented in section V.
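The core idea of the project — a processor whose instruction set is extended by first configuring and then invoking a reconfigurable functional unit — can be pictured with a toy model. This is a hypothetical sketch for illustration only (the class and method names are invented, and a real bitstream is replaced by a plain function); it is not Biniou or course code:

```python
# Toy model of a RISC core with a reconfigurable functional unit (RFU).
# Names and structure are illustrative only; they are not Biniou APIs.

class ToyCore:
    def __init__(self):
        self.rfu = None  # currently configured RFU operation

    def set_rfu(self, bitstream):
        # "Configuring" the RFU: here the bitstream is simply a Python
        # function standing in for a real configuration bitstream.
        self.rfu = bitstream

    def exec_rfu(self, a, b):
        # Executing whatever operation is currently configured.
        if self.rfu is None:
            raise RuntimeError("RFU not configured")
        return self.rfu(a, b)

core = ToyCore()
core.set_rfu(lambda a, b: a ^ b)      # extend the ISA with a XOR-like op
print(core.exec_rfu(0b1100, 0b1010))  # 6
```

The configure-then-execute pair mirrors the spatial-execution scheme described above: the core itself stays fixed, while the operation it accelerates is swapped at run time.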



II. AN OVERVIEW OF THE DOP MASTER COURSE

A. Students profile

1) Former curriculum: The master gathers students from both CS (Computer Science) and EE curricula. The current class size is 12, with students coming from half a dozen countries. Half of the students are former local students, hence have a solid local background in CS but lack training in electronic system design.

As this situation could carry some risks, we chose to make students work on the project in pairs. In this way, beyond simply balancing the prerequisites so that all pairs are offered an equal chance to succeed, we intended to favor incidental learning as pointed out by Shanck [4].

2) Issues for foreign students: Unfortunately, one remaining difficulty the foreign students must face lies in obtaining the visa to enter the country, which can take an unpredictably long time. We activated a facility that the remote-teaching service of the UBO offers, namely a secured collaborative web site on which all the slides and supporting material remain at the students' disposal as soon as they get an access code — which we provide early enough for the site to serve as a remedial class. The site also offers monitoring features (recent activities such as document downloads or uploads of due work are logged).

B. Course organization

Courses are organized around two main topics covering the hardware (architectures) and software (CAD tools and compiler basics) aspects of reconfigurable computing. Traditionally, these topics are not grouped in the same curriculum and are taught either in CS or in EE. The DOP courses enable students to build from their previous knowledge a cross-expertise giving a complete vision of the domain.

1) An overview of the reconfigurable computing landscape: This course aims at giving a global overview of the reconfigurable computing (RC) landscape. It focuses on both industrial and academic architectural solutions and is structured in three parts:
• Overview of RC for embedded systems (2 sessions)
• Virtualization techniques for RC (2 sessions)
• Modeling and generation of reconfigurable architectures (1 session)

a) Overview of RC for embedded systems: This part introduces the increasing performance needs of embedded systems executing digital signal processing applications, such as smartphones, set-top boxes, HD cameras and so on. As a major part of the students are not familiar with the concept of computation acceleration, a comparison between Moore's and Shannon's laws illustrates the need for architectures and computation models different from traditional GPPs.

A first solution proposed to the students for tackling Shannon's law is to implement intensive tasks as ASICs in order to exploit instruction parallelism. However, the flexibility, NRE cost and short life cycle issues related to ASICs are pointed out, leading to the search for a trade-off between flexibility and performance, with reconfigurable computing as an answer. Three types of architectures are presented: fine-grained FPGAs, coarse-grained architectures and reconfigurable processors.

Before describing the main architectural characteristics, general definitions are given for reconfigurable computing, reconfigurable architectures and the notion of granularity. FPGAs are introduced by presenting the FPGA marketplace and the main suppliers. We also show ASIC cost trends to justify the growing interest in reconfigurable technology, and an evolution of the FPGA application domain, which tends to diversify over time.

In order to give the students the main architectural concepts behind FPGAs, we present a first simple architecture: a mesh of basic processing elements (PEs), each composed of one 4-LUT with its output possibly latched. The combination of the basic blocks (LUT, switch, buses and topology) is presented as a template to be extended and complexified (in terms of routing structure and processing elements) for building real FPGAs. A more complex view is given by an example from industry, a Xilinx Virtex-5, with an emphasis on locating the template's basic blocks within the Xilinx schematics. As a result, students are able to locate the essential elements for a better understanding of state-of-the-art architectures.

The coupling of a reconfigurable unit with a processor (known as a reconfigurable processor) is presented as another alternative for accelerating intensive DSP tasks. The concept of instruction set metamorphosis [5] is defined and a set of architectures is described, for example P-RISC [6], Garp [7], XiRISC [8] and Molen [9]. A specific focus is set on the Molen programming model and its architectural organization. The Molen approach is presented as a meeting point between the software domain (sequential programming and compiler) and the hardware domain (specific instructions designed in hardware).

Drawbacks of fine-grained architectures, such as low computation density and routing congestion, are highlighted to introduce coarse-grained architectures. This type of reconfigurable architecture is first presented as a specialization of FPGAs (in terms of routing resources and processing elements) suited to the DSP application domain. The architectures presented are KressArray [10], Piperench [11], PACT XPP [12] and Morphosys [13]. Programming model issues are discussed with a comparison between software-oriented approaches (generally using subsets of C) and hardware approaches (netlist-based descriptions). A case study of the DREAM architecture is presented with an emphasis on the compiler-friendly approach of the tools targeting the PicoGA [14], [15].

b) Virtualization techniques for RC: The second part of the course aims at giving details on advanced uses of reconfigurable architectures. The motivation is to alleviate, thanks to virtualization techniques, some well-known limitations of reconfigurable architectures: the limited amount of resources, the lack of a high-level programming model and the non-portability of bitstreams.

In a first step, virtualization is defined as a general concept originally applied to computers (e.g. virtual memory and virtual machines).

In order to overcome the resource amount limitation, time multiplexing is defined and its support by reconfigurable architectures is detailed. This covers temporal partitioning, dynamic reconfiguration and the different configuration plan structures: multi-context and partial reconfiguration. These notions are illustrated by architectures such as DPGA [16] and WASMII [17] and their programming flows.

The computation model for reconfigurable computing is addressed by two different approaches. The first approach describes programming models directly supported by hardware, such as Piperench or STREAM [18]. These architectures provide facilities — CAD tools and runtime management — for virtualizing resources. Application portability is ensured by the availability of the programming model across a device family, similarly to a GPP ISA family.

A second approach is to use soft-cores, comparable to software virtual machines. Soft-cores are implemented on off-the-shelf reconfigurable devices, avoiding a costly ASIC design but sacrificing some performance. Illustrative examples are given by Mitrion [19] and Quku [20].

c) Modeling and generation of reconfigurable architectures: This course aims at giving the students a deep understanding of FPGA internal behavior.

In a first part, every element of a basic FPGA (used as an example in the first course) is detailed and a VHDL behavioral description is explained. It starts from atomic elements, such as pass gates and multiplexers, and shows how to interconnect them for building up input/output blocks and configurable logic blocks. A daisy-chain architecture is detailed as well as a configuration controller.

A second part describes the Biniou generation of the architecture from an ADL description. This part makes students ready for the practical sessions and for the project.

An FPGA is described using an ADL, increasing the level of abstraction compared to a VHDL description. The configuration plan is described as a set of domains permitting to exploit partial reconfiguration. The approach relies on model transformation, with automatic VHDL code generation from the high-level description.

2) Software part: The software part of the courses addresses state-of-the-art tools and algorithms on the one hand, and locally designed tools on the other hand. The key idea is that students are naturally attracted to learning classical (or vendors') tools so that they can bring a direct added value to any employer in the field, hence get an interesting and well-paid job.

However, tools obviously encapsulate the whole domain-specific expertise, and letting students "open the box" closes the gap between "lambda users" and experts. This takes up the challenge of providing a valuable and innovative curriculum.

Several computational models are introduced, with a special highlight on temporal versus spatial computing. In addition, several schemes mixing temporal and spatial computing are presented, such as reconfigurable functional units, reconfigurable co-processors, etc.

The link takes place when introducing the Molen paradigm, with hardware blocks running as accelerated functions that provide results back to the processor.

To implement accelerated functions, a resource allocation is required, which remains highly dependent on the hardware platform.

We introduce Data Flow Graph (DFG) and Control Data Flow Graph (CDFG) representations that act as entry points to the resource allocation.

Here happens the circuit synthesis, and some algorithms (A∗ point-to-point routing, pathfinder global routing, placement, TCG floorplanning, etc.) and backend tools (Madeo, VPR, etc.) are presented.

C. A Morpheus inheritance

Biniou partially stems from our contribution to the Morpheus FP6 project [21], standing for ”Multi-purpose dynamically reconfigurable Platform for Intensive Heterogeneous processing” (2006-2009). MORPHEUS addresses innovative solutions for embedded computing based on a dynamically reconfigurable platform and adequate tools. The main challenge of the ”platform work package” was to propose a smart reconfigurable computing hardware carefully crafted for use with the methods and tools developed in the ”methodologies and tools” work package. The MORPHEUS toolset provides both an effortless management of reconfigurable accelerated functions within a global application C code, and an easy design of the accelerated functions through high-level description and synthesis. This is where our contribution — at the birth of Biniou — mainly appears.

Our second contribution was within the ”training work package”, which intends to set up courses and to shape the curricula for hardware and software engineers targeting heterogeneous and reconfigurable SoCs. The LSE Master is one practical result coming out of Morpheus.

D. Practical sessions

Practical sessions are organized as three activities. The first activity is to gather documentation and publications related to a particular aspect of the course; the students have to present their short bibliographic study individually in front of the whole class.

The second activity is centered on the algorithms used for implementing an application over a reconfigurable architecture: point-to-point and global routers, floorplanners, placers. Some data structures such as transitive closure graphs are introduced later on in order to point out the need for refactoring and the use of design patterns [22]. This bridges the software expertise to the covered domain (CAD tools for RC). Another place where this link appears is when designing a BLIF CABA netlist simulator in a couple of hours by simply combining well-known software design patterns: observer (propagation of change), state (current value, and future value for flip-flops), composite (both hierarchical combinations of modules and single netlists appear in the same way).

The third activity is related to tools and formats. Three slots are dedicated to VHDL, which most of the students do not know. Manual description of fine-grained reconfigurable architecture



is introduced within this amount of time. Some sessions are dedicated to practicing the required tools, during which students manipulate logic synthesis tools (SIS, ABC), file format conversions (PLA, BLIF, EDIF), behavioral synthesis, and CDFG-to-Verilog translation according to some data access pattern (Biniou).

Biniou lets students create their own FPGA, which is further reused in the project in a tuned-up version, and highlights the configuration scheduling issue.

We also offer a Web-based tool [23] to output RTL netlists, which students use to exercise several options when generating their netlists.

III. THE PROJECT

A. Overview of the project

The project consists in designing a simple RISC processor that can perform spatial execution through a reconfigurable functional unit. This scheme conforms to the Molen paradigm. Figure 2 illustrates the schematic view of the whole processor, including the RFU.

The processor supports a restricted instruction set (table I). The instructions SET and EXRU respectively configure and activate the RFU.

Opcode  Code
NOT     0000
AND     0001
OR      0010
XOR     0011
LSL     0100
LSR     0101
ADD     0110
SUB     0111
CMP     1000
JMP     1001
JE      1010
JNE     1011
SET     1100
EXRU    1101
LOAD    1110
STORE   1111

TABLE I
INSTRUCTION SET OPCODES.

In order to keep the project reasonably simple, we restrict the use of the RFU to implementing DFGs on the one hand, and we provide students with the Biniou framework on the other hand. Restricting the use of the reconfigurable part to a functional unit (as depicted by figure 3) also mitigates the complexity of the whole design. This covers the need of being reachable by average students while preserving the ability to arouse top students' curiosity by offering a set of interesting perspectives for further developments.

We believe that this project is a perfect starting point to let students build and stress new ideas in many disciplines related to RC computing, such as spatial versus temporal execution, architectures, programming environments and algorithms.

1) Context: This project takes place during the fall semester, from mid-October to early January. A noticeable point is that almost no free slots within the timetable are dedicated to this project, which overlaps with courses as well as with "concurrent" projects. The intent is to stress students and make them aware of handling competing priorities.

To prevent students from postponing the management of this project, we use the Moodle platform for monitoring activities, collecting deliverables and broadcasting updates, comments and additional information.

2) Expected Deliverables: We define three milestones and three deliverables. The milestones are practical sessions in front of the teacher, one of which is shared with another course, as students learn and practice logic synthesis from behavioral code.

Three main milestones:
M1: RISC processor, running its provided test programs
M2: RFU, with Galois-field based operations implemented as bitstreams
M3: Integration, final review

3) Schedule: The schedule is provided during the project "kick-off". The collaborative platform allows specifying time windows during which deliverables can be submitted. Reminders can be sent by mail when the deadline is approaching. Once the deadline has passed, overdue deliverables incur a penalty per extra half-day.

Fig. 2. Schematic of the entire reconfigurable processor.

Fig. 3. RFU is interfaced with the processor registers through an adapter.
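The 16-bit instruction word of table II combines the 4-bit opcodes of table I with a memory-access field MA and two operand fields OP1/OP2 (field widths of 2, 5 and 5 bits are inferred from the bus widths annotated in figure 2). A small hypothetical decoder — not the students' VHDL — makes the layout concrete and can be cross-checked against the bit-counter program of table III:

```python
# Hypothetical decoder for the project's 16-bit instruction word
# (table II): opcode[15:12], MA[11:10], OP1[9:5], OP2[4:0].
# Field widths are inferred from figure 2; this is not the course code.

OPCODES = ["NOT", "AND", "OR", "XOR", "LSL", "LSR", "ADD", "SUB",
           "CMP", "JMP", "JE", "JNE", "SET", "EXRU", "LOAD", "STORE"]

def decode(word):
    """Split a 16-bit instruction into (mnemonic, MA, OP1, OP2)."""
    opcode = (word >> 12) & 0xF
    ma     = (word >> 10) & 0x3
    op1    = (word >> 5)  & 0x1F
    op2    = word         & 0x1F
    return OPCODES[opcode], ma, op1, op2

# Cross-check against the first line of the bit-counter program
# (table III): RAM(0) <= "1110010000000000"  -- LOAD R0,0
print(decode(0b1110010000000000))  # ('LOAD', 1, 0, 0)
```

Under this reading, `"0110010000011111"` decodes to ADD with OP1 = R0 and OP2 = 31, matching the `--ADD R0,31` comment of table III.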



15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
Opcode MA OP1 OP2
TABLE II
I NSTRUCTION LAYOUT.

to reward test-first approach. When designing a processor the


same approach applies but at a wider granularity. Hence, we
distributed some testbench programs, one of which is provided
in table III.
Fig. 4. User interface of the configuration scheduler. Configuration pages
(part A) are scheduled (part B) for producing the configuration controller Analyzing at specific timestamps (including after the appli-
program initialized by a generated testbench. cation stops) the internal states (some signals plus registers
contents), leads to design scoring.

B. Provided facilities
RAM(0)<=1110010000000000;
In order to make this project feasible on time, some building −−LOAD R0,0;Initializing
blocks are implemented during practical sessions or ready-to- RAM(1)<=0110010000011111;
use elements are given to students as starting points. −−ADD R0,31;R0=31
RAM(2)<=0110010000011111;
The processor is partially implemented during practical −−ADD R0,31;R0=62
sessions dedicated to VHDL. Besides designing basics com- RAM(3)<=0110010000011111;
binatorial and sequential circuits, advanced exercises lead −−ADD R0,31;R0=93
RAM(4)<=0110010000000111;
students to design an ALU controlled by a FSM. The goal −−ADD R0,7;R0=100
is to exercise the execution of a simple dataflow graph over RAM(5)<=1111010000000000;
the ALU. At the end of these sessions, they have implemented −−STORE [R0],0; Memory init
RAM(6)<=1110010000101011;
a first prototype of a controller and a complete ALU. −−LOAD R1,11; Selecting op
Concerning the reconfigurable part, Biniou facilities allevi- RAM(7)<=1110010001000001;
ate the students’ workload. Biniou generates a reconfigurable −−LOAD R2,1;Mask
matrix specified in Madeo-ADL, and we provide skeletons of
element descriptions (e.g. IOB) that students finalize during
a practical session. Then students generate a reconfigurable
matrix.
A programmable configuration controller is also provided
which interfaces the reconfigurable matrix. This controller
manages partial reconfiguration and configuration page
scheduling. The manual scheduling of the configuration pages
comes from a user interface depicted in figure 4. For validation
purposes, a global testbench is also generated, enabling tests
of the reconfigurable matrix configuration and execution. It
instantiates the matrix and the configuration controller and
performs initialization steps such as sending the scheduling
program to the controller.
1) Processor soft-core: A preliminary version with missing
control structures was provided in order to ensure a minimal
compatibility of the designs. Obviously, the matter here was
to ease evaluation from a scholar point of view as well as to
force students to handle a kind of legacy system and refactoring
rather than a full re-design.
We also provided the instruction set and opcodes (but no
compiler). In an ideal world, and with a more generous amount
of time to spend on the project, as the design is highly modular,
building a working design by picking best-fit modules out of
several designs would also have been an interesting exercise.
2) Decoder: It outputs signals from the input instruction
according to the layout in table II.
a) Testbench program: Students are familiar with agile
programming, and we assume a software-coding context. An
example program, a bit counter, is given in Table III:

RAM(8)  <= "1110010010000000"; -- LOAD R4,0    ; bit counter
RAM(9)  <= "1110010010110000"; -- LOAD R5,16   ; loop counter
RAM(10) <= "1000010010100000"; -- CMP R5,0     ; loop ending?
RAM(11) <= "1010010000010100"; -- JE 20        ; (OP2 as addr) jump to end
RAM(12) <= "1110000001100001"; -- LOAD R3,R1   ; copy reference value
RAM(13) <= "0001000001100010"; -- AND R3,R2    ; apply mask
RAM(14) <= "1000010001100001"; -- CMP R3,1     ; bit set to 1?
RAM(15) <= "1011010000010001"; -- JNE 17       ; (OP2 as addr) else jump
RAM(16) <= "0110010010000001"; -- ADD R4,1     ; increment if true
RAM(17) <= "0101010000100001"; -- LSR R1,1     ; next bit
RAM(18) <= "0111010010100001"; -- SUB R5,1     ; decrement
RAM(19) <= "1001010000001010"; -- JMP 10       ; (OP2 as addr) jump to start
RAM(20) <= "1111000000000100"; -- STORE [R0],R4 ; save result
RAM(21) <= "1001010000010101"; -- JMP 21       ; infinite loop

TABLE III
EXAMPLE PROGRAM: A BIT COUNTER.

3) Reconfigurable FU design: Designing a RFU goes
through sizing the matrix, defining a basic cell and isolating
border cells that deserve special attention because their
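The intended semantics of the Table III bit counter can be mirrored in software to cross-check the assembly. The sketch below is a hypothetical re-implementation of the loop (mask the current bit, conditionally increment, shift right, decrement the loop counter), following the assembly comments; it is not a model of the soft-core itself, and the register names are taken from those comments.

```python
def count_bits(value, width=16):
    """Software model of the Table III bit-counter loop.

    R1 holds the value under test, R2 a 1-bit mask, R4 the bit
    counter and R5 the loop counter, as in the assembly comments.
    """
    r1, r2, r4, r5 = value, 1, 0, width   # LOAD R4,0 / LOAD R5,16
    while r5 != 0:                        # CMP R5,0 / JE end
        r3 = r1 & r2                      # LOAD R3,R1 / AND R3,R2
        if r3 == 1:                       # CMP R3,1 / JNE skip
            r4 += 1                       # ADD R4,1 ; increment if true
        r1 >>= 1                          # LSR R1,1 ; next bit
        r5 -= 1                           # SUB R5,1 ; decrement
    return r4                             # STORE [R0],R4 ; saved result

print(count_bits(0b1011000011110001))  # → 8
```

Running such a model against the simulated processor output is a cheap way for students to validate the testbench program before the long RTL simulation.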

ReCoSoC 2010 May 17-19, 2010, Karlsruhe, Germany 131


structure is slightly different from the common template. The
basic cell is both used as is for the internal cells and tuned to
generate the border cells.
Defining the configuration domains appears as shown by
figure 5.

Fig. 5. On the right, view of the different cell types composing the matrix
(border cells, middle cells, IO cells). On the left, configuration domains are
defined as a set of rectangular boxes. They can be reconfigured independently
from each other.

The basic cell schematic view is provided by figure 6.
Ultimately, the full matrix appears as an array of N^2 cells,
as illustrated by the snapshot of the Biniou P&R layer
(figure 7).

Fig. 6. Structure of a basic cell within the RFU matrix.
Fig. 7. Whole view of the RFU's fine-grained reconfigurable matrix.

4) Application synthesis over the RFU: To let students
figure out the benefit of adding the RFU to the processor
design, it is desirable that students can assess and compare
the impact of several options. One classical approach consists
in implementing a DFG to exhibit spatial execution. Another
option lies in implementing combinational operations (such
as a multiplier) instead of performing a loop of processor
instructions (additions, shifts, etc.). In both cases, the RFU
extends the instruction set. Conversely, the underlying
arithmetic can vary while keeping the instruction set stable, but
this goes through either a library-based design or dedicated
synthesizers. Libraries are typically targeted to a reduced set of
pre-defined macroblocks, and they are not easily customizable
to new kinds of functions or use-contexts.
We chose to focus on the second item as this seems to
carry extra added-value compared to classical flows, while
neglecting the need for an extra coding effort. Figure 8
illustrates the Biniou behavioral application synthesizer.

Fig. 8. Specification of a GF(16) adder.

The optimizing context here consists in typing the two
parameters as Galois Field GF(16) values. A so-called
high-level truth table is computed per graph node, whose
values are encoded and binarized before the logic minimization
starts. The result appears as a context-dependent BLIF file.
This BLIF file is further processed by the Biniou P&R layer,
which relies on a pathfinder-like algorithm. As the application
is simple enough to keep the design flat, there is no need for
a floorplanner. However, for modular designs, a TCG-based
floorplanner is integrated within Biniou.
Some constraints are considered, such as making some
locations immutable to conform to the pinout of the adapter
(figure 3). Once the P&R process ends, a bitstream is
generated. Each element of the matrix knows both its state
(used, free, which one out of N, etc.) and its layout structure.
The full layout is obtained by composing these sub-bitstreams
recursively (bottom-up). An interesting point is that the
bitstream structure can vary independently from the
architecture by applying several generation schemes. This is
highly valuable in a partial
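The GF(16) typing mentioned above is what makes the synthesized adder so compact: in GF(2^4), addition is carry-free polynomial addition over GF(2), i.e. a bitwise XOR of the two 4-bit operands. A minimal sketch of the arithmetic and of a high-level truth table like the one computed per graph node (the function name and table layout are illustrative, not Biniou's actual API):

```python
def gf16_add(a: int, b: int) -> int:
    """Addition in GF(2^4): polynomial addition over GF(2),
    which reduces to a bitwise XOR of two 4-bit values."""
    assert 0 <= a < 16 and 0 <= b < 16
    return a ^ b

# A "high-level truth table" over the typed domain: one output
# value per pair of GF(16) inputs, 256 entries in total.
truth_table = {(a, b): gf16_add(a, b) for a in range(16) for b in range(16)}

print(hex(gf16_add(0x9, 0x3)))  # → 0xa
```

Because every element is its own additive inverse (a XOR a = 0), logic minimization over this table yields just four independent XOR gates, which is why the flat design needs no floorplanner.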



reconfiguration scope, when the designer faces several axes
during the architectural prospection phase. In the frame of the
project, an example of bitstream structure is provided by
figure 10.
5) Reconfigurable Functional Unit Integration: The
reconfigurable functional unit (RFU) is composed of three
main components: the reconfigurable matrix (RM) generated
by Biniou, a configuration cache, and the RFU controller, the
latter two hand-written (see bottom right in figure 2).
Configuration is triggered by the processor controller, which
reacts to a SET instruction by sending a signal to the RFU
controller. The RFU controller drives the configuration cache
controller, which provides back on demand a bitstream. The
processor controller gets an acknowledgement after the
configuration completes.
One critical issue about the processor-RFU coupling lies in
data transfers to/from the RFU. Students have to design a
simple adapter which connects a set of the RFU's iopads to the
processor registers holding input and output data (Op1, Op2
and Res in figure 2) with regard to the ones assigned to the
I/O of a placed and routed application (see figure 9). Figure
3 gives a detailed view of the adapter.

Fig. 9. An application placed and routed over the RFU.

IV. RESULTS COMING OUT OF THE PROJECT

A. Environment results
1) Simulation: The simulation environment is ModelSim
[24], as illustrated by figure 11.
The loader module - which loads up the program - was not
provided, but students could easily get one by simply reusing
and adapting the generated testbench. Only 1 group out of five
got it right. This allowed setting a properly initialized state
prior to the start of execution. Of course, this was a critical
issue, and students would have done well to fix it at an early
stage, as tracing values remained the only validation scheme.
This was all the more important as the full simulation took a
long time to complete and a rerun had a real cost for students.
The simulation of the processor itself is time-affordable, but
the full simulation takes around 4 hours, including bitstream
loading and whole testbench program execution.
2) Optimizations: Students came to us with several policies
to speed up the simulation. A first proposal is to let simulation
happen at several abstraction levels, with a high rate of early
error detection. Second, some modules have been substituted
by simpler versions. As an example, by providing a RFU
that only supports 8-bit ADD/SUB operations, the bitstream is
downscaled to 1 bit with no compromise on the architecture
decomposition itself. This approach is very interesting as it
confines changes to the inside of the RFU while still preserving
the API. In addition, it joins back the concern of grain increase
in a general scope (i.e. balancing computation/flexibility
and reducing the bitstream size). This approach must also be
linked to the notion of "mock object" [25] that software
engineers are familiar with when accelerating code testing.
Third, as the application is output as RTL code, this code
can be used as a hard FU instead of a reconfigurable one.
In this way, the students validated the GF-based synthesis.
Combining these last two points, the global design can be
validated very fast, save for the scalability issue. This issue has
been ignored during the project, but is currently being
addressed as the global design is given a physical
implementation.
3) Reports: Students had to provide three reports, one per
milestone. The reports conformed to a common template, and
ranged from 10 to 25 pages each. The last report embedded the
previous ones, so that the final document was made available
straight after the project and students were given another
chance to correct their mistakes.
Some recommendations were mandatory, such as embedding
all images in source format within the package, so that we
could reuse some of them. As an illustration, roughly half of
the figures in this paper come from students' reports.
The students had no constraints on the language, but
some of them chose to hand in English-written reports. We
selected some reports to be published online as examples for
next year's students.
4) Oral defense: The last deliverable was made up of a
report, working VHDL code and an oral defense. Students
had to present within 10 minutes, in front of the group, course
teachers, and a colleague responsible for the "communication
and job market" course.

Fig. 10. Example of a bitstream hierarchical organization. A cell
bitstream (bit positions 0 to 93) is composed of four sub-bitstreams:

  Cell
  |- CLB           (bits 0-16):   LUT (16 bits), LATCH (1 bit)
  |- Block Inputs  (bits 17-28):  MUX0..MUX3 (3 bits each)
  |- Block Outputs (bits 29-33):  T0..T3 (1 bit each)
  |- Switch        (bits 34-93):  PIP0..PIP60 (1 bit each)

Fig. 11. ModelSim simulation.
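The bottom-up composition of sub-bitstreams described above can be sketched as plain concatenation of fixed-width fields. The field names and widths below follow Fig. 10; rendering each leaf as a fixed-width bit string is an assumption about one possible generation scheme, not Biniou's actual encoding.

```python
def compose(fields):
    """Bottom-up bitstream composition: each (name, width, value)
    leaf is rendered as a fixed-width bit string, then concatenated
    in layout order."""
    return "".join(format(value, f"0{width}b") for _, width, value in fields)

# Leaf layout of one cell, after Fig. 10 (widths in bits);
# the field values here are arbitrary placeholders.
clb     = [("LUT", 16, 0xCAFE), ("LATCH", 1, 1)]
blk_in  = [(f"MUX{i}", 3, i) for i in range(4)]
blk_out = [(f"T{i}", 1, 1) for i in range(4)]

# The cell bitstream is the recursive composition of its blocks;
# a full-matrix bitstream would in turn concatenate the cells.
cell = compose(clb) + compose(blk_in) + compose(blk_out)
print(len(cell))  # → 33 bits before the switch-box PIP section
```

Changing the generation scheme (field order, per-domain pages, compression) only touches `compose`, which is the point made in the text: the bitstream structure can vary independently from the architecture.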



Some students chose to center their defense on the project
and the course-versus-project adequation, some others on the
"product", that is, their version of the processor. One confined
its presentation to an introduction to reconfigurable computing
and was considered off-topic.

B. Physical realization
The physical implementation was out of the scope of the
project. Several reasons motivated this choice, the first of
which was timing issues, but also FPGA board availability.
We offered one student to finalize this during a 5-week
individual project. This is still under development by now.
The development platform we use for this demonstrator is a
Virtex-5 FXT ML510 Embedded Development Platform.

V. CONCLUSION

A. Forces
One interesting point regarding this project lies in the
change in the students' feelings. When we first presented
the project, they thought they would never complete the
goals. After the first milestone, one group gave up to avoid
paying the overdue penalty and bounded their work to the
first deliverable. They finally reached 7 points out of 20. The
other groups faced the challenge and discovered that the key
issue lies in getting proper tools to free oneself from manually
developing both the architectures and the application mapping.
The final results were quite acceptable and we collected
several working packages.
With this experience in mind, students are now ready for
entering a very competitive job market, with a deep
understanding of hardware design over reconfigurable
architectures, micro-processors, reconfigurable cross-integration,
and tools & algorithms development.

B. Future evolutions
Obviously, the testbench examples we provided are not
sufficient to practice real metrics-based measurements.
Exploring the benefits of this approach (e.g. measuring
speed-up) requires an easy path from a structured programming
language such as C to processor execution. Hence, a change of
application would require no hand-written adjustments.
From our point of view, such an add-on would be a fruitful
upgrade to the course, and would spawn new opportunities
for cross H/S expertise; keeping in mind that the DOP course
intends to turn out highly trained students sharing skills in
both areas.
Developing a small compiler was out of the scope of this
project due to timing constraints, but remains one hot spot to
be further addressed. This could benefit from some Biniou
facilities such as the C entry that both the logic and CDFG
synthesizers support.
An open option is then to benefit from another course and
invited keynoters to fulfill the prerequisites, so that adapting or
developing a simple C parser becomes feasible in the scope of
our project, at the cost of around an extra week.

REFERENCES
[1] J. M. Filloque, E. Gautrin, and B. Pottier, "Efficient global computations
on a processor network with programmable logic," in PARLE (1),
pp. 69-82, 1991.
[2] L. Lagadec, Abstraction, modélisation et outils de CAO pour les
architectures reconfigurables. PhD thesis, Université de Rennes 1, 2000.
[3] L. Lagadec and D. Picard, "Software-like debugging methodology for
reconfigurable platforms," in Parallel and Distributed Processing
Symposium, International, vol. 0, pp. 1-4, 2009.
[4] R. Schank, tech. rep., Institute for the Learning Sciences (ILS) at
Northwestern University.
[5] P. M. Athanas and H. F. Silverman, "Processor reconfiguration through
instruction-set metamorphosis," IEEE Computer, vol. 26, pp. 11-18,
1993.
[6] R. Razdan, K. S. Brace, and M. D. Smith, "PRISC software acceleration
techniques," in ICCD '94: Proceedings of the 1994 IEEE International
Conference on Computer Design: VLSI in Computers & Processors,
(Washington, DC, USA), pp. 145-149, IEEE Computer Society, 1994.
[7] T. J. Callahan, J. R. Hauser, and J. Wawrzynek, "The Garp architecture
and C compiler," Computer, vol. 33, no. 4, pp. 62-69, 2000.
[8] F. Campi, R. Canegallo, and R. Guerrieri, "IP-reusable 32-bit VLIW
RISC core," in Proceedings of the 27th European Solid-State Circuits
Conference, vol. 18, pp. 445-448, 2001.
[9] S. Vassiliadis, S. Wong, G. Gaydadjiev, K. Bertels, G. Kuzmanov,
and E. M. Panainte, "The MOLEN polymorphic processor," IEEE Trans.
Comput., vol. 53, no. 11, pp. 1363-1375, 2004.
[10] R. Hartenstein, M. Herz, T. Hoffmann, and U. Nageldinger, "Using
the KressArray for reconfigurable computing," in Proceedings of SPIE,
pp. 150-161, 1998.
[11] S. Goldstein, H. Schmit, M. Budiu, S. Cadambi, M. Moe, and R. Taylor,
"PipeRench: a reconfigurable architecture and compiler," Computer,
vol. 33, pp. 70-77, Apr. 2000.
[12] J. Becker and M. Vorbach, "Coarse-grain reconfigurable XPP devices
for adaptive high-end mobile video-processing," in SOC Conference,
2004. Proceedings. IEEE International, pp. 165-166, Sept. 2004.
[13] G. Lu, M. Lee, H. Singh, N. Bagherzadeh, F. J. Kurdahi, and
E. M. Filho, "MorphoSys: a reconfigurable processor targeted to high
performance image applications," in Proceedings of the International
Symposium on Parallel and Distributed Processing, pp. 661-669, 1999.
[14] F. Campi, A. Deledda, M. Pizzotti, L. Ciccarelli, P. Rolandi, C. Mucci,
A. Lodi, A. Vitkovski, and L. Vanzolini, "A dynamically adaptive DSP
for heterogeneous reconfigurable platforms," in DATE'07, 2007.
[15] C. Mucci, C. Chiesa, A. Lodi, M. Toma, and F. Campi, "A C-based
algorithm development flow for a reconfigurable processor architecture,"
in IEEE International Symposium on System on Chip, 2003.
[16] A. DeHon, "DPGA utilization and application," 1996.
[17] X.-P. Ling and H. Amano, "WASMII: a data driven computer on a
virtual hardware," in Proceedings of the IEEE Workshop on FPGAs for
Custom Computing Machines, pp. 33-42, 1993.
[18] E. Caspi, M. Chu, Y. Huang, J. Yeh, Y. Markovskiy, A. DeHon,
and J. Wawrzynek, "Stream computations organized for reconfigurable
execution (SCORE): introduction and tutorial," in Proceedings of the
International Conference on Field-Programmable Logic and
Applications, pp. 605-614, Springer-Verlag, 2000.
[19] Mitrionics, http://www.mitrionics.com/.
[20] S. Shukla, N. W. Bergmann, and J. Becker, "QUKU: a two-level
reconfigurable architecture," in ISVLSI '06: Proceedings of the IEEE
Computer Society Annual Symposium on Emerging VLSI Technologies
and Architectures, (Washington, DC, USA), p. 109, IEEE Computer
Society, 2006.
[21] "Multi-purpose dynamically reconfigurable platform for intensive
heterogeneous processing." http://www.morpheus-ist.org/pages/part.htm.
[22] S. R. Alpert, K. Brown, and B. Woolf, The Design Patterns Smalltalk
Companion. Boston, MA, USA: Addison-Wesley, 1998.
[23] "Madeo-web, the Madeo+ web version."
http://stiff.univ-brest.fr/MADEO-WEB/.
[24] "ModelSim." http://www.model.com/.
[25] D. Picard and L. Lagadec, "Multilevel simulation of heterogeneous
reconfigurable platforms," International Journal of Reconfigurable
Computing, vol. 2009, 2009.



Behavioral modeling and C-VHDL co-simulation of
Network on Chip on FPGA for Education
C. Killian, C. Tanougast, M. Monteiro, C. Diou, A. Dandache
LICM, University Paul Verlaine of Metz, Metz, France
{Cedric.Killian, Camel.Tanougast}@univ-metz.fr

S. Jovanovic
IADI, University Henri Poincaré, Nancy, France
s.jovanovic@chu-nancy.fr

Abstract—In this paper, we present a behavioral modeling and
simulation of a Network on Chip (NoC), some parts being
implemented on an FPGA education board. This modeling can
be effectively used by students for educational purposes. Indeed,
experience proves that behavioral modeling and simulation of
NoC concepts help students to understand the MPSoC design
approach and make more practical and reliable designs related
to NoC. We have chosen C-VHDL co-simulation for educational
purposes; all design steps are given in the study. The experience
covers NoC concepts for MPSoC design from postgraduate
education to Ph.D.-level education. For each targeted audience,
different properties of the system can be emphasized. Our
approach to embedded systems education is based on classic
referenced research findings.

Keywords: NoC, C-VHDL co-simulation, SoC design,
education purpose.

I. INTRODUCTION
Given the evolution and the increasing complexity of
Systems on Chip (SoC) toward Multiprocessor Systems on
Chip (MPSoC), the communication interconnect between the
modules or Intellectual Property (IP) blocks constituting these
systems has undergone a topological and structural evolution.
Actually, the trend is moving towards the full integration of a
Network on Chip to implement the transmission of data packets
among the interconnected nodes, which are computing modules
or IPs (processors, memory controllers, and so on). Fig. 1
illustrates this trend of on-chip interconnection.
In this study, we present a behavioral modeling and
simulation of the Network on Chip (NoC) communication used
to overcome some difficulties encountered in understanding the
concepts of Multi-Processor System on Chip (MPSoC)
communication design. Indeed, we observed that students
learning prototyping based on FPGA education boards had
some difficulties and were wasting time on the embedded
communication design linking the computing nodes of the
MPSoC in the same chip. To overcome this problem, a NoC
behavioral modeling and simulation was designed, some parts
being synthesized on an FPGA board used at our laboratory.
This functional modeling and simulation was prepared as an
electronic design example of the configurable on-chip
communication part in the embedded-systems course plan. The
NoC modeling and behavioral simulation was designed with
C-VHDL co-simulation in the ModelSim tool environment for
educational purposes. The setup demonstrated in this paper is
suited for introductory modeling and verification of the
designed NoC for MPSoC design based on FPGA technology.
More precisely, this paper introduces an initiation educational
project of modeling and functional simulation of a mesh NoC
[1], led by postgraduate students of the Master 2 RSEE
(Radiocommunications and Embedded Electronic Systems)
specialty of the Paul Verlaine University of Metz.
The paper outline is as follows. Pre-design studies are
introduced in Section II. We start, in this section, with a short
technical overview (from an educational point of view) of the
modeling of the NoC concepts and the NoC model of the case
study. We cover key properties of NoC and relate them to the
training of students. In Section III, all modules of the modeling
and functional simulation are introduced separately in detail;
waveform simulation reports from the test example and
advantages of the behavioral simulation are assessed in terms
of practicality, time saving and reliability. Finally, Section IV
concludes the education experiences of the paper.

Figure 1. Interconnection evolution in Systems on Chip: point-to-point
connections (1990), shared bus (1995), hierarchical shared bus (2000),
crossbar bus (2005), Network on Chip (2010).



II. MODELING AND SIMULATION OF NOC: CASE STUDY
The network transmission is realized through the routers
constituting the network, using switching techniques and
routing rules [1]. In this context we propose an initiation
project to the RSEE students using the ModelSim hardware
simulator tool [2].
We propose to the students the modeling of a NoC switching
data packets constituted of messages, based on a mesh topology
of configurable size (3x3, 6x6 and 10x10), wormhole switching
rules and standard XY routing rules [1]. Fig. 2 illustrates the
NoC to be designed by the students. It is a mesh network
where each router is associated with a Processing Element
(PE). Each PE-router pair possesses a specific address and can
emit messages through the network.

Figure 2. Illustration of the proposed nxn mesh NoC to model: an nxn
array of routers R_ij (i, j giving the router address), each connected to its
neighbors and to a local node N (a module such as a processor, IP,
memory, etc.).

Each packet is constituted of flits (flow control units)
corresponding to data words of fixed size. We define the phit
(physical unit) notion, corresponding to the information unit
able to be emitted in one cycle through a physical channel of
the network.
1) Router behavioral VHDL modeling: The students model
a VHDL behavioral description of a NoC router characterized
by data packets composed of 5 flits. The phit size corresponds
to the channel size. The router input buffer is equal to one flit
size. The router specifications are described as follows:
- Data packets composed of 5 flits of 9-bit size (5 x 9 bits)
- Data channel of 9 bits (phit size)
- One buffer of 2 flits depth (2 x 9 bits) at each router input
- An internal switching logic based on a crossbar
- 4 data transmit directions (East, West, South and North)
- Wormhole switching rules: the connections between the
inputs and the direction outputs of the routers are maintained
until all the packet flits have been sent.
Fig. 3 gives the description of the specification of one NoC
router. The global NoC is then made by a structural VHDL
description realized by router module instantiations in order to
design a NoC with a configurable size of 3x3, 6x6 or 10x10
(see Fig. 2).

Figure 3. Illustration of the structural architecture of one NoC router:
buffers on the local PE port (Pe_in/Pe_out) and on the North, West, East
and South directions, a routing and priority module, and a crossbar switch.

2) Routing algorithm: The algorithm proposed to the
students is a static XY routing algorithm. Thus, the data
packets are routed toward the destination PE through the
network routers first along the X axis, and then along the Y
axis. The XY routing is realized as a VHDL functional
description of the following algorithm:
- If X_router < X_destination, packet direction = Direction_East
- If X_router > X_destination, packet direction = Direction_West
- If X_router = X_destination and Y_router > Y_destination,
packet direction = Direction_South
- If X_router = X_destination and Y_router < Y_destination,
packet direction = Direction_North
- If X_router = X_destination and Y_router = Y_destination,
packet direction = Local_PE
Fig. 4 illustrates the XY routing of data packets between
PE_00 and PE_22. An erroneous description of this algorithm
can lead to deadlock or livelock of data packets [1] during
the behavioral simulation steps of the modeled NoC. The
elaboration of the priority rules for the data packets in the
routers is defined and designed by the students. These rules
need to be associated with the routing algorithm, the number of
flits constituting the data packets and the wormhole switching
rules. The goal is to lead students to use an iterative and
progressive description of the routing and priority rule module,
gradually highlighting, in the modeling and simulation phases,
situations of temporary deadlock and starvation, and their
impact on the data packet transmission latency.
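The five routing rules above can be transcribed directly into a per-hop decision function. The sketch below is an illustrative software model, not the students' VHDL; note that mapping the Y comparisons to North/South follows the listing literally, while the grid orientation itself is an address convention.

```python
def xy_route(router, destination):
    """One decision of the static XY algorithm: correct the X
    coordinate first, then the Y coordinate, else deliver locally."""
    xr, yr = router
    xd, yd = destination
    if xr < xd:
        return "East"
    if xr > xd:
        return "West"
    if yr > yd:
        return "South"
    if yr < yd:
        return "North"
    return "Local_PE"   # packet has reached its destination PE

# Full path of a packet from PE_00 toward PE_22, one hop per decision.
pos, hops = (0, 0), []
while True:
    d = xy_route(pos, (2, 2))
    hops.append(d)
    if d == "Local_PE":
        break
    pos = (pos[0] + {"East": 1, "West": -1}.get(d, 0),
           pos[1] + {"North": 1, "South": -1}.get(d, 0))
print(hops)  # → ['East', 'East', 'North', 'North', 'Local_PE']
```

Because the X coordinate is always corrected before the Y coordinate, every packet follows a single deterministic L-shaped path, which is what makes XY routing deadlock-free on a mesh when correctly described.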



Figure 4. Illustration of the XY routing algorithm on a 4x4 mesh: a
packet travels from the PE at address 00 first along the X axis, then along
the Y axis, to the destination PE at address 22.

III. C-VHDL CO-SIMULATION
From a description in C language modeling the data packet
transmission and reception by all the PEs associated with each
NoC router, the students realized a detailed NoC VHDL
behavioral description (routers, control data flow, routing
logic, etc.). From this hardware language description, a
validation via C-VHDL co-simulation using the VHDL FLI
(Foreign Language Interface) of the ModelSim tool is
performed [3]. The NoC design and simulation steps developed
by the students correspond to a VHDL description associated
with a C/C++ program modeling all the PEs providing emission
or reception of the data packets transmitted in the modeled
NoC. Fig. 5 depicts the interface and the association of the
co-simulation in the ModelSim environment. From files
containing the data packets to be sent in the NoC by some PE
transmitters to PE receivers, an interface between the NoC
VHDL description and the C/C++ functions modeling the
packet transmission or reception for each PE is executed in
ModelSim.

Figure 5. C-VHDL co-simulation in the ModelSim environment: the PE
models (PEs.C, compiled to a DLL) are bound through PEs.vhd to the NoC
description (NoC.vhd) inside the testbench (Testbench.vhd).

In fact, the PEs are modeled in the simulation through a DLL
compilation of the C/C++ function modeling each PE of the
NoC. This DLL is defined as the architectural part of the PEs
in the VHDL description of the simulated NoC in ModelSim.
Fig. 7 shows the integration of the C/C++ functional modeling
of the PEs in the VHDL description of the NoC.
Fig. 6 presents the file contents associated with each PE,
containing the packets (of 5-flit size) to transmit. The first flit
corresponds to the source address of the emitting PE. The
second one is the address of the destination PE. The next flits
are the data to transmit between two PEs of the network.

Figure 6. Example data packets to be transmitted by PE address 00.

During the co-simulation phases, students become aware of
the deadlock, livelock and starvation risks of data packets in
the modeled NoC [1]. These risks are mainly due to the limited
resource sharing and the resource access rules. They are
usually studied and analyzed in NoC design suitable for the
conception of specific MPSoCs. Fig. 8 and 9 give examples of
co-simulation results in ModelSim. Fig. 8 presents the data
packets received through wormhole switching using the XY
routing in the switch associated with PE_54. Fig. 9 shows the
reception of the data packets by PE_54. This co-simulation
allows a rapid behavioral validation of the designed NoC while
highlighting its characteristic performances, such as the latency
and throughput notions. Fig. 10 gives an example of files
generated by the simulation, giving the latency results in terms
of the cycle count of the packets transmitted in the NoC by an
emitter PE toward a destination PE.

IV. CONCLUSION
This paper proposes, for educational purposes, the modeling,
simulation and performance evaluation of an on-chip network
communication architecture using C-VHDL co-simulation with
ModelSim. This approach overcomes some understanding
difficulties encountered during the teaching of the
communication concepts in Multi-Processor System on Chip
(MPSoC) design. It alerts the students to the fundamental
impact of the interconnect in MPSoC to meet the required
performances. The data packet transmission modeling and
simulation is described in C/C++, allowing the rapid simulation
required to validate the functionality and the performances of a
Network on Chip. The routers are described in VHDL, aiming
at synthesis towards FPGA technology. The educational
purpose is to provide the students with the fundamental
concepts in the design and development of on-chip
interconnect, a significant key of MPSoC performances.
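The packet format of Fig. 6 (flit 1 = source PE address, flit 2 = destination address, the remaining flits = payload, each flit one 9-bit phit) can be sketched as a small stimulus-file generator. The function names and the one-flit-per-line binary file layout are illustrative assumptions; the paper does not show the exact syntax read by the C model.

```python
FLIT_BITS = 9          # phit = channel width = 9 bits
FLITS_PER_PACKET = 5   # 1 source flit + 1 destination flit + 3 data flits

def make_packet(src, dst, payload):
    """Build one packet as a list of 9-bit flits:
    [source address, destination address, data0, data1, data2]."""
    assert len(payload) == FLITS_PER_PACKET - 2
    flits = [src, dst, *payload]
    assert all(0 <= f < 2 ** FLIT_BITS for f in flits)
    return flits

def dump(packets):
    """Render packets as binary text, one flit per line, the way a
    stimulus file for an emitting PE's C model might look."""
    return "\n".join(format(f, "09b") for p in packets for f in p)

# A packet from the PE at address 00 to the PE at address 22,
# with three arbitrary payload words.
pkt = make_packet(0b000000000, 0b000100010, [7, 42, 255])
print(len(pkt))  # → 5
```

Generating these files programmatically lets students vary traffic patterns (and so provoke the contention, starvation and latency effects discussed above) without editing stimulus files by hand.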



Figure 7. FLI binding of the abstract PE model (DLL) in the VHDL description.

Figure 8. Simulation results of the data packets transmitted in the modeled NoC, based on wormhole switching and the XY algorithm.

Figure 9. Simulation results of the data packets received by PE_54.



Figure 10. Latency results, in terms of clock simulation cycles, of the data packets transmitted by PE_00.

This experience covers the major NoC concepts for MPSoC
design from postgraduate education to Ph.D.-level education.
Moreover, it shows the usefulness of a tool like ModelSim in
the validation of a functional architecture in its working
environmental context through co-simulation. This lab is
complementary to the teaching of high-level system modeling
languages such as SystemC. Indeed, the increasing complexity
of Systems on Chip relies more and more on specific tools
whose mastery is fundamental for microelectronic design
education.

REFERENCES
[1] G. De Micheli and L. Benini, Networks on Chips: Technology and
Tools, Morgan Kaufmann Publishers, 2006.
[2] Mentor Graphics, "ModelSim SE User's Manual, Software Version
6.4", 2008. http://www.mentor.com/.
[3] Mentor Graphics, "ModelSim Foreign Language Interface, Version
5.6d", August 2002.



Experimental Fault Injection based on the
Prototyping of an AES Cryptosystem
Jean-Baptiste Rigaud∗, Jean-Max Dutertre∗, Michel Agoyan†, Bruno Robisson†, Assia Tria†
∗École Nationale Supérieure des Mines de Saint-Étienne, †CEA-LETI,
SAS Department,
Centre Microélectronique de Provence Georges Charpak,
880 avenue de Mimet, 13541 Gardanne, FRANCE
Email: ∗name@emse.fr, †firstname.name@cea.fr

Abstract—This paper presents a practical work for Masters
students in Microelectronics Design with optional modules in
cryptography and secured circuits. This work targets the design
and prototyping of the Advanced Encryption Standard algorithm
on a Spartan 3 Xilinx platform, and shows how fault injection
techniques can jeopardize the secrecy embedded within a design
dedicated to security using simple equipment: a fault injection
platform based on the use of an embedded FPGA's Delay
Locked Loop.

I. INTRODUCTION
This paper proposes a wide-spectrum practical work for
Masters students in Microelectronics Design with optional
modules in cryptography and secured circuits: the development
of an experimental fault injection based on the prototyping of
an AES cryptosystem. The main goal is the application of
academic courses around VHDL, design methodology, FPGA
prototyping, cryptography and security of integrated circuits.
The work has two parts. The first one is the VHDL
implementation of a standard symmetric-key algorithm: a
128-bit AES. Then, this cipher block is prototyped on a
Spartan 3 development board [12]. A serial communication
interface completes the design. It allows communication
between a PC and the Xilinx board using ad hoc commands.
Lastly, the automated generation of test programs for this
environment is addressed. All this part is mainly composed
of lab work.
The second part of the course is dedicated to the design
of a fault injection platform. It includes both lectures and
laboratory work. The lectures consist in introducing the theory
of digital ICs' timing constraints, Differential Fault Analysis
(DFA) and the use of Xilinx FPGAs' Digital Clock Managers
(DCM). Then, during the laboratories, the students apply these
principles to designing a fault injection platform (implemented
on a Xilinx Virtex 5 demo board) and performing fault
injection experiments by various means on the previously
designed AES. This course part is devoted to making the
students aware of fault attacks against cryptosystems.

II. THE ATTACKED CIRCUIT: AES CRYPTOSYSTEM
This first part focuses on the description of the cryptosystem
to be designed. Second, the AES VHDL modeling and
prototyping is presented. Then, the AES test environment is
extended with the insertion of a UART interface. Finally, a test
pattern generation program is proposed using the Perl
programming language.

A. Advanced Encryption Standard Algorithm
The Rijndael block cipher [4] was standardized as the
Advanced Encryption Standard (AES) by the National Institute
of Standards and Technology (NIST) in 2001 [8]. It replaces
the Data Encryption Standard (DES) for symmetric-key
encryption.

Fig. 1. AES algorithm: an initial AddRoundKey (round 0, RoundKey 0);
rounds 1 to 9, each applying SubBytes, ShiftRows, MixColumns and
AddRoundKey (RoundKey i); and round 10, applying SubBytes, ShiftRows
and AddRoundKey (RoundKey 10) to produce the cipher text from the
plain text.

It is a Substitution-Permutation Network (SPN) block cipher
whose operations are based on binary extension fields. It
processes a 128-bit plaintext and a key of 128, 192 or 256 bits
to produce a 128-bit ciphertext. From now on, we only
consider a 128-bit-key AES. The AES encryption algorithm is
divided into two processes: the "data path" and the "key
schedule". The words processed by these two parts are two
4x4 matrices of bytes called "states".
1) Data path encryption process: AES has an iterative
structure (Fig. 1). The data path is a sequence of ten sub-
processes called "rounds". A regular round is composed of the

ReCoSoC 2010 May 17-19, 2010, Karlsruhe, Germany 141


four transformations called "SubBytes", "ShiftRows",
"MixColumns" and "AddRoundKey". Before the first round, the
original key and the plaintext are added. The last round uses
all but the MixColumns transformation.

The SubBytes transformation substitutes each byte of the state
according to a table called the "Sbox". The ShiftRows
transformation is a cyclic permutation of the state rows
(Fig. 2): row i is rotated left by i bytes.

[Fig. 2. ShiftRows transformation]

MixColumns is a matrix product of the current state S by a
constant matrix C over GF(2^8) (Fig. 3).

[Fig. 3. MixColumns transformation: the state is multiplied by
the circulant matrix whose first row is (2, 3, 1, 1).]

The AddRoundKey transformation adds (using a bitwise XOR) the
round data to the round key produced by the KeySchedule.

2) KeySchedule Process: Each round key is derived from the
previous one through the operation described in Figure 4. The
first round key (RoundKey 0) is the secret key. For each round,
two transformations, "RotWord" and "SubBytes", are applied to
the last column of the round key state. This column is then
added to a round constant ("RCon"). The next round key state is
obtained column-wise: its first column is the result of a XOR
between this modified column and the former first column. Each
following column of the former state is added to the column
computed just before.

[Fig. 4. Key scheduling]

B. AES Design

The AES is a good example for teaching integrated circuit
design: it involves many architectural dilemmas, with very
interesting issues such as the synchronization of the data path
and the key expansion, or the Sbox modeling. This task provides
the students with the opportunity to apply the whole FPGA design
flow by themselves.

1) Specifications and framework: In addition to the AES
standard, the following specifications are given to the
students:
• The inputs and outputs are 128 bits long.
• "start" and "done" signals indicate the beginning and the end
of each encryption.
• An asynchronous active-low reset initializes the whole
circuit.
• The nominal clock frequency is 100 MHz.
• A complete ciphering has to be done in eleven clock cycles.
• There is no area constraint.

The following design constraints are directly linked to the
evaluation environment described in the next section. The AES
clock is also provided by the external environment and connected
to the second FPGA board (cf. III). The "start" signal is a
trigger for the clock fault generator (Fig. 11). The target is a
Spartan-3AN (XC3S700AN) evaluation board. All the circuits and
test benches are modeled in VHDL. The simulations (at
functional, post-synthesis and post-place-and-route levels) are
performed with ModelSim from Mentor Graphics. The synthesis,
place and route, and bitstream generation are performed with the
Xilinx ISE development suite.

2) AES Hardware Description: The AES circuit is composed of
three parts (Fig. 5): the "DataPath", which encompasses the four
transformations; the "KeyExpander"; and the "StateController",
the finite state machine sequencing the entire algorithm. The
StateController drives enable signals that activate each part of
the circuit. Multiplexers and demultiplexers are inserted in
order to transfer and select data between rounds (they are also
controlled by this sub-circuit). The KeySchedule computes the
round keys on the fly, and each round is computed within one
clock cycle. Data is latched just before the S-Boxes (both in
the data and key paths).

[Fig. 5. AES block diagram]

The timing constraint of 11 clock cycles per ciphering means
that 16 S-Boxes are needed for the SubBytes operation



and 4 for the KeyExpansion. Several solutions exist to design
S-Boxes ([11], [7]). Here we choose the look-up table approach.
It is also possible to use integrated block RAM (BRAM)
configured as ROM, depending on the FPGA target chosen. The
latter approach allows the students to discover a way to
generate memory blocks, for example with the Xilinx Coregen
application.

C. Evaluation Environment

One of the goals of this course is to give concrete lab work in
design debugging and design testing. Once the AES modeling is
completed and all the simulations work fine, it is important to
test it in a real environment. Here, a dedicated evaluation
environment between a PC and the Spartan-3 board is proposed to
the students. A serial communication link based on the RS232
protocol is used. Figure 6 shows the communication scheme
between the PC and the AES block.

1) A Simple Communication Protocol: The aim of this part is to
perform ciphering, i.e. to program the key, provide a plaintext
and retrieve the ciphertext. The following commands are used:
• m or M followed by 64 ASCII characters followed by "EOL":
send a 128-bit message;
• k or K followed by 64 ASCII characters followed by "EOL":
send a 128-bit key;
• g or G followed by "EOL": stands for go, start the encryption.

The interface has to acknowledge the M and K commands by sending
the two-character string OK. The go command is acknowledged by
the 32 hexadecimal characters of the result followed by OK. The
same computation can be done several times, where only the key
or the message is changed between two AES executions. All these
commands are sent over the RS232 port via a serial terminal. The
automation of the ciphering will be described in subsection
II-C3.

2) UART Hardware Implementation: A light UART ensures the
communication between the PC and the AES. It only takes the baud
rate into account. Each frame is 8 bits long with one stop bit.
Other features such as parity checking are not considered here;
this part could be improved in the future. This entity is
composed of 4 blocks:
• Baud generator: generates the correct local clock frequency
depending on the baud rate. The baud rate is initially
controlled by external on-board DIP switches but can be
hard-coded as well.
• Transmitter: sends serialized data to the PC to acknowledge a
command or to send back the ciphertext.
• Receiver: receives and deserializes data from the PC.
• Controller: a finite state machine dedicated to decoding the
received commands, controlling the AES and sending results and
acknowledgements back to the PC.

This communication interface is designed within the same AES
design flow. Once the bitstream is loaded into the FPGA, the
first step in the debugging process is to send a command from a
serial terminal and wait for the answer ("OK") from the FPGA.
Then test patterns provided by the NIST are used.

[Fig. 6. AES test environment]

3) An Automated Ciphering: Once the communication between the
AES and the PC is validated, this first course part ends with
the generation of test programs. The goal here is to create
scenario files based on the communication commands (cf. II-C1).
After a quick overview of the Perl programming language [9],
students are asked to generate multiple encryption scenarios for
the FPGA. They use features like simple text handling and easy
configuration of the serial port (open, close, baud rate, etc.).
Finally, with "Crypt::OpenSSL::AES" [3], a Perl wrapper module
of OpenSSL's AES library, they can compare, on the fly, the
received ciphertexts with the expected results.

With this we conclude the first part of the course. It has
presented the AES algorithm, its design and prototyping on a
Xilinx development board, the integration of a communication
interface (UART block) and the automation of AES executions with
an external PC and Perl programming. This last step is very
important for the experimental fault injection in the next part
of the article.

III. DESIGN AND USE OF AN FPGA-BASED ATTACK PLATFORM

A. Overview of the course

This part of the course includes both lectures and laboratory
work. The lectures introduce the theory of the timing
constraints related to the synchronous operation of digital ICs,
the concept of Differential Fault Attack (DFA) applied to the
Advanced Encryption Standard (AES) algorithm, and the use of an
FPGA's on-board delay locked loop (DLL) to design an attack
platform. The laboratory work is divided into two parts: the
synthesis and test of the modified clock signal intended to
inject faults, and a fault injection experiment on our AES
board, as designed in section II. After completing this course,
students will be aware of the threats that fault injection
techniques pose to the physical implementation of cryptographic
algorithms.

B. Theoretical work

This subsection introduces the theoretical background used for
fault injection purposes.
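Before detailing this background, the setup-time budget at its core can be sketched numerically. The code below is an illustration only: every delay value in it is an assumed figure, not a measurement from the course boards.

```python
# Setup-time budget of a register-to-register path, as used to
# reason about clock-glitch fault injection. All times are in
# picoseconds and are illustrative assumptions.

def data_arrival(t_clk_to_q: float, t_prop: float) -> float:
    """Time at which data reaches the capturing flip-flop input."""
    return t_clk_to_q + t_prop

def data_required(t_clk: float, t_skew: float, t_setup: float) -> float:
    """Latest arrival time that still satisfies the setup time."""
    return t_clk + t_skew - t_setup

def is_faulted(t_clk, t_prop, t_clk_to_q=500.0, t_skew=0.0, t_setup=50.0):
    """True when the path violates the setup constraint."""
    return data_arrival(t_clk_to_q, t_prop) > data_required(t_clk, t_skew, t_setup)

# A 10 ns clock comfortably covers a 9.2 ns critical path...
print(is_faulted(t_clk=10_000, t_prop=9_200))   # False
# ...but shortening one period to 8.5 ns injects a fault.
print(is_faulted(t_clk=8_500, t_prop=9_200))    # True
```

The clock-glitch attack described later shortens a single clock period by Δ, which corresponds exactly to the `t_clk` reduction modeled here.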



1) Digital IC timing constraints: Almost all digital ICs work
according to the principle of synchrony: their internal
computations are sampled by a global clock signal (asynchronous
circuits are out of the scope of this course). Figure 7
symbolizes the internal architecture of any synchronous circuit:
combinational logic surrounded by registers (i.e. banks of D
flip-flops). The data are released from the first register bank
on a clock rising edge, processed by the logic, and latched into
the next bank on the next clock rising edge. It takes a certain
amount of time, called the propagation delay, to process the
data through the logic. As a consequence, the time between two
rising edges (i.e. the clock period) depends on the propagation
delay.

[Fig. 7. Internal architecture of digital ICs]

More precisely, to operate without any computation error, the
data must arrive at the second register's input before a
required time. The data arrival time is expressed in equation 1,
where t_clk_to_Q is the latency between the clock's rising edge
and the arrival of the data on the register output Q, and
t_Max(prop delays) is the biggest propagation delay through the
combinational logic (namely its critical time).

    t_data_arrival = t_clk_to_Q + t_Max(prop delays)    (1)

Equation 2 gives the required time before which the data must be
present. This is the sum of the clock period (T_clk) and a time
skew (T_skew), which reflects the clock propagation time between
the two registers, minus the register's setup time (δ_setup),
the setup time being the amount of time for which the D
flip-flop input must be stable before the clock's edge to ensure
reliable operation.

    t_data_required = T_clk + T_skew − δ_setup    (2)

Hence, the timing constraint (Equation 3) is derived from the
two previous equations:

    T_clk > t_clk_to_Q + t_Max(prop delays) − T_skew + δ_setup    (3)

The violation of this timing constraint results in computational
errors. This principle is well known and frequently used as a
fault injection means to attack secure circuits [1].

Moreover, the faults' locations are linked to the propagation
delays. Consider a combinational logic block implementing a
given logical function (depicted in figure 8). Each output
possesses its own propagation delay: t_prop_delay(i).

[Fig. 8. Propagation delay]

A fault will then be injected at any place (output D_i) where
the propagation delay is large enough to induce a timing
constraint violation, as expressed in equation 4:

    t_prop_delay(i) > T_clk + T_skew − t_clk_to_Q − δ_setup    (4)

Besides, these delays depend on the handled data. In other
words, each propagation delay changes with the data, and so do
the faults' locations (we will return to this point in
subsection III-C2). In addition, the propagation delays also
depend on the power supply voltage and the chip's temperature.

2) Differential Fault Attack on AES: Fault attacks consist in
injecting faults into the encryption process of a cryptographic
algorithm through unusual environmental conditions. They may
result in reducing the ciphering complexity (a round-number
reduction for example, see [2]), or the injected faults may
allow an attacker to gain some information on the encryption
process by comparing the correct and the faulty ciphertexts.
This technique is called Differential Fault Attack (DFA). An
extended explanation of DFA theory (as given during our
lectures) would be too long and of little relevance to this
paper; the reader can find a complete description of two major
DFA schemes in [10] and [6].

These attacks allow an attacker to retrieve the secret key used
by the AES cryptosystem. The key point is that the fault
injection means must allow the attacker to control precisely:
• the exact time of fault injection (i.e. only for a given and
precise round),
• the number of faults (i.e. to limit the fault locations to a
byte or even to one bit).
Otherwise, it will be impossible to extract any information
related to the encryption key.

3) FPGA-based attack platform: A basic approach to injecting
faults through timing constraint violation (as shown in
subsection III-B1) is overclocking. It consists in decreasing
the clock's period until faults appear by setup time violation.
However, it provides no timing control: faults are injected at
each clock cycle, which is not suitable for DFA. To overcome
this, we choose local overclocking (or clock glitching), which



is based on inducing a timing violation by modifying only one
clock period. The corresponding modified clock signal is denoted
clock′ in Figure 9. Moreover, it offers our students the
opportunity to use and dynamically reconfigure the Delay Locked
Loop (DLL) embedded in the Digital Clock Managers (DCM) of the
Xilinx Virtex-5 FPGA [5].

[Fig. 9. Faulty clock signal generation]

The modified clock signal, clock′, is built from a correct one,
clk, as depicted in Figure 9. The DLL allows delaying an input
clock signal by a programmable duration. A first clock delayed
by Δ/2, clock↓, is derived from the input clock to produce the
falling edge of the modified cycle, and a second one delayed by
Δ, clock↑, to produce its rising edge. They are then combined to
obtain the modified clock signal, clock′, used for fault
injection purposes. As a consequence, the duration of the
corresponding clock period is decreased by an amount of time Δ.
A trigger signal is used to indicate the location of the
modified cycle. This technique allows choosing the fault
injection cycle. Furthermore, the ability to set Δ precisely
enables fine control over the number of faulted bits (the
experimental results reported in subsection III-C2 demonstrate
the ability to inject one-bit faults).

C. Laboratory work

The laboratory work addresses both the VHDL description and
synthesis of the fault injection platform described in
subsection III-B3 and the implementation of the fault injection.

1) Synthesis: This experimental work takes place after the
lectures described in subsection III-B and the VHDL synthesis of
the AES cryptosystem described in section II. The students have
to complete a very concisely stated task: design a fault
injection platform based on local overclocking, as described in
the corresponding lecture (see subsection III-B3), on a Xilinx
Virtex-5 demo board [13]. The following guidelines are given:
• use a dynamic configuration mode to set Δ via a serial
communication port,
• reuse and adapt the IP already developed in section II-C to
implement the communication protocol,
• set the CLOCKOUT_PHASE_SHIFT attribute of the DLL to
VARIABLE_POSITIVE to obtain an elementary variation step, δ_t,
equal to 35 picoseconds for Δ,
• use a nominal clock frequency of 100 MHz (which is consistent
with the AES test chip).

No instructions are given for the design of the combinational
logic block devoted to obtaining the faulty clock (clock′ in
figure 9) from the DLL clock signals and the trigger signal
issued by the AES chip. In addition, the students need to
discover by themselves that a programmable counter is required
to control the instant the faulty period is generated, in order
to be able to target any of the AES rounds.

The achievement of this work is validated by measuring the
modified clock signal for different settings of Δ and of the
targeted rounds. Figure 10 illustrates an oscilloscope screen
shot typically asked for validation, where Δ is set to 1.8
nanoseconds and the modified period is located during the ninth
round of the AES.

[Fig. 10. Faulty clock (uppermost) and trigger signal
(lowermost)]

2) Fault injection experiments: The main part of the laboratory
work is focused on fault injection experiments. A first part is
dedicated to fault injection using local overclocking and to
controlling the injection process precisely. A second part then
addresses the ability to inject faults by modifying the power
supply voltage and the temperature of the chip. The experimental
setup is depicted in Figure 11.

[Fig. 11. Experimental setup]

The first test campaign is run as described in Algorithm 1,



where δ_t is the elementary variation step of Δ.

Algorithm 1 Test Campaign Pseudo-Code
  send the key K and the plaintext M to the test chip
  Δ ← 0
  while (clock period > Δ) do
    encrypt and retrieve the ciphertext
    Δ ← Δ + δ_t
  end while

It is carried out to illustrate the fault injection process. The
final round of the AES is targeted to allow a direct reading of
the errors by direct comparison between correct and faulty
ciphertexts. The AES plays the role of a big propagation delay.
The bar chart of Figure 12 shows the faults' timing and nature
as a function of the faulty period's duration (the horizontal
axis). It uses a color code to reflect the nature of the faults
(no fault, one-bit, two-bit and more faults) and their time of
appearance. As expected, the comparisons between correct and
faulty ciphertexts reveal that the device progressively transits
from normal operation to multi-bit faults, exhibiting
successively none, one-bit, two-bit and multiple-bit faults.

[Fig. 12. Fault injection as a function of faulty period
duration]

These statistics were obtained thanks to the very small
granularity of the faulty period, namely 35 ps. This allows
injecting a one-bit fault at every round of the ciphering
process with a high degree of confidence, which is a requirement
for many DFA methods.

The next point is about controlling the faults' location. The
second bar chart, in Figure 13, is obtained with the same
experimental protocol and the same secret key but with a
different plaintext. The first injected fault is a single-bit
fault on byte number three for a clock period equal to 7585 ps,
whereas the first one-bit fault injected in the previous
experiment was on byte thirteen for a clock period equal to
7340 ps. This confirms, together with many other experimental
results, that the critical time's location and value vary with
the data (as stated in III-B1). This experiment is carried out
to illustrate the ability to change the fault location by
changing the plaintext.

[Fig. 13. Fault injection for a different plaintext]

Another means of injecting faults is to increase the
combinational logic's propagation delays until faults appear due
to timing constraint violation, according to equation 3. As
suggested in III-B1, this is achieved by decreasing the device's
power supply voltage (V_DD) or increasing the chip's
temperature. The students have to carry out these experiments
and to report the corresponding critical time as a function of
V_DD and of the temperature. Figures 14 and 15 illustrate the
corresponding typical results.

[Fig. 14. Fault injection based on power supply decrease:
measured critical path (ps) versus V_DD (V), from 1.2 V down to
0.9 V]

[Fig. 15. Fault injection based on temperature increase]

As expected, as V_DD decreases, the critical time increases.
When it crosses the clock period (T_clk) value (namely
10,000 ps) at V_DD = 1.07 V, the first fault appears.



Similarly, an increase in the chip's temperature results in a
linear increase in the critical time (over the variation range).
Faults are injected for temperatures above 210 °C, when the
critical time goes beyond the nominal clock period.

The objective of these laboratory experiments is to make the
students aware of fault injection means based on setup time
violation and of the ability they offer to control the faults'
sizes and locations. As a result, they will take care of the way
they design cryptosystems in their professional life or in
further research activities.

IV. CONCLUSION

This paper presents an ambitious two-in-one course. It mainly
targets Masters students in Microelectronics Design with
optional modules in cryptography and secured circuits. It
comprises two parts.

The first one presents the full design of an AES cryptosystem.
The students start from specifications and in the end test their
prototyped circuit in a complete test environment. They have to
implement the communication interface and to generate all the
test sequences. The Xilinx Spartan-3 platform is used for the
practical experimentation. Possible evolutions include modifying
the data path length, exploring different architectures for the
Sbox or completing the UART interface. The whole design can also
be split among smaller student groups.

The second part is mainly dedicated to the design and use of an
FPGA-based attack platform. First, lectures introduce the theory
of fault injection through timing constraint violation, the
basics of DFA and the use of a DLL to build a modified clock
signal for fault injection. Then, laboratory sessions allow the
students to become familiar with the practice of fault injection
and the corresponding threats to secure ICs. This second part is
an optional extension of the first one; however, it could be
taught independently by providing an already programmed AES test
board. This part is also used as an introductory course on IC
security for the PhD and internship students in our research
group.

REFERENCES
[1] Hagai Bar-El, Hamid Choukri, David Naccache, Michael
Tunstall, and Claire Whelan. The sorcerer's apprentice guide to
fault attacks. Proceedings of the IEEE, 94(2):370-382, 2006.
[2] Hamid Choukri and Michael Tunstall. Round reduction using
faults. In Proc. Second Int'l Workshop on Fault Diagnosis and
Tolerance in Cryptography (FDTC '05), 2005.
[3] CPAN. http://search.cpan.org/~ttar/Crypt-OpenSSL-AES-0.01/lib/Crypt/OpenSSL/AES.pm.
[4] Joan Daemen and Vincent Rijmen. The Design of Rijndael.
Springer, 2002.
[5] Bernhard Fechner. Dynamic delay-fault injection for
reconfigurable hardware. In Proc. IEEE International Parallel
and Distributed Processing Symposium, Washington, DC, USA, April
2005. IEEE Computer Society.
[6] Christophe Giraud. DFA on AES. In H. Dobbertin, V. Rijmen,
and A. Sowa, editors, Advanced Encryption Standard - AES, volume
3373 of Lecture Notes in Computer Science, pages 27-41.
Springer, 2005.
[7] Olivier Faurax and Julien Francq. Security of several AES
implementations against delay faults. In Proc. 12th Nordic
Workshop on Secure IT Systems (NordSec 2007), October 2007.
[8] NIST. Announcing the Advanced Encryption Standard (AES).
Federal Information Processing Standards Publication 197,
November 26, 2001.
[9] Perl. http://www.perl.org.
[10] Gilles Piret and Jean-Jacques Quisquater. A differential
fault attack technique against SPN structures, with application
to the AES and KHAZAD. In Proc. Cryptographic Hardware and
Embedded Systems (CHES '03), Lecture Notes in Computer Science,
pages 77-88, 2003.
[11] Johannes Wolkerstorfer, Elisabeth Oswald, and Mario
Lamberger. An ASIC implementation of the AES S-boxes. In
CT-RSA, pages 67-78, 2002.
[12] Xilinx. http://www.xilinx.com/products/spartan3/3a.htm.
[13] Xilinx. http://www.xilinx.com/products/virtex5/index.htm.



 



Reducing FPGA Reconfiguration Time Overhead using Virtual Configurations

Ming Liu‡† , Zhonghai Lu† , Wolfgang Kuehn‡ , Axel Jantsch†


‡ II. Physics Institute † Dept. of Electronic Systems
Justus-Liebig-University Giessen (JLU), Germany Royal Institute of Technology (KTH), Sweden
{ming.liu, wolfgang.kuehn}@physik.uni-giessen.de {mingliu, zhonghai, axel}@kth.se

Abstract—Reconfiguration time overhead is a critical factor in
determining the system performance of dynamically reconfigurable
FPGA designs. To reduce the reconfiguration overhead, the most
straightforward way is to increase the reconfiguration
throughput, as many previous contributions did. In addition to
shortening the FPGA reconfiguration time itself, we introduce in
this paper the new concept of Virtual ConFigurations (VCF),
which hides the dynamic reconfiguration time in the background
to reduce the overhead. Experimental results demonstrate up to
29.9% throughput enhancement by adopting two VCFs in a
consumer-reconfigurable design. The packet latency performance
is also largely improved by pushing channel saturation to a
higher packet injection rate.

I. INTRODUCTION

Partial Reconfiguration (PR) enables the process of dynamically
reconfiguring a particular section of an FPGA design while the
remaining part is still operating. This vendor-dependent
technology provides benefits such as adapting hardware modules
at system run-time, sharing hardware resources to reduce device
count and power consumption, shortening reconfiguration time,
etc. [1] [2] [3]. Typically, partial reconfiguration is achieved
by loading the partial bitstream of a new design into the FPGA
configuration memory and overwriting the current one. The
reconfigurable portion then changes its behavior according to
the newly loaded configuration. Despite the flexibility of
changing part of the design at system run-time, the
reconfiguration process incurs overhead, since the
reconfigurable portion cannot work during that time due to the
incompleteness of the configuration data. It has to wait to
resume working until the complete configuration data have been
successfully loaded into the FPGA configuration memory.
Therefore, in performance-critical applications which require
fast or frequent switching of IP cores, the reconfiguration time
is significant and should be minimized to reduce the overhead.

To reduce the dynamic reconfiguration overhead, the most
straightforward way is to increase the data write-in throughput
of the configuration interface on FPGAs, specifically the
Internal Configuration Access Port (ICAP) on Xilinx FPGAs [4].
As an additional approach, we address the challenge of reducing
the reconfiguration overhead by employing the concept of
virtualization on FPGA configuration contexts. The remainder of
the paper is organized as follows: In Section II, related work
on reducing dynamic reconfiguration overhead is discussed. In
Section III, we introduce the concept of Virtual ConFigurations
(VCF), with which the dynamic reconfiguration time may be fully
or partly hidden in the background. Experimental results in
Section IV demonstrate the performance benefits of using VCFs in
terms of data delivery throughput and latency. Finally, we
conclude the paper and propose future work in Section V.

II. RELATED WORK

FPGA dynamic reconfiguration overhead refers to the time spent
on the module reconfiguration process. During that time, the
reconfigurable region on the FPGA cannot do effective work due
to the lack of a complete bitstream, which has negative effects
on the system performance. Reconfiguration overhead may be
minimized either with a reasonable scheduling policy which
decreases the number of context switches of hardware modules, or
by reducing the time required for each configuration. The former
approach is discussed in our previous publication [5]: we
observe from the experimental results that less than 0.3% of the
time is spent on configuration switching with a throughput-aware
scheduling policy, which does not much degrade the overall
processing throughput of the system under test. With regard to
the latter approach, design optimizations have previously been
adopted to increase the configuration throughput. For instance,
in [6], [7] and [8], the authors explore the design space of
various ICAP designs and enhance the reconfiguration throughput
to the order of magnitude of megabytes per second.
Unfortunately, the reconfiguration time is still constrained by
the physical bandwidth of the reconfiguration port on FPGAs.
Approaches that compress the partial bitstreams are discussed in
[8] and [9], shrinking the reconfiguration time under the
precondition of a fixed configuration throughput. In addition to
all the contributions described above, in this paper we address
the challenge of reducing the reconfiguration overhead by
employing the concept of virtualization on FPGA configuration
contexts.

III. VIRTUAL CONFIGURATION

In canonical PR designs, a Partially Reconfigurable Region (PRR)
has to stop working when a new module is to be loaded to replace
the existing one by run-time reconfiguration. This is the
overhead of switching hardware processes, which restricts the
overall system performance. As a



solution, we propose the concept of Virtual ConFiguration
(VCF) to hide the configuration overhead of a PR design.
As shown in Figure 1, two copies of configuration contexts,
each of which represents a VCF, are altogether dedicated to
a single PRR on a multi-context FPGA [10] [11] [12]. The
active VCF may still keep working in the foreground when
module switching is expected. The run-time reconfiguration
only happens invisibly in the background, and the new
partial bitstream is loaded into configuration context 2. After
the reconfiguration is finished, the newly loaded module
can start working by being swapped with the foreground
Figure 2. Timing diagrams of PR designs without or with VCFs
context, migrating from the background to the foreground.
The previously active configuration will be deactivated into
the background and wait for the next time reconfiguration.
The configuration context swapping between the background
and the foreground is logically realized by changing the
control on the PRR among different VCFs. It does not
need to really swap the configuration data in the FPGA
configuration memory, but instead switches control outputs
taking effect on the PRR using multiplexer (MUX) devices.
Hence the configuration context swapping takes only very
short time (normally some clock cycles), and is tiny enough
to be negligible compared to the processing time of the
system design. Figure 3. Virtual reconfigurations on single-context FPGAs
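The overlap argument above can be captured in a tiny timing model. The sketch below is illustrative Python, not the authors' code; T_WORK is an assumed per-module processing time, while T_REC matches the 10 μs reconfiguration time used later in the experiments.

```python
# Back-of-the-envelope model of how a background VCF hides reconfiguration
# time. Assumed values: T_WORK is the useful work per loaded module (us),
# T_REC the partial reconfiguration time (us), N_SWAPS the module switches.

T_WORK = 80.0
T_REC = 10.0
N_SWAPS = 100

# Canonical PR design (single context): work and reconfiguration serialize.
t_single = N_SWAPS * (T_WORK + T_REC)

# Two VCFs: the background context reconfigures while the foreground works,
# so each round costs only the longer phase; the MUX-based context swap
# itself (a few clock cycles) is neglected.
t_vcf = N_SWAPS * max(T_WORK, T_REC)

print(t_single)                # 9000.0 us
print(t_vcf)                   # 8000.0 us
print(1.0 - t_vcf / t_single)  # ~0.11: here T_REC < T_WORK, so the
                               # reconfiguration overhead is fully hidden
```

Whenever the work burst is longer than the reconfiguration time, the per-swap overhead disappears entirely; otherwise only part of it is hidden, which matches the "fully or partly removed" behaviour described in the text.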

Figure 1. Virtual reconfigurations on multi-context FPGAs

With the approach of adopting VCFs, the reconfiguration overhead can be fully or partly removed with the duplicated configuration contexts. The timing advantage is illustrated in Figure 2, compared to the canonical PR designs without VCFs. In Figure 2a we see that, in the canonical PR design without VCFs, the effective work time and the reconfiguration overhead have to be arranged in sequence on the time axis. By contrast, in Figure 2b the reconfiguration process only happens in the background, and the time overhead is therefore hidden by the working VCF in the foreground.

In normal FPGAs with only single-context configuration memories, VCFs may be implemented by reserving duplicated PRRs of the same size (see Figure 3). At each time, only one PRR is allowed to be active in the foreground and selected to communicate with the rest of the static design by MUXes. The other PRR waits in the background for reconfiguration and will be swapped to the foreground after its module is successfully loaded. Taking into account the resource utilization overhead of reserving duplicated PRRs, we usually do not adopt more than 2 VCFs.

IV. EXPERIMENTS

A. Experimental Setup

To investigate the impact of VCFs on performance, we set up a producer-consumer design with run-time reconfiguration capability. As illustrated in Figure 4, the producer periodically generates randomly-destined packets to 4 consumers and buffers them in 4 FIFOs. Each FIFO is dedicated to a corresponding consumer algorithm, which can be dynamically loaded into the reserved consumer PRR. The scheduler program monitors the “almost_full” signals from all FIFOs and arbitrates the to-be-loaded consumer module using a Round-Robin policy. Afterwards, the loaded consumer consumes its buffered data in a burst mode, until it has to be replaced by the winner of the next-round reconfiguration arbitration. The baseline canonical PR design has only one configuration context and must stop the working module before the reconfiguration starts. In the PR design with VCFs, we adopt only two configuration contexts, since the on-chip area overhead of multiple configuration contexts should be minimized. Experimental measurements have been carried out in cycle-accurate simulation using synthesizable VHDL code. Simulation provides much convenience for observing all signals in the waveform and debugging the design, and it yields the same results as implementing the design on any dynamically reconfigurable FPGA. Both the baseline and the VCF designs run at a system clock of 100 MHz. The overall on-chip buffering capability is parameterized in the order of kilobytes. For the reconfiguration time of each module, we select 10 μs, which is a reasonable value when using a practical Xilinx ICAP controller for partial reconfiguration [7]. The generated packets are 256 bits wide. The FIFO width is 32 bits. Before packets go into a FIFO, they are fragmented into flits.
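The fragmentation step can be illustrated as follows (a minimal Python sketch; the real design does this in hardware, and the MSB-first flit ordering is an assumption):

```python
# Fragment a 256-bit packet into 32-bit flits matching the FIFO width.

PACKET_BITS = 256
FLIT_BITS = 32

def fragment(packet):
    """Split a 256-bit packet (given as an int) into 32-bit flits, MSB-first."""
    n_flits = PACKET_BITS // FLIT_BITS
    mask = (1 << FLIT_BITS) - 1
    return [(packet >> (FLIT_BITS * (n_flits - 1 - i))) & mask
            for i in range(n_flits)]

packet = int.from_bytes(bytes(range(32)), "big")  # 32 bytes = 256 bits
flits = fragment(packet)
print(len(flits))     # 8 flits per packet
print(hex(flits[0]))  # 0x10203 (top bytes 0x00 0x01 0x02 0x03)
```

Each 256-bit packet therefore occupies 8 entries of a flit FIFO, which is why the FIFO depths below (512, 1K, 2K words) translate into 64, 128 and 256 buffered packets per consumer.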

Figure 5. Throughput measurement results (reconfiguration time = 10 μs)
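The enhancement percentages quoted in the text follow directly from the saturated throughput values; a quick arithmetic check (illustrative Python):

```python
# Sanity check of the reported throughput enhancements (packets/cycle/node).

def enhancement_pct(with_vcf, without_vcf):
    """Relative throughput gain of the 2-VCF design, in percent."""
    return (with_vcf / without_vcf - 1.0) * 100.0

# 1K FIFO depth, 10 us reconfiguration time:
print(round(enhancement_pct(0.0165, 0.0127), 1))    # 29.9
# 1K FIFO depth, 50 us reconfiguration time:
print(round(enhancement_pct(0.00628, 0.00492), 1))  # 27.6
```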

Figure 4. Experimental setup of the consumer-reconfigurable design
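The scheduler's arbitration step can be sketched as below (illustrative Python; the actual scheduler is part of the VHDL design, and the scan order is an assumption consistent with a Round-Robin policy):

```python
# Round-robin arbitration over the "almost_full" FIFO flags: among the FIFOs
# currently requesting service, pick the next consumer in circular order
# starting after the previous winner.

def round_robin_pick(almost_full, last_winner):
    """Return the index of the next consumer module to load, or None if no
    FIFO is asserting almost_full."""
    n = len(almost_full)
    for offset in range(1, n + 1):
        candidate = (last_winner + offset) % n
        if almost_full[candidate]:
            return candidate
    return None

flags = [True, False, True, True]  # FIFOs 0, 2 and 3 are almost full
print(round_robin_pick(flags, last_winner=0))  # 2 (skips idle FIFO 1)
print(round_robin_pick(flags, last_winner=2))  # 3
print(round_robin_pick([False] * 4, 0))        # None
```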


B. Results

We measured the received packet throughput in the unit of packets per cycle per consumer node, with FIFO depths of 512, 1K and 2K respectively. Measurement results are shown in Figure 5. We observe from the figure that:
1) As the packet injection rate increases, the on-chip communication becomes progressively saturated due to the limitation of the packet consuming capability;
2) For both types of PR designs (red or light curves with 2 VCFs, blue or dark curves without), larger FIFO depths lead to higher saturated throughput, since the data read-out burst size can be increased by the larger buffering capability and the reconfiguration time overhead is comparatively reduced;
3) Introducing VCFs can further reduce the reconfiguration overhead by hiding the reconfiguration time in the background. In the most obvious case of 1K FIFO depth, two VCFs increase the throughput from 0.0127 packets/cycle/node to 0.0165, achieving a performance enhancement of 29.9%. The other two cases of 512 and 2K FIFO depth show performance enhancements of 26.4% and 17.9% respectively.

We enlarged the time span of each configuration from 10 μs to 50 μs and did further throughput measurements with a middle-size FIFO depth of 1K. Results are demonstrated in Figure 6, comparing the PR design using 2 VCFs with the one without VCFs. We observe that the overall system throughput is worsened by the increased reconfiguration time overhead, specifically from a saturated value of 0.0127 (see Figure 5) to 0.00492 packets/cycle/node for the non-VCF design. The increased reconfiguration time also results in channel saturation at an even lower packet injection rate, of about 1 packet per 50 cycles. In this test, we can still see a performance improvement of 27.6% (0.00628 vs. 0.00492 packets/cycle/node) when using 2 VCFs to partly counteract the reconfiguration overhead. The channel saturation point is extended to about 1 packet per 35 cycles by the duplicated VCFs.

Figure 6. Throughput measurement results (reconfiguration time = 50 μs)

Besides the throughput comparison, we also collected statistics on packet latency to demonstrate the effect of using VCFs. We discuss the average latency of a certain amount of packets, and exclude the system warm-up and cool-down cycles from the measurements, taking only steady communications into account. The latency is calculated from the instant when the packet is injected into the source queue to the instant when the packet is received by the destination node. It consists of two components: the queuing time in the source queue and the network delivery time in the flit FIFOs. Measurements were conducted in the experimental setup with the smaller reconfiguration time of 10 μs and the middle-size FIFO depth of 1K. Results are illustrated in Figure 7. We observe that 2 VCFs have a slight reduction effect on the packet latency before the channel saturation. In this curve segment, packets do not stay in the source queue for too long, but they must wait in the flit FIFOs until their specific destination node is configured to read them out in a burst mode. Therefore we see two comparatively flat curve segments before the channel saturation, because of the steady switching frequency of consumer nodes. Nevertheless, after the channel's packet delivery capability is saturated, packets have to spend much time waiting in the source queue to enter the flit FIFOs. Thus the average latency of packets deteriorates significantly and generates the rising curve segments in the figure. By contrast, using 2 VCFs reduces the reconfiguration overhead and extends the channel saturation to a higher packet injection rate. It reduces the packet wait time in the source queue and introduces packets into the flit FIFOs at an earlier time, leading to a large improvement in the packet latency performance.

Figure 7. Latency measurement results (reconfiguration time = 10 μs)

V. CONCLUSION AND FUTURE WORK

In this paper, we introduce the concept of virtual configurations to hide the FPGA dynamic reconfiguration time in the background and reduce the reconfiguration overhead. Experimental results on a consumer-reconfigurable design demonstrate up to 29.9% throughput improvement of received packets at each consumer node. The packet latency performance is largely improved as well, by extending the channel saturation to a higher packet injection rate. This approach is well suited for PR designs on multi-context FPGAs. For single-context FPGAs, the performance improvement is accompanied by the resource utilization overhead of reserving duplicated PR regions.

In future work, we will take advantage of VCFs in practical PR designs for specific applications. Research and engineering work on multi-context FPGAs will also be useful to popularize this technology in dynamically reconfigurable designs with high performance requirements.

ACKNOWLEDGMENT

This work was supported in part by BMBF under contract Nos. 06GI9107I and 06GI9108I, FZ-Juelich under contract No. COSY-099 41821475, HIC for FAIR, and WTZ: CHN 06/20.

REFERENCES

[1] C. Kao, “Benefits of Partial Reconfiguration”, Xcell Journal, Fourth Quarter 2005, pp. 65-67.

[2] E. J. McDonald, “Runtime FPGA Partial Reconfiguration”, in Proc. of the 2008 IEEE Aerospace Conference, pp. 1-7, Mar. 2008.

[3] C. Choi and H. Lee, “A Reconfigurable FIR Filter Design on a Partial Reconfiguration Platform”, in Proc. of the First International Conference on Communications and Electronics, pp. 352-355, Oct. 2006.

[4] Xilinx Inc., “Virtex-4 FPGA Configuration User Guide”, UG071, Jun. 2009.

[5] M. Liu, Z. Lu, W. Kuehn, and A. Jantsch, “FPGA-based Adaptive Computing for Correlated Multi-stream Processing”, in Proc. of the Design, Automation & Test in Europe Conference, Mar. 2010.

[6] J. Delorme, A. Nafkha, P. Leray and C. Moy, “New OPBHWICAP Interface for Realtime Partial Reconfiguration of FPGA”, in Proc. of the International Conference on Reconfigurable Computing and FPGAs, Dec. 2009.

[7] M. Liu, W. Kuehn, Z. Lu, and A. Jantsch, “Run-time Partial Reconfiguration Speed Investigation and Architectural Design Space Exploration”, in Proc. of the International Conference on Field Programmable Logic and Applications, Aug. 2009.

[8] S. Liu, R. N. Pittman, and A. Forin, “Minimizing Partial Reconfiguration Overhead with Fully Streaming DMA Engines and Intelligent ICAP Controller”, in Proc. of the International Symposium on Field-Programmable Gate Arrays, Feb. 2010.

[9] J. H. Pan, T. Mitra, and W. Wong, “Configuration Bitstream Compression for Dynamically Reconfigurable FPGAs”, in Proc. of the International Conference on Computer-Aided Design, Nov. 2004.

[10] Y. Birk and E. Fiksman, “Dynamic Reconfiguration Architectures for Multi-context FPGAs”, International Journal of Computers and Electrical Engineering, vol. 35, issue 6, Nov. 2009.

[11] M. Hariyama, S. Ishihara, N. Idobata and M. Kameyama, “Non-volatile Multi-Context FPGAs using Hybrid Multiple-Valued/Binary Context Switching Signals”, in Proc. of the International Conference on Reconfigurable Systems and Algorithms, Aug. 2008.

[12] K. Namba and H. Ito, “Proposal of Testable Multi-Context FPGA Architecture”, IEICE Transactions on Information and Systems, vol. E89-D, issue 5, May 2006.



Timing Synchronization for a Multi-Standard
Receiver on a Multi-Processor System-on-Chip
Roberto Airoldi, Fabio Garzia and Jari Nurmi
Tampere University of Technology
Department of Computer Systems
Korkeakoulunkatu 1, P.O. Box 553, FI-33101
Tampere, Finland
email: name.surname@tut.fi

Abstract—This paper presents the implementation of timing synchronisation for a multi-standard receiver on Ninesilica, a homogeneous Multi-Processor System-on-Chip (MPSoC) composed of 9 nodes. The nodes are arranged in a mesh topology and the on-chip communication is supported by a hierarchical Network-on-Chip. The system is prototyped on FPGA. The mapping of the timing synchronisation algorithms on Ninesilica showed a good parallelization efficiency, leading to speed-ups of up to 7.5x when compared to a single-processor architecture.

I. INTRODUCTION

In the last twenty years the development of embedded systems has been driven by the boom of wireless technology. The continuous development of new wireless protocols has imposed new constraints on receivers. Today's state-of-the-art devices are able to work over a heterogeneous set of networks, such as 2G, 3G, Bluetooth, Wi-Max and Wi-Fi. Traditional design approaches, based on a collage of single-standard receivers, limit the number of supported wireless protocols due to constraints such as power and area consumption, flexibility and efficiency. In the past few years, research institutes from both academia and industry have moved their focus towards Software Defined Radio (SDR). SDR platforms might be a feasible approach to implement highly flexible transceivers. Ideally, an SDR terminal would be able to work over different networks by just re-configuring/re-programming the software running on the platform, according to the user's wishes. Hence an SDR-enabling platform must provide high computational power to meet the strict real-time requirements of today's and tomorrow's wireless standards with high flexibility. Many solutions have been proposed from both academia and industry to meet the requirements of SDR applications. Reconfigurable architectures (such as Montium [1]) as well as DSP solutions (see [2] [3]) have been widely explored for SDR applications. Furthermore, in the past few years Multi-Processor Systems-on-Chip (MPSoCs) have gained growing interest from the research community as a possible way to meet the high performance required by wireless standards with high flexibility [4].

Two communication protocols are mostly utilised for the physical layer processing in wireless communications: W-CDMA and OFDM. In this paper the authors present the implementation of the timing synchronisation procedure for W-CDMA and OFDM systems on a homogeneous MPSoC.

This paper is organised as follows: in the next section a brief overview of the Multi-Processor System-on-Chip architecture is given; in Section III timing synchronisation for W-CDMA and OFDM systems and their mapping onto the MPSoC are analysed; finally, in Sections IV and V results and conclusions are drawn.

II. NINESILICA MPSOC OVERVIEW

Ninesilica is a homogeneous MPSoC composed of nine nodes arranged in a 3x3 mesh topology. Ninesilica is derived from the Silicon Café template, developed at Tampere University of Technology. The template allows the creation of either heterogeneous or homogeneous multi-processor architectures with a generic number of nodes. The communication between nodes takes place through a hierarchical Network-on-Chip (NoC) [6] that taps directly into the node communication system.

Fig. 1 presents a schematic view of Ninesilica while Fig. 2 shows the internal structure of a single node. Each node hosts a COFFEE RISC processor [5], data and instruction memories and a Network Interface (NI). The NI is composed of two parts: initiator and target. The initiator is responsible for routing remote data (coming from the NoC) inside the node, while the target is responsible for routing data from the node to the NoC. The central node is also equipped with I/O interfaces and takes care of the data distribution among the other nodes. Moreover, it acts as a schedule manager, distributing tasks to the other nodes (also referred to as computational nodes). Indeed its main task is to control the communication flow and the other nodes' activities.

The system was prototyped on a Stratix IV FPGA device. The synthesis results are collected in Table I. More details about the architecture can be found in [7].

III. W-CDMA AND OFDM TIMING SYNCHRONISATION

Considering the signal processing chain for a wireless standard's physical layer, it is possible to identify two different communication techniques that cover most of the wireless



Fig. 1. Schematic view of Ninesilica MPSoC

signal processing for today's and next-generation communications. W-CDMA [8] is utilised as the physical layer for UMTS systems, while OFDM [9] is for example utilised in Wi-Fi (IEEE 802.11a/g/n), Wi-Max (IEEE 802.16) and 3GPP-LTE.

Fig. 2. Detailed view of a Ninesilica node

A very critical step in the signal processing for wireless systems is timing synchronisation. The timing synchronisation procedure takes care of synchronising the mobile terminal to the baseband station that offers the best available downlink. Hence errors in the timing would lead to an incorrect demodulation of the incoming data. Moreover, this procedure is highly computationally demanding and should be performed continuously to keep the synchronisation. Independently of the radio communication protocol utilised, the timing synchronisation step is based on the evaluation of correlation sequences.

TABLE I
STRATIX IV SYNTHESIS RESULTS OF NINESILICA MPSOC

Component            Adapt. LUT   Registers   Utilisation %
COFFEE RISC          7054         4941        2.0
Local network node   296          226         0.1
Computational Node   7360         5167        2.1
Global Network       5104         4170        1.3
Total                71679        50897       20

A. W-CDMA Timing Synchronisation mapping on Ninesilica

W-CDMA transmissions are organised in frames. Each frame is composed of 15 different slots. To ease the synchronisation procedure between cell and mobile terminal, the W-CDMA protocol utilises 3 separate channels: Primary Synchronisation Channel (P-SCH), Secondary Synchronisation Channel (S-SCH) and Common Pilot Channel (CPICH). However, for the timing synchronisation only 1 channel is utilised. As Figure 3 shows, P-SCH transmits the same sequence of 256 samples in each slot. This sequence is common to all the transmitting cells and is known at the receiver. The slot boundary identification is then done through a match filter. The match filter correlates the incoming data stream to the known sequence. Output values in the sequence that exceed a pre-fixed threshold indicate a match in the slot search. Once the position of a slot is known, it is possible to perform a multi-path detection for better accuracy in the synchronisation. The match filter computes a sum of 256 complex multiplications. This operation is performed on Ninesilica in a distributed way. Each computational node computes a part of the sum, returning the partial results to the control node. The control node then performs the final sum and checks if the threshold was exceeded. In that case the multi-path correction can be performed, otherwise a new evaluation of the match filter is done.

The multi-path estimation and correction is based on the observation of four consecutive slots. The incoming data stream is correlated to the known sequence over a multi-path window (1024 samples). The correlation sequences are then averaged, and the maximum value of the sequence indexes the multi-path that offers the best signal-to-noise ratio. The data of the four slots are equally divided among the computational nodes. Each node, working on an independent set of data, returns the correlation values to the central node, which performs the average and the maximum search to identify the best path.

B. OFDM Timing Synchronisation mapping on Ninesilica

OFDM timing synchronisation for IEEE 802.11a is obtained through a delay and correlate approach.
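Both mappings follow the same scatter/gather pattern: the control node splits a correlation sum among the computational nodes and accumulates the returned partial results. A minimal sketch of the W-CDMA match filter of Section III-A in this style (illustrative Python; the real implementation runs on the COFFEE cores, and the threshold value is a placeholder):

```python
# Distributed match filter: the 256-tap complex correlation is split among
# the 8 computational nodes; the control node sums the partial results and
# applies the detection threshold.

N_NODES = 8    # computational nodes (the 9th, central node is the control)
N_TAPS = 256   # P-SCH sequence length in samples

def node_partial_sum(window, sequence, lo, hi):
    """Partial correlation computed by one node over the tap range [lo, hi)."""
    return sum(window[i] * sequence[i].conjugate() for i in range(lo, hi))

def match_filter(window, sequence, threshold):
    """Control node: scatter tap ranges, gather partials, test the threshold."""
    chunk = N_TAPS // N_NODES  # 32 taps per node
    partials = [node_partial_sum(window, sequence, n * chunk, (n + 1) * chunk)
                for n in range(N_NODES)]
    magnitude = abs(sum(partials))
    return magnitude, magnitude > threshold

# A window that exactly matches the known unit-power sequence sums to N_TAPS.
seq = [complex(1, 0) if i % 2 == 0 else complex(0, 1) for i in range(N_TAPS)]
print(match_filter(seq, seq, threshold=200))  # (256.0, True)
```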



Fig. 3. Data structure of the P-SCH over 1 frame period
Fig. 4. Data structure of an IEEE 802.11a OFDM preamble
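The delay-and-correlate metric described in this subsection can be sketched as follows (illustrative Python, not the authors' implementation; the 16-sample short-sequence length follows IEEE 802.11a, and the test signal is synthetic):

```python
import random

# Coarse timing acquisition by delay-and-correlate: correlate the received
# stream with a copy of itself delayed by the short-sequence length. Over
# the periodic short training sequence the metric forms a plateau.

def delay_and_correlate(samples, lag, window):
    """C[n] = | sum_{i < window} r[n+i] * conj(r[n+i-lag]) | for each n."""
    out = []
    for n in range(lag, len(samples) - window):
        acc = sum(samples[n + i] * samples[n + i - lag].conjugate()
                  for i in range(window))
        out.append(abs(acc))
    return out

LAG = 16  # short training symbol length in samples (IEEE 802.11a)
short = [complex(1, 0), complex(0, 1)] * (LAG // 2)
preamble = short * 10          # repeated short training sequence
random.seed(0)                 # synthetic, non-periodic pre-preamble samples
noise = [complex(random.uniform(-1, 1), random.uniform(-1, 1))
         for _ in range(64)]
stream = noise + preamble

c = delay_and_correlate(stream, LAG, LAG)
# Over the random segment the terms have random phases and tend to cancel;
# once the window is fully inside the periodic preamble the metric is exact:
print(c[80])  # 16.0
```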

Each transmission begins with a preamble. This preamble is utilised by the receiver to synchronise to the transmitter. Figure 4 shows its structure. The first part of the preamble is the short training sequence, which is the continuous repetition of a short sequence. For this reason the coarse-grain timing synchronisation can be done through a delay and correlate approach. The received data stream is correlated to a delayed version of itself. The delay introduced is equal to the length of the short sequence. Peaks in the correlation sequence identify the beginning of a communication. Ninesilica performs this operation in a distributed way. The central node distributes a chunk of data among the computational nodes, which perform the correlation on independent sections of data, returning the processed data to the control node. The control node evaluates whether a communication was found by analysing the correlation values received. If not, a new chunk of data is sent for a further correlation analysis. After the initial synchronisation is performed, a multi-path estimation can take place.

OFDM multi-path estimation and correction is based on the correlation between the long training sequences of the communication preamble and a known sequence. This operation is performed following the same approach as in the W-CDMA case. Data is sent to the computational nodes, which process the data independently and return the correlation evaluations to the control node. The control node finally analyses the correlation sequence, refining the timing synchronisation with the estimation of the multi-path.

IV. RESULTS

The mapping of the timing synchronisation algorithms for OFDM and W-CDMA receivers on Ninesilica was evaluated in terms of the number of clock cycles spent to accomplish the procedure. Data for the simulation was provided by a Matlab model of the W-CDMA and OFDM systems. A signal-to-noise ratio of −20 dB was utilised. Moreover, the Matlab model was utilised as a reference for the validation of the simulation results. Furthermore, the simulation results were compared to an implementation on a single processor to determine the scalability of the algorithm parallelization.

A single correlation point for W-CDMA is performed in 2,546 clock cycles on Ninesilica and 12,381 on a single COFFEE core, leading to a speed-up of 5x. For the OFDM system, a single correlation point (delay and correlate approach) takes respectively 88 and 350 clock cycles on Ninesilica and on a single COFFEE core, giving a speed-up of 4x. The execution of the whole synchronisation process for W-CDMA on Ninesilica gives a speed-up of 7.5x when compared to a single COFFEE core execution. For the OFDM system the achieved speed-up is 6.7x.

V. CONCLUSIONS

In this work the authors presented the timing synchronisation for W-CDMA and OFDM systems on Ninesilica. Ninesilica takes advantage of data-level parallelism, leading to high speed-ups compared to a single core. Future work will explore the scalability of the system with the number of nodes.

ACKNOWLEDGEMENT

The authors gratefully acknowledge the Nokia Foundation for their support.

REFERENCES

[1] G. Rauwerda, P. M. Heysters, and G. J. Smit, “An OFDM Receiver Implemented on the Coarse-grain Reconfigurable Montium Processor,” in Proceedings of the 9th International OFDM Workshop (InOWo'04), Dresden, Germany, pp. 197-201, September 15-16, 2004.

[2] J. Glossner, D. Iancu, M. Moudgill, G. Nacer, S. Jinturkar, and M. Schulte, “The Sandbridge SB3011 SDR Platform,” in Mobile Future, 2006 and the Symposium on Trends in Communications (SympoTIC '06), Joint IST Workshop on, pp. ii-v, June 2006.

[3] Y. Lin, H. Lee, M. Woh, Y. Harel, S. Mahlke, T. Mudge, C. Chakrabarti, and K. Flautner, “SODA: A High-performance DSP Architecture for Software-defined Radio,” IEEE Micro, vol. 27, pp. 114-123, Jan.-Feb. 2007.

[4] W. Wolf, A. A. Jerraya, and G. Martin, “Multiprocessor System-on-Chip (MPSoC) Technology,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, pp. 1701-1713, Oct. 2008.

[5] J. Kylliäinen, T. Ahonen, and J. Nurmi, “General-purpose embedded processor cores - the COFFEE RISC example,” in Processor Design: System-on-Chip Computing for ASICs and FPGAs (J. Nurmi, ed.), ch. 5, pp. 83-100, Kluwer Academic Publishers / Springer Publishers, June 2007. ISBN-10: 1402055293, ISBN-13: 978-1-4020-5529-4.

[6] T. Ahonen and J. Nurmi, “Hierarchically heterogeneous network-on-chip,” in Proceedings of the 2007 International Conference on Computer as a Tool (EUROCON '07), pp. 2580-2586, IEEE, 9-12 September 2007. ISBN: 978-1-4244-0813-9, DOI: 10.1109/EURCON.2007.4400469.

[7] R. Airoldi, F. Garzia, and J. Nurmi, “Implementation of a 64-point FFT on a Multi-Processor System-on-Chip,” in Proceedings of the 5th International Conference on Ph.D. Research in Microelectronics & Electronics (PRIME '09), Cork, Ireland, pp. 20-23, IEEE, July 2009.

[8] E. Dahlman, P. Beming, J. Knutsson, F. Ovesjo, M. Persson, and C. Roobol, “W-CDMA - the radio interface for future mobile multimedia communications,” IEEE Transactions on Vehicular Technology, vol. 47, pp. 1105-1118, November 1998.

[9] X. Wang, “OFDM and its application to 4G,” in Proc. of the 14th Annual International Conference on Wireless and Optical Communications (WOCC 2005), p. 69, 22-23 April 2005.







Mesh and Fat-Tree comparison for
dynamically reconfigurable applications
Ludovic Devaux, Sebastien Pillement, Daniel Chillet, Didier Demigny
University of Rennes I / IRISA
6 rue de Kerampont, BP 80518, 22302 LANNION, FRANCE
Email: Ludovic.Devaux@irisa.fr

Abstract—Dynamic reconfiguration of FPGAs allows the dynamic management of the various tasks that describe an application. This feature permits, for optimization purposes, placing tasks online in available regions of the FPGA. Dynamic reconfiguration of tasks notably leads to communication problems, since tasks are not present in the matrix during the whole computation time. This dynamicity needs to be supported by the interconnection network. In this paper, we compare the two most popular interconnection topologies, the Mesh and the Fat-Tree. These networks are compared considering the dynamic reconfiguration paradigm as well as network performance and architectural constraints.

I. INTRODUCTION

The evolution of technologies permits supporting complex signal processing applications. The number of tasks constituting an application grows and starts to outnumber the available resources provided by many FPGAs. Facing this implementation constraint, FPGAs that can be reconfigured on-the-fly were proposed. Hence, at each moment, only the hardware tasks which need to be executed are configured in the FPGA fabric. Assuming that all the tasks do not have to be executed simultaneously, they are allocated and scheduled at runtime. This is the Dynamic and Partial Reconfiguration (DPR) paradigm.

The management of DPR raises high challenges to be effective. Thus, the interconnection architecture needs to be compliant with the dynamic implementation, and location, of the tasks. This architecture should support the constraints induced by the DPR paradigm by providing a flexible way for transferring data between every set of logical resources that is used to define the hardware parts of the tasks. These sets can be the hardware implementations of the tasks (static or dynamic), shared elements (memory, input/output), or hardware processors running software tasks.

In this article, we compare the two most popular interconnection networks, Meshes and Fat-Trees. Our comparison is based on the adequacy between these networks and current partially and dynamically reconfigurable FPGAs. Doing so, we present which interconnection network is the most useful for implementing real-life complex applications using dynamic reconfiguration. The paper is organized as follows. In Section II, FPGAs supporting dynamic reconfiguration are introduced. In Section III, Mesh and Fat-Tree architectures are discussed considering typical applicative requirements and FPGA characteristics. To conclude, we indicate which of the Mesh or the Fat-Tree topology best fits present applications using dynamic reconfiguration and current FPGA architectures.

II. CONTEXT

A. Reconfigurable architectures

The industrial products supporting dynamic reconfiguration are the Atmel AT40K series [1], the Altera Stratix IV series [2] and the Xilinx Virtex2 Pro, Virtex4, Virtex5 and most recently Virtex6 series [3], [4]. Atmel FPGAs are very limited in the number of available reconfigurable resources, so they do not support complex applications like signal processing [1]. Altera circuits are not reconfigurable like Atmel or Xilinx FPGAs [2]. Indeed, they currently do not support dynamic reconfiguration for all logical



resources but only for input/output parameters. Thus, the Xilinx series are the best choice for current dynamic-hardware-dependent research. Considering that Xilinx is currently the market leader in dynamically reconfigurable devices, the interconnection networks are adapted to Xilinx FPGA characteristics, and especially to the Virtex5 and Virtex6 series, which are the most recent ones.

Dynamic reconfiguration of Xilinx FPGAs is enabled by the PlanAhead tool [5]. PlanAhead allows specifying dynamically reconfigurable regions, so-called PRRs (Partially Reconfigurable Regions). A PRR is implemented statically despite the fact that its content is dynamic. Thus, at runtime, dynamic reconfiguration takes place inside the PRRs. This is the base element of dynamic reconfiguration. In Xilinx Virtex2 Pro FPGAs, the dynamic reconfiguration impacted whole columns of resources even if only a small part of them was declared to be part of a PRR. This limitation had a great impact on the possibility to implement complex applications, but it no longer exists in the Virtex4, Virtex5, and Virtex6 series. In these series, the dynamic reconfiguration only impacts the resources inside a PRR, so that several PRRs can be declared using the same columns of resources.

III. COMPARISON OF MESH AND FAT-TREE TOPOLOGIES

In this section, the Mesh and Fat-Tree topologies are presented in detail. In order to compare them, hardware costs, network performances, and the adequacy to be implemented in present FPGAs are studied.

A. Used resources

A Mesh (Figure 1.(a)) is a direct network in which each switch connects a hardware task. On the contrary, a Fat-Tree (Figure 1.(b)) is an indirect network. If the number of connected tasks is N and the number of inputs/outputs of each switch is k, then the number of switches in a Fat-Tree is calculated by

    S_FatTree = (2N / k) * log_{k/2}(N)   [6]

Fig. 1. Presentation of (a) Mesh and (b) Fat-Tree topologies. S boxes are switch elements while R boxes represent a set of logical resources used to implement the tasks.

In this formula, assuming that the Fat-Tree is complete in terms of connected tasks, N is expressed by N = 2^x where x is an integer and x ≥ 1. When N does not match the previous formula, designers should build the network considering the next higher admissible value of N in order to keep the complete tree-based structure of the network. The number of connection links needed by the Fat-Tree is calculated by

    L_FatTree = N * log_{k/2}(N)   [6]

In a Mesh, if the number of connected tasks is N, whose value can be any positive integer, and if D is the radix of the Mesh, then the number of switches needed to build a regular bi-dimensional square Mesh is calculated by

    S_Mesh = (√N)² = D²

The number of needed communication links is calculated by

    L_Mesh = N + 2(D² − D)

The results presented in Table I can be calculated from the previous formulas. Thus, a Mesh has the advantage of consuming fewer resources for routing purposes than a Fat-Tree. But from these results we can see that the difference between Fat-Tree and Mesh resources scales down when the number of connected tasks is low.
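These cost formulas can be tabulated with a short script (illustrative Python; k = 4 ports per Fat-Tree switch and D = ⌈√N⌉ for the Mesh radix reproduce the values in Table I):

```python
import math

# Switch and link counts for a Fat-Tree with k-port switches and for a
# regular bi-dimensional square Mesh, from the formulas above.

def fat_tree_costs(n_tasks, k=4):
    levels = math.log(n_tasks, k / 2)        # log_{k/2}(N)
    switches = (2 * n_tasks / k) * levels
    links = n_tasks * levels
    return round(switches), round(links)     # round away float error

def mesh_costs(n_tasks):
    d = math.ceil(math.sqrt(n_tasks))        # Mesh radix D
    return d * d, n_tasks + 2 * (d * d - d)  # S_Mesh, L_Mesh

for n in (2, 4, 8, 16, 32, 64):
    print(n, fat_tree_costs(n), mesh_costs(n))
# e.g. n=64 gives Fat-Tree (192, 384) and Mesh (64, 176), as in Table I
```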



TABLE I
NUMBER OF SWITCHES AND LINKS NEEDED TO IMPLEMENT A MESH AND A FAT-TREE DEPENDING ON THE NUMBER OF CONNECTED TASKS.

Connected Tasks | Fat-Tree Switches | Fat-Tree Links | Mesh Switches | Mesh Links
       2        |         1         |        2       |       4       |      6
       4        |         4         |        8       |       4       |      8
       8        |        12         |       24       |       9       |     20
      16        |        32         |       64       |      16       |     40
      32        |        80         |      160       |      36       |     92
      64        |       192         |      384       |      64       |    176

B. Network characteristics for dynamic operation

An application is typically constituted of some statically implemented tasks and of several dynamic tasks. In order to fit the largest scope of applications, no assumptions are made about the implementation of the tasks; they are implemented heterogeneously in terms of needed logical resources.

Concerning the placement, every task can be connected anywhere to the network, even if it leads to the worst cases of communication. Indeed, two tasks exchanging a large amount of data (data-flow applications) can be connected to the same switch in a Fat-Tree or to two neighbor switches in a Mesh, but also to the opposite sides of the network. This concept is presented in Figure 2.

Fig. 2. Mesh (A,B) and Fat-Tree (C,D) topologies interconnecting 4 PRRs. Dynamic tasks T1 and T3 communicate respectively with tasks T2 and T4.

From this point of view, a Fat-Tree topology has many advantages compared to a Mesh. Indeed, a Fat-Tree avoids deadlock risks that can occur in a Mesh if the control does not possess specific mechanisms to anticipate this risk [6]. The fact that the bandwidth is constant between each hierarchical level of the Fat-Tree is also a very interesting characteristic. Thus, even if communicating tasks are placed at the opposite sides of the network, a Fat-Tree guarantees that every data is routed over the network without any contention, because there is always at least one communication way available with a constant bandwidth. Since a Mesh is a direct network, the routing cannot be guaranteed like in a Fat-Tree because of the low available bandwidth compared to the number of switches [7]. Furthermore, in a Mesh, the communication requirements between tasks and shared elements induce the creation of hot-spots, increasing the likelihood of livelocks and deadlocks. One way to avoid contention risks is to use virtual channels. However, this solution has a very high cost in terms of used memories.

The two networks are simulated using ModelSim 9.5c [8] for a 32-bit data width and with a buffer depth of 4x32-bit words. Concerning network performances, the highest data rates are chosen in order to place the two topologies into the worst functioning cases. So, for a connection of 8 tasks, their transfer rates are fixed to 800 Mbit/s each. Simulations were realized sending 1562 packets of 16 words each for an injection time of 1 ms (Figure 3).

From these results, with a transfer rate fixed to 800 Mbit/s per task, the Mesh and Fat-Tree topologies provide equal latencies up to 4 simultaneously connected tasks. Then, due to a change of the network scale, a Mesh presents a lower average latency up to 9 connected tasks. The Fat-Tree provides a slightly higher average latency, but it can connect 12 tasks without saturating. This is a very important result because, with a uniform repartition of the data traffic, a Mesh seems interesting for interconnecting a low number of tasks. For more connected tasks, a Fat-Tree is well suited. However, these results were obtained from a simulation with a uniform repartition of data



and without any hot-spot. In usual systems using shared elements, a Mesh will present hot-spots near these elements and the resulting average latency will grow significantly. While this problem has no influence on the Fat-Tree topology, the latter is more suited than a Mesh for an implementation in real-life applications.

Fig. 3. (A) Comparison of the average latencies for a connection of 8 tasks, and (B) maximal admissible data rates per task depending on the number of simultaneously connected tasks.

Considering the maximal admissible transfer rates per task, a Fat-Tree supports equal or higher rates than a Mesh. So, in real-life applications where bandwidth is a major requirement, Fat-Trees should be implemented instead of Meshes.

C. Network implementation

The implementation of the Fat-Tree and Mesh topologies on a Xilinx Virtex5 XC5VSX50T leads to the results presented in Table II. Considering the FPGA resource utilization, implementing a NoC as a central column presents many advantages. Thus, depending on the hierarchical level, the Fat-Tree can be implemented with a very limited number of resources. Therefore, the resources remain free for other tasks or for a processor implementation.

TABLE II
NUMBER OF USED RESOURCES NEEDED TO IMPLEMENT A MESH AND A FAT-TREE DEPENDING ON THE NUMBER OF CONNECTED TASKS.

Resources      | Fat-Tree 4 tasks | Fat-Tree 8 tasks | Mesh 4 tasks | Mesh 8 tasks
Registers      |        900       |       2700       |      524     |     1549
LUTs           |       3416       |      10248       |     1992     |     6083
Free registers |        97%       |        92%       |      98%     |      95%
Free LUTs      |        90%       |        69%       |      94%     |      81%

Thus, with this concept of implementation, both Mesh and Fat-Tree topologies are compliant with present technology and can be implemented with the guarantee that their specifications (routing, latency...) are respected even if the FPGAs and the tasks are heterogeneous. However, considering its structure, the Fat-Tree is particularly suited for this concept of implementation.

IV. CONCLUSION

In this article, we have presented a comparison between the two most popular interconnection networks, the Mesh and the Fat-Tree. It appeared that a Mesh has the advantage of consuming less routing resources than the Fat-Tree. However, considering that present applications do not often need more than ten simultaneously implemented dynamic tasks, and rarely more than fifteen, the difference in terms of logical resource utilization for routing purposes can be acceptable. Furthermore, while both of them are compliant with present FPGA specifications, the demonstration was made that a Fat-Tree is more adapted to the dynamic reconfiguration paradigm. It presents equal or higher network performances, a deadlock-free optimal routing algorithm, and a hardware structure allowing to provide a constant bandwidth to every task everywhere in the network.

REFERENCES
[1] ATMEL, AT40K05/10/20/40AL: 5K–50K Gate FPGA with DSP Optimized Core Cell and Distributed FreeRam, Enhanced Performance Improvement and Bi-directional I/Os (3.3 V), 2006, revision F.
[2] Altera, Stratix IV Device Handbook, Volume 1, ver. 4.0, November 2009.
[3] Xilinx, Virtex-5 FPGA Configuration User Guide, 2008, v3.5.
[4] Xilinx, XST User Guide for Virtex-6 and Spartan-6 Devices, December 2, 2009, UG687 (v11.4).
[5] Xilinx, PlanAhead User Guide, version 1.1, 2008.
[6] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2006, ch. Appendix E: Interconnection Networks.
[7] V.-D. Ngo and H.-W. Choi, "Analyzing the performance of mesh and fat-tree topologies for network on chip design," Lecture Notes in Computer Science, vol. 3824, pp. 300–310, 2005.
[8] Mentor Graphics, ModelSim LE/PE User's Manual 6.5c, 2009.


Technology Independent, Embedded Logic Cores
Utilizing synthesizable embedded FPGA-cores for ASIC design validation

Joachim Knäblein, Claudia Tischendorf, Erik Markert, Ulrich Heinkel


Chair Circuit and System Design
Chemnitz University of Technology
Chemnitz, Germany

Abstract—This article describes an approach to embed technology independent, synthesizable FPGA-like cores into ASIC designs. The motivation for this concept is to combine the best aspects of the two chip design domains ASIC and FPGA. ASICs have better timing performance, are cheaper in mass production and consume less power. FPGAs have the big advantage of being reconfigurable. With FPGA-like cores embedded into ASICs, this extraordinary FPGA feature is transferred to the ASIC domain. The main innovative aspect of the approach proposed in this paper is not the concept of combining ASIC and FPGA on one die; this has already been done before. The novelty is to find ways to use standard components and cells for the FPGA part, to be able to enhance ASIC designs without being restricted by technological and vendor related barriers.
Among many other applications, reconfigurability can be leveraged to alleviate verification problems, which arise with today's 100 million gate designs. Dedicated, synthesized PSL [23] monitors, which are loaded into embedded FPGA cores, accelerate the process of narrowing down error locations on the chip.

Keywords-component; ASIC; FPGA; synthesis; PSL

I. INTRODUCTION
Whether to use FPGAs and/or ASICs in the design of a new system is a fundamental question. Both chip domains have their benefits and drawbacks, which must be considered carefully. ASICs have better timing performance, better scalability, more options, are cheaper in mass production and consume less power. On the other hand, FPGA based designs offer faster design cycles, better debugging and bug fixing capabilities, lower costs for small lots, and reconfiguration capabilities. Depending on numerous parameters, the one or the other concept is more suitable for a particular design job. An obvious question is why not combine the best of both worlds and embed an FPGA-like core into an ASIC design? This paper deals with technical aspects of the implementation of this idea based on components and cells which are available in (almost) all technologies and from all manufacturers. In the course of this paper such an embedded, independent FPGA core shall be denoted as EPLA (Embedded Programmable Logic Array).
The paper is organized as follows: First, related work is listed and the advantages of synthesizable FPGA cores are discussed. Then appropriate structures of logic elements and interconnection concepts are investigated. Finally, a special application of such embedded cores is presented.

II. RELATED WORK
The idea of embedding an EPLA core into ASICs is not new. Several companies (Abound Logic currently [1], Adaptive Silicon in 2002 [2]) attempted to introduce such a concept in the market. A similar approach was implemented by a project funded by the European Union [3], and there are even more projects pursuing this idea (e.g. GARP [24]). Nevertheless it seems that it is difficult to be commercially successful with the technology dependent, synthesizable core idea, although there is a certain demand in the industry for such a concept: Adaptive Silicon disappeared from the market after the year 2002, and Abound Logic moved their scope away from the embedded core concept to their own family of stand-alone FPGA devices. Two of the reasons for this lacking market acceptance might be:
• Technology dependency: In order to achieve a good ratio between physical silicon area consumption and implemented logic, the cores are typically optimized on transistor level. As a consequence it takes a lot of effort and time to be able to offer them for a particular technology and a particular manufacturer. In a time with fast innovation cycles in chip design it is not possible to supply solutions for all technologies of all manufacturers. This limits the manufacturer selection range of potential EPLA customers very much, and thus those customers may tend to implement without an EPLA instead of being forced to go with a certain technology/manufacturer.
• Cost overhead: As described above, the individual design of EPLA cores for particular technologies is a very complex job, and customers of course have to pay for this. On the other side, the customer must justify the use of an embedded core in his application from the commercial point of view. If these aspects do not match, it makes no sense to embed an EPLA in the design.
Consequently, the aspects of manufacturer independency and cost effectiveness are vital for the dissemination of EPLA approaches in the chip design industry. This paper introduces a concept for EPLA cores which takes that into account.

III. MANUFACTURER INDEPENDENT FPGA CORES
This chapter discusses how an EPLA can be realized in a manufacturer independent way. To be independent means:



• The EPLA must be built from cells which are available in (almost) every technology library, i.e. only standard logic cells and RAM blocks are permitted
• The EPLA design must fit in a standard ASIC design flow
• The core is limited to ASIC capabilities in size and timing performance

Typically, FPGA architectures comprise logic element arrays which are variably interconnected. Such a logic element (LE) consists of one or more look-up tables (LUTs), which realize a part of the logic function of the design to be mapped onto the FPGA. In addition, the LE contains one or more flip-flops in order to allow sequential behavior. The basic principles were patented by Xilinx co-founder Ross Freeman in 1988 [4]. This patent, however, expired in 2004, and this paved the way to use such principles for a wide range of applications.

A. EPLA Structure
The challenge is to find a LE architecture and an interconnection scheme which fulfills the above independency requirements and is still efficient with respect to the physical area consumption and timing performance of the core. Other than expected, it is not the LE design that is the most challenging task when developing EPLA architectures, but the interconnection scheme. This is the case because the number of possible interconnections of n LEs is O(n^2) [5]. Simplified example: for a core which comprises 1000 4-input LEs, the number of possible interconnections (crossbar) is 4 million. Since every interconnection must be switchable in the general case, this results in a reconfiguration storage requirement of 4 million storage elements for the interconnection definition alone. On the other side, typically only a very small amount of these possible interconnections is really needed, and thus such a trivial interconnection approach would be a terrible waste of silicon resources. Therefore this full interconnection scheme must be reduced by two measures:
• Interconnection clusters must be defined, which have full internal interconnection capabilities. The connections within the clusters must be efficient with respect to timing and area consumption. The interconnections in-between the clusters must be reduced as far as possible, because these are limited in number and are time and area consuming.
• The input count of the LEs must be chosen as high as possible. The reason for this is that the more fine grained an architecture is, the more logic cells are needed to form a given logic function. The more logic cells must be interconnected, the more wires are needed for this job, and wiring must be minimized because it is costly. Another driver for this large input count is the so called "group optimization", which is described in section C.
The counter problem of a high LE input count, however, is that the gate effectiveness (the ratio between logic gate count and physical gate count) of a LUT decreases with the input count of the LUT. The reason for this is that a LUT with k inputs can realize 2^(2^k) different boolean functions, as can be deduced from a truth table representation. Consider a LE structure like shown in Figure 1.

Figure 1: Logic structure of a logic element (LE) (a chain of configuration flip-flops, Config_in to Config_out, driving a multiplexer-based LUT between Logic_in and Logic_out)

In order to decode this number of boolean functions, 2^k storage elements are needed. On the other side, in [6] an upper bound for the gate count of a k-input logic cone is derived to be O(2^k / k). Thus the maximum gate effectiveness of a LUT is

    O((2^k / k) * 2^(-k)) = O(1/k)

In other words: the maximum gate effectiveness decreases with the input count of the LUT on a reciprocal basis. Moreover, this optimum effectiveness can only be achieved if an LE is packed with logic. In typical designs the distribution of input logic cones may vary from 2 inputs to more than 1000 inputs. If the EPLA architecture only offers LEs with a huge number of inputs, a lot of logic resources are wasted, even if such a large LE is able to realize a particular number of smaller input cones. The following example illustrates this aspect:

Example: Given is an 8-input LE with 4 outputs. This LE can realize logic in the range of one boolean function with 8 independent variables or four functions with two variables. The maximum effectiveness of the first case is a factor of 2^8/8 * 2^(-8) = 1/8, whereas the effectiveness of the second case is a factor of 4 * 2^2/2 * 2^(-8) = 1/32.

At this point we have gathered a number of parameters which play an important role in the design of a technology independent EPLA. These parameters are:
• Number of LEs in the core
• Input count of the LEs
• Number of outputs of the LEs
• Size of clusters with full interconnection capability

B. Logic Element Design
So far we have dealt with considerations on how LEs should be structured and connected in the EPLA. The next thing is to reason about the internals of a logic element with n inputs and m outputs. The main building block of an LE is the LUT that realizes up to m logic functions of n input variables. For the implementation of this LUT a RAM is used which holds m output bits for each of the 2^n input combinations. This is shown in Figure 2.
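The LUT sizing argument can be made concrete with a toy model (illustrative only; the class and its method names are my own, not from the paper): a k-input, m-output LUT is a RAM of 2^k words of m bits, and the example's effectiveness factors follow directly.

```python
class Lut:
    """A k-input, m-output LUT modeled as a RAM of 2**k words of m bits."""
    def __init__(self, k, m):
        self.k, self.m = k, m
        self.ram = [0] * (2 ** k)   # one m-bit word per input combination

    def program(self, out_bit, func):
        """Store boolean function func(k-tuple of bits) into output column out_bit."""
        for addr in range(2 ** self.k):
            bits = tuple((addr >> i) & 1 for i in range(self.k))
            if func(bits):
                self.ram[addr] |= 1 << out_bit
            else:
                self.ram[addr] &= ~(1 << out_bit)

    def eval(self, bits):
        """Look up the m output bits for one input combination."""
        addr = sum(b << i for i, b in enumerate(bits))
        return [(self.ram[addr] >> j) & 1 for j in range(self.m)]

# Gate effectiveness figures from the paper's 8-input, 4-output example
eff_one_8input  = (2 ** 8 / 8) * 2 ** -8      # one function of 8 variables -> 1/8
eff_four_2input = 4 * (2 ** 2 / 2) * 2 ** -8  # four functions of 2 variables -> 1/32

lut = Lut(3, 1)
lut.program(0, lambda b: b[0] ^ b[1] ^ b[2])  # a 3-input XOR as sample content
```

The 2^k-word RAM is exactly the "2^k storage elements" needed to decode all boolean functions of k inputs.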



Depending on the technology of the manufacturer, an efficient realization with a RAM block may only be reasonable for greater values of n and m. If n is less than a particular, technology dependent value, the realization of the LUT might be more efficient when based on flip-flops or latches like shown in Figure 1. The area consumption of the RAM based solution can be optimized if multi-read-ports are available in a technology. Then the LUTs of several LEs can share one RAM block like shown in Figure 3. Such shared RAM blocks consume less area than the sum of individual RAM blocks would do.

Figure 2: RAM based logic element (logic_in[n-1:0] addresses the RAM; each output logic_out[m-1:0] can optionally be registered, controlled by clk and a config bit)

Figure 3: Using multiple read ports to share one RAM between LEs (e.g. a 128K x 4 bit RAM with one write port for configuration data and read ports 0..256 serving several 8-input LUTs and a 16-input LUT)

The rest of the components of an LE is straightforward. Every output of the LUT feeds a flip-flop that can be optionally bypassed. Each configuration capability in a LE, like this bypass multiplexer control, needs one or more storage elements. In order to set those storage elements in a simple but efficient way, they are connected to form a long shift register, as known from the scan chain approach. Configuring the EPLA with a particular circuit means to load the core with specific data by using this chain and by loading the RAM based LUTs.

C. Interconnection Scheme
Several of the LEs described above are combined to form a cluster. In addition, such a cluster comprises an interconnection block, which connects LE outputs to other LE inputs inside this cluster (see Figure 4 for details). The interconnection block must meet several requirements. The terminology used for the discussion of these requirements in the following is borrowed from switching theory [19].
1. Allow arbitrary permutations of N inputs at M outputs with N ≥ M. Such a network is called "non-blocking". In [19] a distinction is made between "rearrangeably non-blocking" and "strictly non-blocking". This distinction is only of interest for the case of dynamical switching, which does not apply to our problem.
2. Allow connection of one input to more than one output. In the following text this feature is called "multicast".
3. The network should consume as low a gate count as possible.
4. The network should have as low a propagation delay as possible.
5. Groups of outputs are defined. Within such groups the order of input permutations is "don't care", i.e. it only matters that a particular input route appears inside the output group; its position inside the group is irrelevant. This aspect allows for particular optimizations. The reason why this optimization is possible is that in the application of the interconnection block the order of inputs for e.g. a specific target LE is not significant, because the LUT input configuration can be chosen freely.

Several interconnection block concepts have been investigated:
• Multiplexer based interconnection concepts
• The multicast Clos network [15][16][17]
• The Beneš network [18][19] and optimizations [20] with multiplexer based multicast extension
• The reverse Omega network [19] used in a Fat Tree architecture

Since the reverse Omega network has some interesting properties, this particular interconnection approach is discussed here in more depth. The structure of a reverse Omega network is shown in Figure 5. On the positive side, the reverse Omega network offers permutation capability at low gate and delay cost compared to all the other concepts. The number of 2:1 multiplexers in the structure for M = N is (N/2) * log2(N), and the delay in units of 2:1 multiplexer delays is log2(N).

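The restrictive blocking behavior of Omega-type networks can be illustrated with a small simulation (a sketch of a plain forward Omega network with destination-tag routing, not the authors' hardware; `routable` is a hypothetical helper name):

```python
def routable(perm):
    """Return True if the permutation perm (input i -> output perm[i]) can be
    routed through an N x N Omega network without blocking.

    Each of the log2(N) stages applies a perfect shuffle (rotate the wire
    address one bit left) followed by a 2x2 exchange that steers the path to
    the current destination bit (destination-tag routing, MSB first).  Two
    paths block as soon as they occupy the same wire."""
    n = len(perm)
    bits = n.bit_length() - 1
    pos = list(range(n))                         # wire currently holding input i
    for stage in range(bits):
        for i in range(n):
            shuffled = ((pos[i] << 1) | (pos[i] >> (bits - 1))) & (n - 1)
            dest_bit = (perm[i] >> (bits - 1 - stage)) & 1
            pos[i] = (shuffled & ~1) | dest_bit  # exchange switch output port
        if len(set(pos)) < n:                    # two inputs share a wire
            return False
    return pos == list(perm)
```

For N = 4, the identity permutation routes cleanly, while routing input 0 to output 0 together with input 2 to output 1 already collides in the first stage — a hint of how restrictive the non-blocking (CCM) condition discussed above is.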



Figure 4: Structure of a LE cluster

Figure 5: A 16 x 16 Reverse Omega network

On the negative side, the Omega network is only non-blocking under very strict conditions, and it does not support multicast. In [19], p. 114, it is shown that an Omega network is only non-blocking if the input-to-output assignment is sorted in a so-called cyclic compact monotone (CCM) sequence. This aspect restricts the routing ability of the Omega network very much, but can be overcome with the approach described in the following.

Although the Omega network is normally used in its symmetric form, i.e. M = N, it is possible to reduce it to asymmetric use, i.e. N ≥ M. The simplest way to achieve this is to set N − M output ports to unconnected and let the synthesis process do the rest. Remarkably, the ports which are set to unconnected must be carefully selected to achieve the best reduction result. It can be shown that, for M and N being powers of 2, the multiplexer count of an N:M reverse Omega network is approximately:

    mux(N, M) = (N/2) * log2(M) - (N - M)         for N ≥ M
    mux(N, M) = (M/2 - 1) * log2(N) + (M - 1)     for N < M

D. Combining Fat Tree and Omega Network (FTON)
The Fat Tree concept is described in [21]. A binary Fat Tree consists of several stages, splitting input-to-output connections into two branches in every stage. The splitting is done here using reverse Omega networks. The reverse variant is chosen because our application requires that the outputs, and not the inputs, are CCM. The idea is illustrated in Figure 6. Due to the output group optimization aspect, the routing within the Omega networks can be chosen in a way that blocking does not take place. The gate count behavior of the Fat Tree Omega network is shown in Figure 7 for G = M/4 and G = M/8 on a logarithmic scale.

Figure 6: Fat Tree realized with Omega networks for r = 2 (N inputs, M outputs in G groups; 2 networks of size N x M/2, 4 of size M/2 x M/4, 8 of size M/4 x M/8)

Due to the multi-stage structure, the delay behavior of the binary Fat Tree Omega network is bad compared to the other interconnection realizations, but the binary tree can be generalized towards an r-Tree, meaning that not two but r branches are used in each stage. Then, for constant r over the stages, log_r(G) stages are needed, resulting in a delay of:

    d = log2(N) + log2(M/r) + log2(M/r^2) + log2(M/r^3) + ...
      = log2(N) + (log_r(G) - 1) * (log2(M) - 1/2 * log2(r) * log_r(G))

This equation is valid for 2 ≤ r ≤ G. For r = G the entire circuit is reduced to a single stage with an individual network for every output group. The delay behavior is shown in Figure 8.

When inspecting the diagrams it is obvious that there is a competition between gate count and delay: the lower the gate count, the higher the delay. The best interconnection concepts are the Beneš network and the FTON. The Beneš based realization yields the interconnection block with the least gate count, but the highest delay. With the FTON, gate count can be traded for reduced delay in a wide range by variation of the parameter r.
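The closed-form FTON delay can be cross-checked numerically against the stage-by-stage sum it is derived from (a sketch; the function names are mine):

```python
import math

def fton_delay_sum(n, m, g, r):
    """Delay as the sum over stages: log2(N) for the first stage, then
    log2(M / r**i) for the remaining log_r(G) - 1 stages
    (all in units of 2:1 multiplexer delays)."""
    stages = round(math.log(g, r))
    return math.log2(n) + sum(math.log2(m / r ** i) for i in range(1, stages))

def fton_delay_closed(n, m, g, r):
    """Closed form: d = log2(N) + (log_r(G) - 1) * (log2(M) - 0.5*log2(r)*log_r(G))."""
    s = math.log(g, r)
    return math.log2(n) + (s - 1) * (math.log2(m) - 0.5 * math.log2(r) * s)

for r in (2, 4, 16):
    print(r, fton_delay_sum(256, 64, 16, r), fton_delay_closed(256, 64, 16, r))
```

For r = G the sum collapses to log2(N) in both forms, matching the single-stage case noted in the text; smaller r lowers gate count but raises delay.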



E. Architecture of the EPLA
With all these considerations we now define an architecture for the EPLA. The interconnection concept has three layers, as shown in Figure 9. The first layer deals with intra-cluster connections. Since this is the most powerful interconnection layer, the software flow described in section F takes care that strongly connected components (i.e. LEs) are concentrated inside clusters. The next interconnection layer connects adjacent LE clusters. Again, the fitting software takes care to allocate strongly connected clusters in adjacent blocks. The rest of the connections is wired by the last interconnection layer, the global switch.

Figure 9: Architecture of the EPLA (LE clusters between inputs and outputs, connected via a global switch)

Figure 7: Gate consumption of the Fat Tree Omega network compared to the other listed concepts (2:1 multiplexer count over input count N on a logarithmic scale; curves for Simple Multiplexer, Benes + Multicast, and Fat Tree Omega with r = 2, r = 4, r = 32)

Figure 8: Delay of the FTON for different values of r (same color coding as above)

F. Tool Flow
In the previous section the hardware architecture of the FPGA core was discussed. This chapter introduces some aspects of the mapping tool flow, i.e. the software flow which is applied to convert a VHDL model to an EPLA load. This flow is depicted in Figure 10.

Figure 10: Tool flow (START → synthesize the RTL code of the application using only AND2, OR2, INVERTER, D_FF → a Verilog netlist made of a few simple primitives → read the Verilog netlist into the mapping tool together with the simple primitives library → analyze the netlist and partition the logic to fit the EPLA architecture, yielding a software model of the FPGA core → create the bit stream for the EPLA core and write it out → DONE)

A first synthesis of the VHDL model is done with a commercial tool and the resulting netlist is created in Verilog. During the synthesis step the compiler is restricted to a set of simple cells (e.g. AND, OR, NOT, flip-flop) to simplify the following processing and reduce the primitive library. In the next step the Verilog netlist is read into a tool which does the fitting of the netlist to the architecture as described below.
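The flow's last step — serializing the configuration into a bit stream and shifting it into the scan-chain-like configuration register — can be sketched as follows (the field ordering in the stream is my assumption for illustration, not the paper's actual load format):

```python
class ConfigChain:
    """Model of the EPLA configuration shift register: all single-bit storage
    elements (bypass selects, interconnect settings) form one long chain.
    One bit is shifted in per clock; the first bit ends up deepest."""
    def __init__(self, length):
        self.flops = [0] * length

    def shift_in(self, bitstream):
        for bit in bitstream:                 # one clock cycle per bit
            self.flops = [bit] + self.flops[:-1]

def make_bitstream(bypass_bits, interconnect_bits):
    """Concatenate configuration fields into the serial load stream.
    NOTE: the field order is assumed here purely for illustration."""
    payload = bypass_bits + interconnect_bits
    return payload[::-1]                      # reverse so the payload lands in order

chain = ConfigChain(6)
chain.shift_in(make_bitstream([1, 0, 1], [0, 0, 1]))
```

After shifting, the chain holds the bypass bits followed by the interconnect bits, illustrating why the bit stream generator must emit bits in the reverse of their physical order along the chain.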



G. Decomposing the Logic Cones J. Creating the Load File for the FPGA
The logic cones of the first synthesis step are not appropriate Based on the results of the decomposition and the place &
to be fit into a k-input LUT. Logic cones, which have more route steps, the load file is generated by the tool. The LUT
than k input variables must be split into pieces with k inputs at logic is created as a truth table to be loaded into the LUT
maximum. This splitting process is called decomposition and RAMs. The single storage elements of the bypass multiplexer
can be done based on BDD (binary decision diagram) [7] and the interconnection networks form a long shift register.
operations. The decomposition topic has been covered in The configuration data for these (e.g. bypass, interconnection
academic circles very deeply and therefore several configuration) are created as a bit stream file to be loaded into
implementations are freely available. For this job the package the storage element chain.
BDS-pga 2.0 [8] has been chosen for the decomposition step.
This package, however, does neither support verilog input nor
allows sequential elements in the netlist. Therefore a new tool IV. UTILIZING THE EPLA
has been developed, which only re-uses the decomposing The EPLA concept described in this paper can be utilized for
algorithms of BDS-pga. The result of this decomposition step several purposes:
is a data structure, which consists of a tree of AND, OR, XOR,
XNOR, MUX operators applied to the input signals of a k-ary x Updating of ASIC functionality e.g. because
chunk of logic. implemented standards have changed
The ASIC can be manufactured before standards are
H. Placement settled finally. Once the standard definition phase is
Placing is the process, which assigns particular logic cone finished the ASIC can be adapted to the new version by
pieces to a dedicated EPLA LE. Starting with the data updating the built-in EPLA circuit.
structure of the decomposing step, chunks of logic are x Fixing of ASIC bugs
searched, which fit into the current LE allocation. Metrics for In case a bug is present in the ASIC functionality it might
this search are: be fixed by loading an error correction configuration in
x How well does a chunk fill the potentially partly used LE the EPLA. An architecture, which supports such a kind of
x How many inputs can be reused when allocating the application, is shown in Figure 11.
chunk to the LE x Saving additional accompanying FPGA devices on the
These criteria are evaluated for every LE, which is not yet full. board
The LE, which has the best ranking, is chosen to house the Sometimes functionality, which was defined after the
logic cone piece under investigation. The result of this process manufacturing of the ASIC, is implemented in additional
is a list of (partly) used LEs, which carry the logic of the FPGA devices on the board. Such FPGA device can be
netlist. saved, if the ASIC itself contains such an option.
x Deployment of the same ASIC in different hardware
I. Routing
environments
Routing is the process of defining interconnections between Depending on the application it may make sense to
LEs. Starting with the above list of logic pieces, an develop a universal ASIC for multiple purposes, which
interconnection table is created, which tells, which LE output incorporates an EPLA. The adaptation of the ASIC for a
is connected to which LE inputs. This interconnection table can be regarded as a graph, and thus graph-theoretical considerations can be applied. Remember that such LEs must be packed into clusters which have tight interconnection. The inter-cluster connections shall be reduced to a minimum, because such connections are expensive with respect to EPLA resources. This can be achieved by applying a graph partitioning algorithm to the interconnection table. There are many graph partitioning algorithms, with Kernighan/Lin [9] being the most famous. Unfortunately, this algorithm is O(n³) and therefore not suitable for the size of our problem. However, alternative algorithms like the Fiduccia/Mattheyses algorithm [22] have been developed to overcome this deficiency. The original version of this algorithm only supports splitting into two partitions, so it had to be generalized to multiple partitions. As a result of this partitioning step, LE clusters of almost equal LE counts are found which have minimized inter-cluster connections.

concrete function is done by loading the EPLA with a particular configuration. This approach, however, only makes sense if the number of different configurations exceeds around 50, or if the configurations are not known at design time. The reason for this is that the EPLA consumes a lot more ASIC silicon than the function which can be realized in it.

• Using the EPLA as a bed for variable circuit monitors: this approach is particularly noteworthy, and the concept is therefore described in detail in the next section.

Figure 11 illustrates the test and modification point architecture for bug fixing and monitoring purposes. This architecture gives the chance to insert a module at anticipated ASIC path locations. If no monitor and no bug fixing is required, the EPLA is just a pass-through for the paths. If a monitor is required, e.g. to locate a bug, the EPLA configuration is modified by inserting such a circuit (shown in red) at a particular path location.
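The partitioning step described above can be illustrated with a small move-based sketch. This is a deliberately simplified greedy variant of a Fiduccia/Mattheyses-style pass (the published algorithm uses gain buckets and accepts the best prefix of moves), written here for illustration only; it is not the authors' fitting tool:

```python
# Simplified single pass of a Fiduccia/Mattheyses-style bipartitioning
# heuristic for an LE interconnection graph. Nodes are LEs, edges are
# interconnections; the goal is to minimize inter-cluster connections.

def fm_pass(adj, part):
    """One greedy pass: repeatedly move the free node with the best gain.

    adj  -- dict node -> set of neighbour nodes
    part -- dict node -> 0 or 1 (bipartition), modified in place
    Returns the number of cut edges after the pass.
    """
    def gain(v):
        # cut edges removed minus cut edges created by moving v
        ext = sum(1 for u in adj[v] if part[u] != part[v])
        return ext - (len(adj[v]) - ext)

    locked = set()
    for _ in range(len(adj)):
        free = [v for v in adj if v not in locked]
        best = max(free, key=gain)
        if gain(best) <= 0:      # no improving move left in this pass
            break
        part[best] ^= 1          # move node to the other partition
        locked.add(best)         # each node moves at most once per pass

    return sum(1 for v in adj for u in adj[v]
               if v < u and part[v] != part[u])

# Two tight clusters {a,b,c} and {d,e,f} with one cross edge (c-d),
# started from a deliberately bad partition:
adj = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
       "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"}}
part = {"a": 0, "b": 1, "c": 0, "d": 0, "e": 1, "f": 1}
cut = fm_pass(adj, part)   # the pass recovers the two clusters
```

The pass moves b and d and leaves only the single cross edge c-d cut, which mirrors the goal stated above: clusters of similar size with minimized inter-cluster connections.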

166 May 17-19, 2010, Karlsruhe, Germany ReCoSoC 2010


Figure 11: Architecture for variable test and modification point insertion (ASIC core logic blocks connected through EPLA pass-through/insertion points)

A. Utilizing Cores as a Bed for Runtime Monitors

In the previous chapter a method for the embedding of technology-independent, synthesizable FPGA cores in an ASIC was introduced. The advantage of this method is that all kinds of circuits can be implemented after the ASIC has been manufactured. This also includes modules which support the validation engineer during his work. A general problem during this phase is the observability of the design. This lacking observability is the main reason why identifying the location of a design error at the hardware level is a time-consuming job. Thus it might be useful to be able to integrate monitors into the design at runtime. Such monitors check the behavior of the design concurrently and flag errors if something suspicious happens. Monitors could be embedded in the hardcoded part of an ASIC, but then they are only of limited benefit, because their scope is very restricted.

Here the EPLA technology can be helpful. A user-defined monitor is loaded into the FPGA core as needed. Such monitors can be derived from pre-existing PSL [12][23] assertions. This synthesis step is described in the following section.

B. Automata Construction

PSL comes with distinct flavors for the different implementation languages. In this work we concentrate on the VHDL flavor, but the algorithm can also be applied to all other languages. PSL is divided into four layers of abstraction. The boolean layer contains only the expressions of the underlying flavor, such as the VHDL operators and, or and not. All temporal aspects are contained in the next layer, the temporal layer. It supports expressions like always and until. The verification layer and modeling layer describe how the property must be used, and they specify aspects of the verification environment. To generate a checker automaton it is only necessary to consider the boolean and the temporal layer. In fact, the boolean layer is transparent to the generation algorithm, since the expressions in this layer can be virtually substituted by simple boolean variables. Thus, only the operators of the temporal layer will influence the generation algorithm. The IEEE PSL standard [13] supports three sets of operators in the temporal layer. Sequential Extended Regular Expressions (SERE) and the usual LTL operators represent the Foundation Language (FL). Additionally, the Optional Branching Extensions (OBE) provide a CTL-like set of operators. From the multitude of PSL operators, the standard defines a subset of only a few simple operators, given in Table 1, that can be used to express all other, more complex operators.

Table 1: Simple Subset

    FL      SERE
    b       r
    s       r
    (p)     r1 ; r2
    s!      r1 : r2
    !b      r1 | r2

The mapping of the subset to the operators is defined by so-called rewrite rules. Each PSL expression is recursively constructed from its basic parts. For example, to construct the automaton for the expression always{a; b}, first the automaton for {a; b} is constructed and then used by the algorithm for the always operator. The result of each step of the construction algorithm is a nondeterministic finite automaton with epsilon transitions (ε-NFA). ε-transitions are executed without any input. First, all the ε-transitions are eliminated. The result still contains non-deterministic transitions. In order to generate synthesizable VHDL code it has to be converted into a deterministic finite automaton (DFA). Listing 1 describes the algorithm which transforms an NDFA into a DFA.

Listing 1: NDFA → DFA

    A → generated automaton (DFA)
    A1 → given automaton (NDFA)
    for all states Z in A1 do
        find all following states of Z
        combine all transition conditions
        create new transitions from Z to combinations of the following states
    set start state

In the first step, all outgoing transitions from the start state which do not contain ε-transitions are determined. Afterwards, all combinations of these transition conditions and all following states are generated. These new states and transitions are added to the new deterministic automaton. This process is performed for all states of the non-deterministic automaton. During this algorithm different final states are generated. One or more final states are definitely correct. Additionally, there is one more final state to signal the first failure of the expression. As long as no final state is reached, the result of the expression is still pending. The main state of the automaton is the final error state, because the designer is only interested in the point in time when the property actually fails.

The generation process of a specification is supported by the specification platform "SpecScribe" [14]. This tool offers the possibility to define requirements, components and dependencies in a hierarchical way.
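The NDFA-to-DFA step paraphrased in Listing 1 is the textbook subset construction. The following is an illustrative, generic sketch of that construction (not the SpecScribe generator); ε-transitions are assumed to have been eliminated beforehand, as described above:

```python
# Generic subset construction (NDFA -> DFA). DFA states are frozensets of
# NFA states; each (state, symbol) pair has exactly one successor, which
# makes the result deterministic.

def nfa_to_dfa(start, transitions, accepting):
    """transitions: dict (nfa_state, symbol) -> set of successor states."""
    symbols = {sym for (_, sym) in transitions}
    dfa_start = frozenset([start])
    dfa_trans, seen, todo = {}, {dfa_start}, [dfa_start]
    while todo:
        current = todo.pop()
        for sym in symbols:
            # combine the follow sets of all NFA states inside `current`
            nxt = frozenset(s for q in current
                            for s in transitions.get((q, sym), ()))
            if not nxt:
                continue
            dfa_trans[(current, sym)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    dfa_accepting = {S for S in seen if S & accepting}
    return dfa_start, dfa_trans, dfa_accepting

# Small NFA that accepts once the sequence a;b has occurred (state 2 final):
nfa = {(0, "a"): {0, 1}, (0, "b"): {0},
       (1, "b"): {2},
       (2, "a"): {2}, (2, "b"): {2}}
start, trans, accepting = nfa_to_dfa(0, nfa, {2})

# Running the input "ab" on the resulting DFA reaches an accepting state:
state = start
for sym in "ab":
    state = trans[(state, sym)]
```

A hardware checker keeps the same structure: the combined transition conditions become boolean equations on the monitored signals, and the "first failure" state described above is simply one distinguished accepting state of the constructed DFA.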

One of these requirements can be defined as a "verification property", to which the PSL expression is added. After that, a finite state machine is generated. To simulate a monitoring automaton or implement it into a design, different representations can be generated automatically, e.g. synthesizable VHDL or SystemC.

C. Results

Currently the EPLA fitting tool is in experimental status. In comparison to commercial tools, which reach an estimated ratio between physical gate count and logic gate count of 20-50, the achieved factor of > 50 is still subject to further investigations.

The PSL approach has been verified so far on a Xilinx FPGA board with a Virtex-II Pro. The board was programmed with a design which receives packets and echoes them after processing. A finite state machine was integrated into this design. The final error state is indicated by an LED whenever the PSL expression fails. The structure of the board is shown in Figure 12.

Figure 12: Structure of demo board (a PC connected to a Virtex-II Pro board containing a MAC with TX/RX, a configuration unit, registers, a debug unit with the monitor automaton, and an LED)

V. CONCLUSION AND OUTLOOK

This paper presented an approach for the synthesis of technology-independent FPGA cores (EPLA) as part of SoC ASIC designs. Such cores can be used for a wide spectrum of applications.

As an example, a concept was presented which supports the test engineer in his work. In the case of functional deficiencies, dedicated monitors are loaded onto the technology-independent, synthesizable EPLA core in order to narrow down the specific location of a design error. For this purpose, assertions written in PSL are synthesized and loaded into the EPLA.

The status of the described scenario is currently experimental. Therefore the fitting results of a given RTL code to an EPLA with respect to area and timing optimizations are still a matter of further investigations.

Another topic under investigation is the ambition to reconfigure the EPLA as fast as possible in order to reduce the reconfiguration impact to a minimum. It is also planned to automatically introduce an error reporting infrastructure into the design that connects all monitor automata, collects the results and reports the corresponding data either to a processor within or to an entity outside the chip.

VI. REFERENCES

[1] Homepage of FPGA manufacturer Abound Logic: http://www.aboundlogic.com/
[2] Design & Reuse article: Adaptive Silicon's MSA 2500 programmable logic core TSMC test chips are fully functional. June 6, 2001
[3] Voros, N., Rosti, A., Hübner, M. (eds.): Dynamic System Reconfiguration in Heterogeneous Platforms - The MORPHEUS Approach. Springer (2009)
[4] Freeman, R.H.: Configurable electrical circuit having configurable logic elements and configurable interconnects. (Feb 19, 1988) Google Patent Repository
[5] MathWorld: Introduction to big-O notation. http://mathworld.wolfram.com/AsymptoticNotation.html
[6] Erickson, J.: CS 497: Concrete Models of Computation. http://compgeom.cs.uiuc.edu/jeffe/teaching/497/13-circuits.pdf (Spring 2003)
[7] Bryant, R.E.: Graph-based algorithms for Boolean function manipulation. IEEE Transactions on Computers, C-35(8):677-691 (1986)
[8] Homepage of BDS-pga: http://www.ecs.umass.edu/ece/tessier/rcg/bds-pga-2.0/
[9] Kernighan, B.W., Lin, S.: An efficient heuristic procedure for partitioning graphs. Bell System Technical Journal, Vol. 49, No. 2, pp. 291-308 (1970)
[10] Hendrickson, B., Leland, R.: The Chaco user's guide: Version 2.0. Sandia National Laboratories (1994)
[11] IEEE: Standard VHDL Language Reference Manual. IEEE Std 1076-2002 (Revision of IEEE Std 1076) (2002)
[12] Eisner, C., Fisman, D.: A Practical Introduction to PSL (Series on Integrated Circuits and Systems). Springer-Verlag New York, Inc. (2006)
[13] IEEE: Standard for Property Specification Language (PSL). IEEE Std 1850 (2005)
[14] Pross, U., Richter, A., Langer, J., Markert, E., Heinkel, U.: "SpecScribe - Specification data capture, analysis and exploitation", software demonstration at DATE'08 University Booth (2008)
[15] Clos, C.: "A study of non-blocking switching networks", Bell System Technical Journal, Vol. 32, March 1953
[16] Yang, Y.: "A Class of Interconnection Networks", IEEE Transactions on Computers, Vol. 47, No. 8, August 1998
[17] Yang, Y., Masson, G.M.: "The Necessary Conditions for Clos-Type Nonblocking Multicast Networks", IEEE Transactions on Computers, Vol. 48, No. 1, November 1999
[18] Beneš, V.E.: "Mathematical Theory of Connecting Networks and Telephone Traffic", Academic Press, 1965
[19] Pattavina, A.: "Switching Theory: Architectures and Performance in Broadband ATM Networks", Wiley, 1998, ISBN-13: 978-0471963387
[20] Beauquier, B., Darrot, E.: "On Arbitrary Waksman Networks and their Vulnerability", INRIA, October 1999
[21] Leiserson, C.E.: "Fat-trees: Universal Networks for Hardware-Efficient Supercomputing", IEEE Transactions on Computers, 34(10):892-901, October 1985
[22] Fiduccia, C.M., Mattheyses, R.M.: "A linear-time heuristic for improving network partitions", 19th Design Automation Conference, 1982
[23] Accellera: "Property Specification Language Reference Manual", v1.1, June 9, 2004
[24] Hauser, J.R., Wawrzynek, J.: "Garp: A MIPS Processor with a Reconfigurable Coprocessor", Proceedings of the IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '97), April 16-18, 1997



A New Client Interface Architecture for the
Modified Fat Tree (MFT) Network-on-Chip
(NoC) Topology
Abdelhafid Bouhraoua and Muhammad E. S. Elrabaa
Computer Engineering Department
King Fahd University of Petroleum and Minerals
PO Box 969, 31261 Dhahran, Saudi Arabia
{abouh,elrabaa}@kfupm.edu.sa; orwa@diraneyya.com

Abstract— A new client interface for the Modified Fat Tree (MFT) Network-on-Chip (NoC) is presented. It is directly inspired by the findings related to lane utilization and maximum FIFO sizes found in simulations of the MFT. A new smart arbitration circuit that efficiently realizes a round-robin scheduler between the receiving lanes has been developed. Simulation results show a clear viability and efficiency of the proposed architecture. The limited number of active receiving links has been verified by simulations. Simulations also show that the central FIFO size need not be very large.

Keywords — Networks-on-Chip, Systems-on-Chip, ASICs, Interconnection Networks, Fat Tree, Routing

I. INTRODUCTION

There has been a significant amount of effort made in the area of NoCs, and the focus has mostly been on proposing new topologies and routing strategies. Recently, however, the trend has shifted towards engineering solutions and providing design tools that are more adapted to reality. For example, power analysis of NoC circuitry has been studied intensively [1, 2], more realistic traffic models have been proposed [3], and more adapted hardware synthesis methodologies have been developed.

However, high-throughput architectures have not been addressed enough in the literature, although the need for them has started to become visible [4]. Most of the efforts were based on a regular mesh topology with throughputs (expressed as a fraction of the wire speed) not exceeding 30% [5]. In [6, 7] a NoC topology based on a modified Fat Tree (MFT) was proposed to address the throughput issue. The conventional Fat Tree topology was modified by adding enough links such that contention was completely eliminated, thus achieving a throughput of nearly 100% [6] while eliminating any buffering requirement in the routers. Also, the simplicity of the routing function, typical of trees, meant that the router architecture is greatly simplified. These results did not come without a price, mainly the high number of wires at the edge of the network. Also, buffering was pushed to the edge of the network at the client interfaces. Many of these issues have been discussed in [6, 7].

In order to overcome these limitations a new client interface architecture is proposed. This interface aims at reducing the number of parallel FIFOs to a single centralized FIFO. The modification of the client interface opens the door for a more practical implementation of the MFT NoC. This paper presents the new client interface architecture and its different circuitry. It also shows through simulation that the new architecture draws on the practical results to reduce the amount of required hardware resources. MFT NoCs are first briefly reviewed in the next section. The newly proposed client interface is then presented in section 3. Simulation results are presented in section 4, followed by conclusions in section 5.

II. MODIFIED FAT TREE NOCS

MFT is a new class of NoCs based on a sub-class of the Multi-Stage Interconnection Networks (MIN) topology; more particularly, a class of bidirectional folded MINs, chosen for its property of enabling adaptive routing. This class is well known in the literature under the name of Fat Trees (FT) [8]. The FT has been enhanced by removing contention from it, as detailed in [7]. Below is a brief description of the FT and MFT network topologies.

A. FT Network Topology

An FT network, Figure 1, is organized as a matrix of routers with n rows, labeled from 0 to n-1, and 2^(n-1) columns, labeled from 0 to 2^(n-1) - 1. Each router of row 0 has 2 clients attached to it (bottom side). The total number of clients of a network of n rows is 2^n clients. The routers of the other rows are connected only to other routers. So, in general, a router at row r can reach 2^(r+1) clients.

B. MFT Topology

Contention is removed in the MFT by increasing the number of output ports in the downward direction of each router [6, 7], Figure 2. At each router, the downward output ports (links) are double the number of upper links. Consequently, the input ports of the adjacent router (at the lower level) to which it is connected are also doubled. This is needed to be able to connect all the output ports of the upper-stage router.
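The sizing formulas above can be checked with a few lines. This is an illustrative sketch following the paper's description (not a network generator); the 64-client case matches the 63 FIFO lanes per client mentioned later in the paper:

```python
# Sanity check of the FT/MFT dimensions described in section II.

def mft_dimensions(n):
    """Dimensions of an n-row fat tree network."""
    return {
        "rows": n,
        "columns": 2 ** (n - 1),
        "clients": 2 ** n,
        # a router at row r can reach 2^(r+1) clients
        "reachable_from_row": {r: 2 ** (r + 1) for r in range(n)},
        # in the MFT, buffering is pushed to the edge: each client
        # interface ends up with 2^n - 1 receiving links
        "client_input_links": 2 ** n - 1,
    }

dims = mft_dimensions(6)   # 64-client network: 63 receiving links/client
```

Note that the top row (r = n-1) reaches 2^n clients, i.e. the whole network, which is consistent with the tree growing toward the root.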



This will continue until the client is reached, where it will have 2^n - 1 input links. Each of these input links will feature a FIFO buffer, as called for by the original MFT architecture [6, 7].

Figure 1: Regular Fat Tree Topology (8 clients)

Figure 2: Modified FT Topology

III. CLIENT INTERFACE

As was explained in section II, the original MFT architecture requires 2^n - 1 input FIFOs at the client interface. The sizes of these FIFOs are set by the NoC designer depending on many factors, such as the communication patterns among clients, the emptying (data consumption) rate by a client, application requirements (latency), etc. [7]. Figure 3 below shows the structure of the client interface in the original MFT.

Figure 3: Client interface of the original MFT (the down links from the router each feed one of the parallel FIFOs in front of the Client/IP, next to the single up link)

It is evident that although these FIFOs may be of a small size, their structure represents the largest part of the cost in terms of area for the MFT network, since the routers are bufferless and have a very low gate count. Also, extensive simulations with different traffic generators showed that only a small fraction of FIFO lanes are active per client simultaneously [6, 7].

In order to reduce the wasted FIFO space represented by the original MFT's client interface, a newly designed interface is proposed, Figure 4. It is made of two parts: an upper part consisting of several bus-widener structures, named parallelizers from this point forward, and a lower part that is simply a single centralized FIFO memory, to which all the outputs of the different parallelizers are connected through a single many-to-one multiplexer.

Figure 4: Block Diagram of the New Client Interface (the receiving links feed the parallelizers and their control logic, a many-to-one MUX, the FIFO allocation/deallocation logic and the central FIFO on the client side)

Each one of the parallelizers is made of two layers. The first layer is a collection of registers connected in parallel to the incoming data bus from one of the receiving ports. Packet data is received into one of these registers one word at a time. When this layer is full, an entire line, made by concatenating all the registers of the first layer, is transferred to a second set of registers (the second layer of the parallelizer) in a single clock cycle. The ratio between the width of the parallel bus and the width of a single word is called the parallelization factor.

Packet portions from different sources are received on different parallelizers simultaneously and independently. When the first portion of a packet is received and transferred to the second layer of the parallelizer, a flag is set to request the transfer of a new packet to the FIFO. The control logic responsible for these transfers will first attempt to reserve space in the FIFO corresponding to one packet.
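The two-layer behavior described above can be modeled in a few lines. This is an illustrative software sketch of one parallelizer (not the authors' hardware; for simplicity it ignores back-pressure when the second layer is still occupied):

```python
# Behavioural sketch of one two-layer parallelizer: words arrive one per
# cycle from a receiving link; a full first layer is transferred to the
# wide second-layer register "in a single clock cycle".

class Parallelizer:
    def __init__(self, factor):
        self.factor = factor   # parallelization factor: words per wide line
        self.layer1 = []       # first layer, filled word by word
        self.layer2 = None     # second layer, wide register awaiting FIFO
        self.ready = False     # flag polled by the control logic

    def push_word(self, word):
        # receive one word from the link into the first layer
        self.layer1.append(word)
        if len(self.layer1) == self.factor:
            # concatenate the whole first layer into one wide line
            self.layer2, self.layer1 = tuple(self.layer1), []
            self.ready = True

    def pop_line(self):
        # the control logic drains the second layer toward the central FIFO
        line, self.layer2, self.ready = self.layer2, None, False
        return line

p = Parallelizer(factor=4)
for w in ("w0", "w1", "w2", "w3"):
    p.push_word(w)
# p.ready is now set; pop_line() yields the concatenated 4-word line
```

After four pushes the `ready` flag plays the role of the request flag described above, and one `pop_line` call corresponds to one wide write into the central FIFO.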



The condition for this architecture to produce efficient results is the adoption of a fixed packet size. This condition simplifies the space allocation in the FIFO and relieves the control logic of any space or fragmentation management due to variable-size allocation and disposal. In case the FIFO is full and no space could be reserved, the request is rejected and the backpressure mechanism is triggered on that requesting port.

The control logic continuously transfers the received packet words to the FIFO. Every time a packet portion enters the second layer of registers in one of the parallelizers, a flag is set to indicate the presence of data. Those parallelizers which are currently receiving packets are said to be active. Only active parallelizers are continuously polled to check for the presence of data. The polling follows a round-robin policy. A single clock cycle is used to process the currently selected parallelizer.

Figure 5: Intelligent Request Propagation Circuitry (a chain of status and selection flip-flops with multiplexers that bypass inactive stages)

Polling only the active parallelizers presupposes some mechanism to "skip" all the non-active parallelizers between two active ones. In order to avoid wasting clock cycles crossing those non-active parallelizers, a special request propagation circuit has been designed. Figure 5 shows this circuit's schematic. The upper set of flip-flops corresponds to the status flags indicating whether a parallelizer is active or not, while the lower one is used to indicate which parallelizer is selected to transfer its data during a given clock cycle. The multiplexers are used to instantly skip the non-active parallelizers.

As a result of this fast polling scheme, packets arriving simultaneously on different parallelizers may be received in a different order. Packet order from the same source is still guaranteed, though, because the network is bufferless.

IV. SIMULATION RESULTS

Simulations were carried out using a cycle-accurate C-based custom simulator (developed in-house) that supports uniform and non-uniform destination address distributions as well as bursty and non-bursty traffic models. The packet size was fixed at 64 words. Only bursty traffic with non-uniform address generation was used. The varying parameters were the network size, the central FIFO size and the parallelization factor value.

Five injection rates corresponding to 0.5, 0.6, 0.7, 0.8 and 0.9 words (flits)/cycle/client were simulated. The first result confirming the viability of the solution was the throughput, which matched the input rate in most of the cases, with a maximum difference lower than 1% in a few cases.

In all the figures that follow, the latency is expressed in clock cycles.

Figure 6: Latency Comparison with the original client interface

Latency values were reduced dramatically because of the output rate of the FIFO. Dual-port memories are the natural choice for implementing FIFOs. Both data buses are generally the same size on both ports of the dual-port memory. Therefore, the FIFO data bus has the same size as the parallelizers' bus. A wide output bus translates into fewer clock cycles to read or write an entire packet. More packets are moved per unit of time, which means that packets spend less time in the FIFO waiting to be sent out, leading to smaller latencies, as shown in Figure 6. It is important to note that the latency figures across the network did not change and are expected to be small, as the entire network is bufferless.

Figure 7 shows a subset of the obtained simulation results. The 32-client network results are shown (left to right) for parallelization factor values of 8, 16 and 32 for different central FIFO sizes (from 8 packets to 32 packets). The latency figures are very low compared to those obtained with the previous architecture of the client interface (Figure 6). The latency range corresponding to a wider parallelizer is lower than the range corresponding to a narrower one. The other results (not shown here for lack of space) are similarly lower for wider parallelizers. These findings confirm the efficiency of the proposed solution.
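The "skip" behavior of the request-propagation circuit can be expressed as a tiny software model. This is an illustrative sketch of the round-robin-with-skip policy (not the flip-flop/multiplexer netlist of Figure 5, which resolves the skip combinationally within one cycle):

```python
# Round-robin grant over parallelizers, instantly skipping inactive ones.

def next_grant(active, last):
    """Index of the next active parallelizer after `last`, skipping all
    inactive ones; None if no parallelizer is active."""
    n = len(active)
    for offset in range(1, n + 1):
        idx = (last + offset) % n
        if active[idx]:
            return idx
    return None

# Lanes 0, 3 and 5 are inactive; the grant rotates over active lanes only.
active = [False, True, True, False, True, False]
order, g = [], 0
for _ in range(4):
    g = next_grant(active, g)
    order.append(g)
# order == [1, 2, 4, 1]: no cycles are spent on the inactive lanes
```

In hardware the same effect is achieved without iteration: the multiplexer chain forwards the request past inactive stages combinationally, so exactly one clock cycle is spent per granted parallelizer, as stated above.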



Figure 7 - Simulation Results: latency (clock cycles) versus injection rate (flits/clock cycle/client) for three
parallelization factors (8, 16 and 32 from left to right) for several central FIFO sizes (8, 16, 24 and 32 packets).
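The wide-bus effect noted above can be checked with simple arithmetic. This is a back-of-the-envelope sketch using the fixed 64-word packet size from the simulations (not the in-house simulator):

```python
# Packet drain time at the central FIFO: the FIFO port is as wide as a
# parallelizer line, so one line of `factor` words moves per clock cycle.

PACKET_WORDS = 64

def cycles_per_packet(parallelization_factor):
    return PACKET_WORDS // parallelization_factor

rates = {p: cycles_per_packet(p) for p in (8, 16, 32)}
# {8: 8, 16: 4, 32: 2}: a factor-32 interface drains a packet 4x faster
# than a factor-8 one, which matches the lower latency ranges reported
# for wider parallelizers
```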

The central FIFO size has little impact on the results, which favors size reduction, as a FIFO with a size as low as 8 packets can produce acceptable results.

A tentative synthesis of the new structure yielded about 35K gates per client for a parallelization factor of 8 for a network of 64 clients. This represents approximately an area of 0.185 mm² in the 0.13 µm technology. Added to that is the dual-port SRAM area of 0.009, 0.018, 0.028 and 0.038 mm² for a FIFO size that accommodates respectively 8, 16, 24 and 32 packets. This represents a significant improvement compared to the 2.1 mm² occupied by the client interface in the previous architecture, which uses 63 SRAM FIFOs of 2 KBytes each.

V. CONCLUSIONS

A new architecture for the client interface of the MFT NoC has been proposed. This new architecture considerably reduces the hardware resources necessary to implement the receiving client interface. Detailed block diagrams of this architecture have been shown and described. Its operation and step-by-step behavior have been described as well. A new arbitration circuit that intelligently "skips" disabled request lines to realize an efficient round-robin, where no clock cycles are wasted, is presented. Simulations have given clear evidence of the viability of a single centralized FIFO that is simultaneously filled by a limited number of receiving links. The limited number of active receiving links has been verified by simulations. The simulation results have shown a considerable reduction of latency compared with the previous solution. They have also shown the little impact of the FIFO size on the latency, which implies that a larger FIFO is not necessary. The client interface synthesis yielded an area smaller than in the previous architecture of the client interface by one order of magnitude.

ACKNOWLEDGMENTS

This work was supported by King Fahd University of Petroleum and Minerals (KFUPM) through grant # IN070367.

REFERENCES

[1] E. Nilsson and J. Öberg, "Reducing power and latency in 2-D mesh NoCs using globally pseudochronous locally synchronous clocking", CODES+ISSS 2004.
[2] E. Nilsson and J. Öberg, "Trading off power versus latency using GPLS clocking in 2D-mesh NoCs", ISSCS 2005.
[3] V. Soteriou, H. Wang and L. Peh, "A Statistical Traffic Model for On-Chip Interconnection Networks", in Proceedings of the 14th IEEE Intl. Symp. on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS '06), Sept. 2006, pp. 104-116.
[4] H.C. Freitas and P. Navaux, "A High Throughput Multi-Cluster NoC Architecture", 11th IEEE International Conference on Computational Science and Engineering, July 16-18, 2008, Sao Paulo, Brazil, pp. 56-63.
[5] K. Goossens, J. Dielissen, A. Radulescu, "Æthereal network on chip: concepts, architectures, and implementations", IEEE Design and Test of Computers, Volume 22, Issue 5, Sept.-Oct. 2005, pp. 414-421.
[6] A. Bouhraoua and M.E.S. El-Rabaa, "A High-Throughput Network-on-Chip Architecture for Systems-on-Chip Interconnect", Proceedings of the International Symposium on System-on-Chip (SOC06), 14-16 November 2006, Tampere, Finland.
[7] A. Bouhraoua and M.E.S. El-Rabaa, "An Efficient Network-on-Chip Architecture Based on the Fat Tree (FT) Topology", Special Issue on Microelectronics, Arabian Journal of Science and Engineering, Dec. 2007, pp. 13-26.
[8] C. Leiserson, "Fat-Trees: Universal Networks for Hardware-Efficient Supercomputing", IEEE Transactions on Computers, Vol. C-34, No. 10, pp. 892-901, October 1985.



Implementation of Conditional Execution on a
Coarse-Grain Reconfigurable Array
Fabio Garzia, Roberto Airoldi, Jari Nurmi
Department of Computer Systems
Tampere University of Technology
33101 Tampere, FI
Email: name.surname@tut.fi

Abstract—This paper presents a method to implement switch/case-type conditional execution on a coarse-grain reconfigurable array based on a SIMD paradigm. The implementation does not introduce dedicated hardware; it utilizes only the functional units of the processing elements composing the machine and the possibility to reconfigure each processing element at run-time in one clock cycle. The method is employed to map algorithms for linear search or calculation of the maximum value on vectorized data.

I. INTRODUCTION

During recent years, academic and industrial research groups have proposed several coarse-grain reconfigurable architectures (CGRAs). The typical CGRA is characterized by a one- or two-dimensional array of processing elements (PEs), modeled on general-purpose computer systems. A common choice is to provide very simple processing elements based on programmable arithmetic-logic modules and characterized by multiplexing logic for the interconnections between PEs. This is the case for Morphosys [1] and Montium [2]. These machines are particularly suitable to map SIMD applications. However, they provide poor support for control tasks. The main issue is the mapping of conditional execution. In Morphosys the designers introduced support for guarded execution and pseudo-branches [1], using dedicated logic to store branch tags and implement predication. In Montium [2] an external sequencer takes care of conditional execution.

In this work we propose an alternative approach to implement the switch/case type of conditional execution on a coarse-grain reconfigurable machine. This approach uses the functional units already provided in each PE of the machine and the possibility to reconfigure functionalities and interconnections in one clock cycle.

The paper is organized as follows. First we present some details of the reconfigurable device. Then we describe the implementation of the switch/case conditional execution and some practical examples of its employment. Finally we draw some conclusions.

II. CREMA ARCHITECTURE

CREMA [3] is a Coarse-grain REconfigurable array with Mapping Adaptiveness. Its architecture is based on a matrix of 4 × 8 coarse-grain processing elements (PEs) that can process two 32-bit inputs and generate two 32-bit outputs per clock cycle. The operations supported are integer and floating-point arithmetic, shifting, LUT-based and boolean operations. In addition, CREMA supports 16 different inter-PE interconnections, divided into nearest-neighbor, interleaved, and global interconnections.

CREMA is based on a template that is customized according to the requirements of a kernel to execute. This is the reason for the term "mapping adaptiveness". The idea is that the application developer maps a certain kernel onto the CREMA template. This mapping generates a set of contexts that are used in the run-time execution of the kernel by CREMA. Based on the set of contexts required, a minimal version of CREMA is generated, in which all the unused functional units and interconnections are removed from the initial template.

The reduced version of CREMA implements run-time reconfiguration, because each PE may support more than one operation or connection. The run-time choice between the functional units and the interconnections provided in hardware is performed by setting a configuration word in each PE. Changing the configuration word corresponds to changing the current functionality of the PE. In practice, configuration words are stored in a PE memory that can host more than one configuration word, allowing a one-cycle reconfiguration.

III. IMPLEMENTATION OF A switch/case STATEMENT

A control flow mechanism based on a switch/case statement is implemented using these features of the CREMA template:
1) the Look-Up Table (LUT) among its functional units;
2) the possibility for any PE to acquire the configuration word from the upper PE instead of using its own configuration memory;
3) the run-time reconfiguration performed in one clock cycle.

A switch/case conditional statement is based on the execution of different operations according to the value of a selection variable. To implement it on the CREMA array, we assume that the possible values of the selection variable can be used to address a memory, i.e., they compose a sequence of integers starting from zero or can be easily converted into such a sequence using arithmetic or logic operations. The idea is that these values can select one location of a LUT instantiated in one PE. In addition, consider that each PE can be configured by the upper PE instead of its own configuration memory.



Fig. 1. Implementation of a switch/case with four branches.

Therefore, we can map a selection mechanism followed by a conditional operation using two consecutive PEs in a column of the array (see Fig. 1). The first PE ("LD" in the figure) loads a value from its LUT addressed by the condition variable, and the second PE (in black in the figure) implements the functionality specified by the output of the LUT.
There is no theoretical limitation on the size of the switch/case, because the size of the LUT can be decided at design time. However, there is a practical constraint: the different operations required by the branches of the switch/case must be implemented using different configurations of the same PE. This is not always possible. An alternative is to use the PE as a selector, whose only task consists of taking a value from one of the possible execution paths, mapped using more than one PE. Fig. 1 depicts a situation in which the switch/case is characterized by 4 branches; each branch is mapped on a set of PEs ("Branch #") and the condition is also evaluated using two PEs. This way the implementation of the switch/case mechanism is constrained only by the array resources.
Notice that the switch/case mechanism can be used to implement an if-then-else. It is sufficient to convert the if-then-else into a switch/case with two branches. However, the condition evaluation may be more difficult to map.

IV. TEST CASES
The mechanism illustrated here can be used to implement a linear search on vectorized data. For example, it is possible to check in a vector of integers if there are elements greater than, less than or equal to a fixed value. Due to the nature of CREMA execution, the outcome is always another vector. For example, this vector may have ones in the positions of the elements that satisfy the required condition and zeros elsewhere. Such a vector can be used for further processing by CREMA, like adding together all the ones to know how many elements satisfy the fixed condition. These linear searches can be performed in N clock cycles by CREMA, where N is the number of elements of the array.
One of the implementation issues in these algorithms is to match the condition with a switch/case statement, which implies that a value is associated with each result of the condition evaluation. Our approach does not insert ad hoc logic for condition evaluation, therefore the value must come from one of the provided PE operations. In most cases, the solution is simple. If we need to evaluate the condition x > C, we can perform in a PE the operation x − C and then consider the most significant bit by shifting the result by 31, so that we get 0 or 1 according to the condition. This means that we need only two PEs. On the other hand, the condition x = C requires a subtraction and then a bitwise OR between all the bits of the result.
The same mechanism was used to implement an iterative algorithm for the search of a maximum inside a vector. The algorithm is composed of several steps. At each step the elements are grouped in pairs and the maximum inside each pair is stored into the memory. At each step we get a new vector that is half the size of the previous one. The steps are iterated until only one element is left; therefore the execution requires log2 n steps. The algorithm employs another feature of CREMA that allows writing the results to the output memory according to a specific pattern. The algorithm was used in the implementation of a W-CDMA cell search on a platform based on CREMA [4].

V. CONCLUSION
In this paper we propose a method to map switch/case conditional execution on a coarse-grain reconfigurable array. The method does not require additional logic for predication or branch support, but is based on the possibility to reconfigure the PE functionality in one clock cycle and to map condition evaluation and conditional operations onto the PEs. The method has been employed for the mapping of linear search on vectorized data and calculation of the maximum value inside a vector.

ACKNOWLEDGMENT
This work was partially funded by a grant awarded by the Finnish Cultural Foundation, which is gratefully acknowledged.

REFERENCES
[1] M. Anido, A. Paar, and N. Bagherzadeh, "Improving the operation autonomy of SIMD processing elements by using guarded instructions and pseudo branches," in Proceedings of the Euromicro Symposium on Digital System Design, 2002, pp. 148-155.
[2] G. Smit, P. Heysters, M. Rosien, and B. Molenkamp, "Lessons Learned from Designing the MONTIUM - a Coarse-grained Reconfigurable Processing Tile," in Proc. International Symposium on System-on-Chip, 2004, pp. 29-32.
[3] F. Garzia, W. Hussain, and J. Nurmi, "CREMA: A coarse-grain reconfigurable array with mapping adaptiveness," in Proceedings of the 19th International Conference on Field Programmable Logic and Applications (FPL 2009). Prague, CZ: IEEE, September 2009, pp. 708-712.
[4] F. Garzia, "From run-time reconfigurable coarse-grain arrays to application-specific accelerator design," Ph.D. dissertation, Tampere University of Technology, December 2009.
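The two-PE mechanism described above (condition evaluation in one PE, a LUT-driven operation in the next) can be modelled in software. The following Python sketch is purely illustrative, under our own assumption of a 32-bit PE datapath; the function names are hypothetical and this is not CREMA tool output. The selector for x > C is taken as the most significant bit of C − x, the selector for x = C as the OR-reduction of the bits of x − C, and the selector then indexes a LUT that picks the operation applied by the next PE.

```python
MASK = 0xFFFFFFFF  # emulate the 32-bit PE datapath

def msb(v):
    # Shift right by 31 to expose the sign bit of a 32-bit value
    return (v & MASK) >> 31

def sel_greater(x, c):
    # x > c  <=>  c - x is negative  <=>  MSB of (c - x) is 1
    return msb(c - x)

def sel_equal(x, c):
    # Bitwise OR-reduction of (x - c): the result is 0 only when x == c
    d = (x - c) & MASK
    r = 0
    for i in range(32):
        r |= (d >> i) & 1
    return r  # 0 means "equal"; the LUT is simply filled accordingly

def switch_case(sel, lut, x):
    # Second PE: apply the operation selected from the LUT
    return lut[sel](x) & MASK

# Linear search: mark elements greater than C with 1, others with 0
C = 10
lut = [lambda v: 0, lambda v: 1]  # indexed by the selector
vec = [3, 42, 10, 17]
marks = [switch_case(sel_greater(v, C), lut, v) for v in vec]  # [0, 1, 0, 1]
```

Summing `marks` then yields the number of elements satisfying the condition, matching the further-processing step described in Section IV.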



Dynamically Reconfigurable Architectures for High
Speed Vision Systems
Omer Kilic, Peter Lee
School of Engineering and Digital Arts
University of Kent
Canterbury, KENT, CT2 7NT, England
Email: {O.Kilic, P.Lee}@kent.ac.uk

Abstract—High-Speed Vision applications have very specific and demanding needs. Many vision tasks have high computational requirements and real-time throughput constraints. In addition to a large number of fine-grain computations, vision tasks also have complex data flow which varies significantly within a single application and across different applications.
Conventional architectures, like microprocessors and standard Digital Signal Processors (DSPs), are not optimised for these kinds of highly specific applications and as a result do not perform well because of their mostly sequential nature. By using a combination of reconfigurable architectures that can adapt themselves to the requirements of the system dynamically, significant performance improvements can be achieved.

I. INTRODUCTION
This PhD project focuses on evaluating the performance gains achieved by utilising a combination of processing devices, each with different characteristics and strengths, to satisfy the requirements of high speed vision systems. In the proposed system, a combination of a Field Programmable Gate Array (FPGA), a Central Processing Unit (CPU) and a Graphics Processing Unit (GPU) will be employed. The FPGA portion of the system will be responsible for image acquisition and the low-level pre-processing tasks such as filtering, edge detection and thresholding, and the CPU will act as the system supervisor overseeing the operation of the FPGA and the GPU sections of the system. The CPU will also be responsible for coordinating the outside connectivity of the system, which can be in the form of network communication or standard embedded interfaces. For processing data on the GPU, the CUDA™ (Compute Unified Device Architecture) framework will be used.
While there are some custom highly specialised systems available that utilise this sort of heterogeneous combination of devices, there is a lack of an open, non-proprietary platform; hence the ultimate aim of this project is to define a flexible low-cost framework for high speed vision systems that utilises commercial off-the-shelf (COTS) peripherals, where vision tasks can be described in a high level system description package. The system description package will define the overall purpose of the system and it will also provide flexible constructs that enable the system to adapt itself to changes in the operating environment.
In Section 2, a general overview of the processing requirements of vision systems will be discussed. Section 3 will outline the advantages of using reconfigurable architectures and Section 4 will provide details about the proposed system. Section 5 will outline the challenges and Section 6 will provide the conclusion for this paper.

II. VISION SYSTEMS
The goal of Machine Vision is to robustly extract useful, high level information from images and video. The type of high-level information that is useful depends on the application. Vision systems have a wide range of applications, from highly specialised instrumentation and process automation tasks to general consumer electronics, although this proposed architecture mainly focuses on providing a flexible low-cost framework for industrial instrumentation tasks.
Vision algorithms have several stages of processing. The lowest level operations, performed on raw sensor data, usually involve a lot of arithmetic operations on pixel values. These operations can benefit quite significantly from the parallel nature of the architecture in use, and reconfigurable architectures have been most widely used for accelerating these algorithms. Examples of computations at this level include filtering, edge detection, and edge-based segmentation, among others [1].
An example application that can benefit from our proposed system is Stereo Vision. Stereo Vision is a traditional method for acquiring 3-dimensional information from a stereo image pair. The instruction cycle time delay caused by numerous repetitive operations makes the real-time processing of stereo vision difficult when using a conventional computer [2]. By abstracting the different layers of the Stereo Vision system to run in parallel on different devices, significant performance improvements can be expected.

III. RECONFIGURABLE ARCHITECTURES
Reconfigurable architectures utilise hardware that can be adapted at run-time to facilitate greater flexibility without compromising performance. Fine-grain and coarse-grain parallelism can be exploited because of this adaptability, which provides significant performance advantages compared to conventional microprocessors [3]. An example of this adaptability could be an image processing system that dynamically adjusts the window size of the algorithm employed. The ability to change the functionality of the device during run-time also makes these architectures more flexible compared to traditional



methods like using Application Specific Integrated Circuits (ASICs) or static instruction-set processors. Depending on the application, they may also provide cost savings by reducing the required logic density.
Most research on reconfigurable computing has been done based on Field Programmable Gate Arrays (FPGAs), due to their flexible configuration and their shorter design cycle compared to ASICs. The FPGA approach adds design flexibility and adaptability with optimal device utilization while conserving both board space and system power, which is often not the case with DSP chips. Currently, FPGAs are capable of supporting multi-million gate designs on a chip, and the computer vision community has become aware of the potential for massive parallelism and high computational density in FPGAs. In addition, the software-like properties afforded by Hardware Description Languages (HDLs), such as encapsulation and parameterization, allow creating more abstract, modular and reusable architectures for image algorithms [4].

IV. PROPOSED SYSTEM
As discussed earlier, the proposed system will employ FPGA and GPU co-processors, connected to a CPU host machine which will act as the co-ordinator of the entire system. This host machine will be responsible for parsing the system description package and off-loading relevant processing tasks on to the co-processors. It will also be responsible for interfacing the system to the outside world, which can be in the form of network communication or conventional embedded interfaces.
The system description package will define the overall purpose of the system and will have flexible constructs to provide adaptability to changes in the processing tasks. By providing a flexible format, the need for constant monitoring and the reconfiguration overhead of the system can be reduced. In order to maximise the performance potential of the inherent parallel nature of the FPGA and the GPU devices, concurrent programming languages (and languages with concurrency support) are being investigated for use within the system description package.
The system can essentially be partitioned into two sub-systems:
• The low-level image acquisition and pre-processing unit, powered by an FPGA device which will have an array of reconfigurable blocks that can be modified to customise the operation of the device dynamically
• The further processing and characterisation unit, powered by the CPU and the GPU on the host machine, which can further process the images coming from the FPGA unit
These units are tightly coupled with each other, and preventing bottlenecks will be a very important part of the implementation. A high speed PCI Express interface will be employed to provide connectivity between these sub-systems.

A. FPGA Co-processor
Programmable Logic Devices, mainly FPGAs, have always been the choice for high speed and computationally intensive applications. The speed advantage of FPGAs derives from the fact that programmable hardware is customised to a particular algorithm. Thus the FPGA can be configured to contain exactly and only those operations that appear in the algorithm. In contrast, the design of a fixed instruction set processor must accommodate all possible operations that an algorithm might require for all possible data types [5].
Because of the flexible nature of FPGAs, reconfigurable architectures often make use of these devices. Modern FPGAs have native support for dynamic reconfigurability and contain dedicated DSP resources and high speed memory interfaces, which makes them particularly suitable for low level image pre-processing tasks.

[Figure: the FPGA device comprises a static portion (Input Controller fed by the camera(s), PCI/E Interface to the host, Reconfigurability Controller, and Memory Interface to external memory) and a reconfigurable portion holding a matrix of RC Blocks.]
Fig. 1. FPGA Architecture

Figure 1 outlines the proposed FPGA co-processor architecture. The static part of the system consists of:
• Input Controller, responsible for data/image acquisition from outside sources, such as one or several high speed cameras
• PCI Express Interface, to transfer data in and out of the FPGA co-processor and to provide host machine access to the Reconfigurability Controller within the device
• Memory Interface, to provide external high speed memory access to the system, which may be used by certain algorithms and possibly for buffering the data between the FPGA and the host machine
• Reconfigurability Controller, which deals with the operation and (re)configuration of the reconfigurable blocks. The host machine controls the operation of the Reconfigurability Controller depending on the system description package
The dynamic part of the system has reconfigurable blocks arranged in a flexible matrix structure, which enables the system to continue processing even when a number of these blocks are under reconfiguration. The functionality of these blocks can be dynamically altered to enhance the performance and/or the accuracy of the algorithm used, or simply to replace the running algorithms with a new set. This provides flexibility



so the system will be able to adapt itself to the changes in the operating environment easily.

B. Host Machine
After the initial image acquisition and pre-processing stages, the image data will be transferred to the host machine where further processing and classification can commence. This host machine will have an x86 CPU and it will run a customised operating system. The co-processors will be connected to this system via the PCI Express interface.
A CUDA-capable NVIDIA® GPU will be used as an arithmetic accelerator to enhance the performance of certain image processing algorithms that can benefit from the massively parallel nature of a GPU device. CUDA is the programming framework provided by NVIDIA to run general purpose applications on NVIDIA GPUs. It incorporates an API (Application Programmer Interface), a runtime, a couple of higher-level libraries and a device driver for the underlying GPU. Most importantly, CUDA has addressed some of the inherent general-purpose computing problems with GPUs. CUDA's API for the programmer is an extension to the C programming language; CUDA allows the developer to scatter data around the DRAM, and it features a parallel data cache, or on-chip shared memory, for reducing the bottleneck between the DRAM and the GPU [6]. In the past, GPUs have been used as general-purpose computational units by wrapping computations in graphics function libraries, but with the emergence of CUDA this level of abstraction is no longer necessary.

[Figure: the host connects to the FPGA through a PCI/E Driver and to the GPU through a CUDA Interface; a Scheduler, System Parser and Process Monitor coordinate the system.]
Fig. 2. Host Architecture

Figure 2 outlines the elements of the host machine. These elements consist of:
• System Parser, which reads the system description package, and the Scheduler, which offloads relevant processing tasks to the co-processors
• CUDA Interface, which manages and coordinates CUDA processing tasks running on the GPU
• PCI Express Driver, which interfaces the host machine with the FPGA co-processor

V. CHALLENGES
Challenges in the development of this system include:
• Understanding the operation of Partial Dynamic Reconfiguration on the FPGA device used, which includes the investigation of bus macros and the generation of partial bitstreams in an efficient and effective way
• The structure of the dynamically reconfigurable blocks, with emphasis on details such as block size (granularity of the blocks) and support for FPGA-specific silicon features such as DSP blocks, Block RAMs, etc.
• The arrangement of the dynamically reconfigurable blocks in a flexible matrix structure, and placement issues on the actual silicon device
• Getting the PCI Express communication between the FPGA and the host system working, and coming up with a solution that prevents bottlenecks between the devices
• Designing the Input Controller and the physical hardware so that the system can be interfaced with high speed cameras
• The definition of a comprehensive system description package format, in which the operation of the entire system, accompanied by the relevant processing directives, can be described in a flexible and simple format. The parallel nature of the system is a major factor in the choice of a language for the system package format, and several programming languages with concurrency features are being investigated.
After the initial implementation phase, the system will be benchmarked with a few common algorithms and applications such as the Hough Transform, Object Recognition/Tracking and Stereo Vision. The outcome of these tests will help us fine-tune the system and improve the overall performance.

VI. CONCLUSION
A multi-device, dynamically reconfigurable architecture is proposed to satisfy the requirements of high-speed/real-time vision systems. The initial specification of this system has been defined and the implementation of the different sub-systems is under way. If successful, the authors believe that this system will provide a flexible low-cost framework for high speed vision systems.

REFERENCES
[1] M. A. Iqbal and U. S. Awan, "Run-time reconfigurable instruction set processor design: RT-RISP," in Proc. 2nd International Conference on Computer, Control and Communication (IC4 2009), 17-18 Feb. 2009, pp. 1-6.
[2] S. Jin, J. Cho, X. D. Pham, K. M. Lee, S.-K. Park, M. Kim, and J. W. Jeon, "FPGA design and implementation of a real-time stereo vision system," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 1, pp. 15-26, Jan. 2010.
[3] K. Bondalapati and V. K. Prasanna, "Reconfigurable computing systems," Proceedings of the IEEE, vol. 90, no. 7, pp. 1201-1217, July 2002.
[4] C. Torres-Huitzil, S. Maya-Rueda, and M. Arias-Estrada, "A reconfigurable vision system for real-time applications," Dec. 2002, pp. 286-289.
[5] M. Gokhale and P. S. Graham, Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays. Springer, 2005.
[6] N. Karunadasa and D. Ranasinghe, "Accelerating high performance applications with CUDA and MPI," pp. 331-336, Dec. 2009.
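As a software picture of the pixel-level operations that Section II assigns to the low-level pre-processing stage, the following Python sketch applies a 3×3 mean filter and then thresholds the result. It is an illustrative reference model only; in the proposed system this work would run inside the FPGA's reconfigurable blocks, and all names here are our own, not part of the described implementation.

```python
def mean3x3(img):
    # 3x3 mean filter over the interior pixels of a grayscale image;
    # border pixels are copied through unfiltered for simplicity
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            s = sum(img[y + dy][x + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = s // 9
    return out

def threshold(img, t):
    # Binarise: 1 where the (filtered) pixel exceeds the threshold
    return [[1 if p > t else 0 for p in row] for row in img]

img = [[0, 0, 0, 0],
       [0, 9, 9, 0],
       [0, 9, 9, 0],
       [0, 0, 0, 0]]
binary = threshold(mean3x3(img), 3)
```

Each stage is a pure streaming transform over the pixel array, which is precisely the kind of operation that maps naturally onto a dedicated reconfigurable block.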







Virtual SoPC rad-hardening for satellite applications
SYRIUS project
L. Barrandon¹, T. Capitaine² (MIS laboratory, Amiens, France), L. Lagadec³ (CACS team, Lab-STICC, Brest, France), N. Julien⁴ (CACS team, Lab-STICC, Lorient, France), C. Moy⁵ (SCEE team, SUPELEC/IETR, Rennes, France), T. Monédière⁶ (OSA department, XLIM laboratory, Limoges, France)
¹ludovic.barrandon@u-picardie.fr, ²thierry.capitaine@u-picardie.fr, ³loic.lagadec@univ-brest.fr, ⁴nathalie.julien@univ-ubs.fr, ⁵christophe.moy@supelec.fr, ⁶thierry.monediere@xlim.fr

Abstract—Our contribution addresses the specific problem of the design of satellites' on-board computers based on the use of non rad-hard devices, while keeping the security and functioning safety constraints. The topics dealt with in this context will be based on the dynamic programming related to the exploitation of data provided by the development of auto-diagnostic tools for embedded re-programmable devices, considering safety/power consumption trade-offs.

Keywords: virtual layer; auto-diagnostic; non rad-hard FPGAs; satellite on-board computer; power/energy consumption; SDR.

I. INTRODUCTION
Cosmic rays and solar winds (protons) can induce two types of failures in space-borne digital devices: single-event upsets (SEUs are transient and cause bit inversions) and single-event latch-ups (SELs are permanent and destroy logic and routing resources). Anti-fuse FPGAs can circumvent the SEU issue but do not permit reconfiguration: triple redundancy techniques are needed to prevent malfunctions due to SELs.
The main motivation of the SYRIUS project (SYstèmes embaRqués générIques reconfigUrables pour Satellites, i.e. generic and reconfigurable embedded systems for satellites) is to design a generic and reconfigurable embedded system taking into account the up-to-date methods and technologies to ensure reliability, low power/energy consumption and flexibility as stated in [1]. The use of non radiation-hardened devices stands for a technological paradigm to enable space electronics industries not to implement either rad-hard devices (expensive, difficult to maintain, under ITAR (International Traffic in Arms Regulations) and technologically old-fashioned) or ASICs (long design processes, extremely costly, poorly/not reconfigurable and "right-first-time" in 60% of the cases).
The system will be composed of the following "modules": auto-diagnostic, smart patch-antenna arrays, dynamic partial reconfiguration for reliability and power/energy consumption optimization, and digital radio. They will integrate the spatial-domain environmental constraints so as to obtain the related certifications and validations (radiation, vibration and thermal test-bench) mandatory to integrate and launch this platform on a 30x30x30 cm satellite (the French National Space Agency (CNES) RISTRETTO format [2]). Communication, reconfiguration and exploitation of embedded scientific experimentations will be done via our GENSO ground station [3] (i.e. several current or future projects will be able to join the satellite's payload as scientific experimentations).

II. TECHNICAL SHIFTS
A. Killing three birds with one stone
Present space digital hardware designs mainly use expensive and ITAR-constrained rad-hard devices and ASICs. Using off-the-shelf Systems on Programmable Chip (SoPC) would (1) dramatically reduce the costs in terms of hardware, (2) shorten time-to-market (enabling design-and-reuse) and (3) offer the possibility to use up-to-date technologies. This last remark, associated with SoPC reconfigurability, leads us to consider many potential improvements for space embedded systems which are intended to be studied in the frame of SYRIUS:
• Remote redefinition of the satellite's mission;
• Software Defined Radio (SDR) and Smart Radio [4] with protocol testing and auto-adaptive functionalities;
• Design under safety and power/energy constraints;
• "Programmed death": the SoPC will experience a gradual destruction of its resources until a minimum functional state is reached, whereas rad-hard devices can become permanently faulty at any time.

B. Auto-diagnostic and DPR
The detection and localization of faults associated with Dynamic Partial Reconfiguration (DPR) is of major interest: the goal is to take advantage of the reconfigurability and the regularity of the SRAM FPGAs' structure to consider unused logic as spare resources in case of SELs. This can be done thanks to a priori allocation using spare configurations, selecting the best partitioning granularity (programmable-logic-block level, triple modular redundancy with standby configurations, or modular level), or with dynamic processes to re-route or repair algorithms [5]. Built-in self test (BIST) methods are to be implemented and can be inspired from [6] (off-line method), from [7], where roving self-test areas are exhaustively displaced across the FPGA while in operation, and from [8], implementing competing configurations.

C. Methodology
1) Virtual layer
In the software domain, the notion of virtual machine (VM) is renowned for its ability to isolate the physical target (OS and hardware) from the application layer. This method ensures both portability and sustainability. In the embedded systems



domain, constrained environments can take advantage of the VM approach: the resources needed by the virtual layer can be compensated by the design of a VM optimized according to the physical target. In the fault-tolerant context, the common method consists in post-processing the netlists to insert redundancy before the place-and-route steps. The TMRTool [9] and BL-TMR [10] software can handle this technique. Its weakness is the need to implement redundancy for each new application; that is why we believe that factorizing this effort within a virtual layer is of particular interest: it would enable implementing the design without manipulating the netlists.
The hardware targets are usually considered as reliable, which is not true in this non rad-hard context. A challenge is to implement failure schemes and correction methods in the virtual layer to ensure robustness while relaxing failure management tools.

2) Safety engineering and power/energy consumption
Whereas research is prosperous in both domains, there are few studies dealing simultaneously with these two constraints [11]. We are willing to develop an innovative design technique based on state-graph driven system-level modeling. It will provide an optimum solution led by both quality of service and power/energy consumption, and implement decision algorithms to reconfigure the application.

III. HARDWARE ARCHITECTURE
The SYRIUS system is composed of a ground section, for radio-communications, scientific experimentation, failure recovery and fault analysis, and the satellite (payload and the on-board computer itself). This last module stands for the core of the system and is multi-purpose: analog and digital radio management, power management, auto-diagnostic, fault recovery and reconfiguration management.

[Figure: the ground station communicates with the satellite; on board, antenna arrays (144 MHz/430 MHz/1.2 GHz/2.4 GHz) feed an RF and mixed-signal front-end connected to a SoPC hosting the smart radio, auto-diagnostic and failure recovery, radio applications and scientific experiments, backed by a RAM holding bitstreams; the on-board computer, power supply, sensors, actuators and solar panels complete the satellite.]
Figure 1. Hardware architecture of the SYRIUS platform.

A multi-frequency patch-antenna array will be designed to be mounted on each face of the satellite ([12], [13]), using the state-of-the-art results about auto-adaptive coupling to optimize the Earth-satellite radio link budget and, consequently, to reduce the electrical power and energy consumption.
A secured management sub-system, necessary to any satellite, will ensure the vital tasks and the mission redefinition by dynamic reprogramming of digital devices via a ground station. In this context, an auto-adaptive and multi-protocol SDR sub-system will be implemented.

IV. PERSPECTIVES
Designing this generic space-borne on-board computer aims at reducing costs, time-to-market and validation cycles while keeping high reliability constraints. Preparing the launch of this system for testing purposes will drive our developments.

ACKNOWLEDGMENT
The SYRIUS project has been submitted to the French National Research Agency and is intended to involve two companies (Steel Electronique and Nova Nano) and the AMSAT association for amateur radio.

REFERENCES
[1] M. Sghairi, J.-J. Aubert, P. Brot, A. de Bonneval, Y. Crouzet, Y. Laarouchi, "Distributed and Reconfigurable Architecture for Flight Control System", 28th Digital Avionics Systems Conference (DASC 2009), IEEE/AIAA, 23-29 Oct. 2009, pp. 6.B.2-1 - 6.B.2-10.
[2] M. Saleman, D. Hernandez, C. Lambert, "RISTRETTO: A French Space Agency Initiative for Student Satellite in Open Source and International Cooperation," AIAA/USU Conf. on Small Satellites, Logan, UT, USA, Aug. 10-13, 2009, SSC09-VII-8.
[3] T. Capitaine, V. Bourny, L. Barrandon, J. Senlis, A. Lorthois, "A satellite tracking system designed for educational and scientific purposes", ESA 4S (Small Satellite Systems and Services) Symposium, 31 May - 4 June 2010, Funchal, Madeira.
[4] W. Jouini, C. Moy, J. Palicot, "On decision making for dynamic configuration adaptation problem in cognitive radio equipments: a multi-armed bandit based approach", 6th Karlsruhe Workshop on Software Radios, March 3-4, 2010.
[5] M. G. Parris, "Optimizing Dynamic Logic Realizations for Partial Reconfiguration of Field Programmable Gate Arrays", School of Electrical Engineering and Computer Science, University of Central Florida, Orlando.
[6] T. Nandha Kumar, C. Wai Chong, "An automated approach for locating multiple faulty LUTs in FPGA", Microelectronics Reliability, vol. 48, 2008, pp. 1900-1906.
[7] J. M. Emmert, C. E. Stroud, M. Abramovici, "Online Fault Tolerance for FPGA Logic Blocks", IEEE Trans. on VLSI Systems, vol. 15, 2007, pp. 216-226.
[8] R. F. DeMara, K. Zhang, "Autonomous FPGA fault handling through competitive runtime reconfiguration", NASA/DoD Conference on Evolvable Hardware, Washington D.C., USA, 2005, pp. 109-116.
[9] Xilinx, "TMRTool", www.xilinx.com/ise/optional_prod/tmrtool.htm
[10] Brigham Young University, "BYU EDIF Tools Home Page", http://sourceforge.net/projects/byuediftools/.
[11] T. Sato, T. Funaki, "Dependability, Power, and Performance Trade-off on a Multicore Processor", Asia & South Pacific Design Automation Conference (ASP-DAC), 21-24 March 2008, pp. 714-719.
[12] S.-W. Su, S.-T. Fang, K.-L. Wong, "A Low-Cost Surface-Mount Monopole Antenna for 2.4/5.2/5.8-GHz Band Operation", Microwave and Optical Technology Letters, vol. 36, March 2003.
[13] C. Ying, Y. P. Zhang, "Integration of Ultra-Wideband Slot Antenna on LTCC Substrate", Electronics Letters, vol. 40, no. 11, 27 May 2004.
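The triple-redundancy principle that TMRTool [9] and BL-TMR [10] automate at the netlist level can be shown in miniature. The Python sketch below is a behavioural illustration with hypothetical names, not SYRIUS code and not the tools' actual mechanism: a combinational function is triplicated and a bitwise 2-of-3 voter masks the output of a single SEU-corrupted copy.

```python
def majority(a, b, c):
    # Bitwise 2-of-3 majority vote, the voter a TMR flow inserts
    return (a & b) | (a & c) | (b & c)

def logic(v):
    # Stand-in for some triplicated combinational function
    return (v ^ 0x0F) & 0xFF

# Simulate an SEU flipping one bit in one of the three redundant copies
copies = [logic(0xF0), logic(0xF0) ^ 0x08, logic(0xF0)]
voted = majority(*copies)  # the corrupted copy is outvoted
```

The vote masks any single faulty copy; it is exactly this per-application insertion effort that the paper proposes to factor once into the virtual layer.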



ISBN 978-3-86644-515-4

ISSN 1869-9669
