EMBEDDED
SYSTEMS
© 2006 by Taylor & Francis Group, LLC
Published Books
Industrial Communication Technology Handbook
Edited by Richard Zurawski
Embedded Systems Handbook
Edited by Richard Zurawski
Forthcoming Books
Electronic Design Automation for Integrated Circuits Handbook
Luciano Lavagno, Grant Martin, and Lou Scheffer
Series Editor
RICHARD ZURAWSKI
INDUSTRIAL INFORMATION TECHNOLOGY SERIES
EMBEDDED SYSTEMS HANDBOOK
Edited by
RICHARD ZURAWSKI
A CRC title, part of the Taylor & Francis imprint, a member of the
Taylor & Francis Group, the academic division of T&F Informa plc.
Boca Raton London New York
To my wife, Celine
International Advisory Board
Alberto Sangiovanni-Vincentelli, University of California, Berkeley, U.S. (Chair)
Giovanni De Micheli, Stanford University, U.S.
Stephen A. Edwards, Columbia University, U.S.
Aarti Gupta, NEC Laboratories, Princeton, U.S.
Rajesh Gupta, University of California, San Diego, U.S.
Axel Jantsch, Royal Institute of Technology, Sweden
Wido Kruijtzer, Philips Research, The Netherlands
Luciano Lavagno, Cadence Berkeley Laboratories, Berkeley, U.S., and Politecnico di Torino, Italy
Robert de Simone, INRIA, France
Grant Martin, Tensilica, U.S.
Pierre G. Paulin, ST Microelectronics, Canada
Antal Rajnák, Volcano AG, Switzerland
Françoise Simonot-Lion, LORIA, France
Thomas Weigert, Motorola, U.S.
Reinhard Wilhelm, University of Saarland, Germany
Lothar Thiele, Swiss Federal Institute of Technology, Switzerland
Preface
Introduction
The purpose of the Embedded Systems Handbook is to provide a reference useful to a broad range of
professionals and researchers from industry and academia involved in the evolution of concepts and
technologies, as well as development and use of embedded systems and related technologies.
The book provides a comprehensive overview of the field of embedded systems and applications. The
emphasis is on advanced material to cover recent significant research results and technology evolution and
developments. It is primarily aimed at experienced professionals from industry and academia, but will
also be useful to novices with some university background in embedded systems and related areas. Some
of the topics presented in the book have received limited coverage in other publications either owing to
the fast evolution of the technologies involved, or material confidentiality, or limited circulation in the
case of industry-driven developments.
The book covers extensively the design and validation of real-time embedded systems, design and
verification languages, operating systems and scheduling, timing and performance analysis, power aware
computing, security in embedded systems, the design of application-specific instruction-set processors
(ASIPs), system-on-chip (SoC) and network-on-chip (NoC), testing of core-based ICs, networked embedded
systems and sensor networks, and embedded applications to include in-car embedded electronic systems,
intelligent sensors, and embedded web servers for industrial automation.
The book contains 46 contributions, written by leading experts from industry and academia directly
involved in the creation and evolution of the ideas and technologies treated in the book.
Many of the contributions are from industry and industrial research establishments at the forefront of
the developments shaping the field of embedded systems: Cadence Systems and Cadence Berkeley Labs
(USA), CoWare (USA), Microsoft (USA), Motorola (USA), NEC Laboratories (USA), Philips Research
(The Netherlands), ST Microelectronics (Canada), Tensilica (USA), Volcano (Switzerland), etc.
The contributions from academia and governmental research organizations are represented by some
of the most renowned institutions such as Columbia University, Duke University, Georgia Institute of
Technology, Princeton University, Stanford University, University of California at Berkeley/Riverside/
San Diego/Santa Barbara, University of Texas at Austin/Dallas, Virginia Tech, Washington University
from the United States; Delft University of Technology (Netherlands), IMAG (France), INRIA/IRISA
(France), LORIA-INPL (France), Mälardalen University (Sweden), Politecnico di Torino (Italy), Royal
Institute of Technology KTH (Sweden), Swiss Federal Institute of Technology ETHZ (Switzerland),
Technical University of Berlin (Germany), Twente University (The Netherlands), Universidad Politécnica
de Madrid (Spain), University of Bologna (Italy), University of Nice Sophia Antipolis (France), University
of Oslo (Norway), University of Pavia (Italy), University of Saarbrücken (Germany), University of Toronto
(Canada), and many others.
The material presented is in the form of tutorials, surveys, and technology overviews. The contributions
are grouped into sections for cohesive and comprehensive presentation of the treated areas. The reports
on recent technology developments, deployments, and trends frequently cover material released to the
profession for the first time.
The book can be used as a reference (or prescribed text) for university (post)graduate courses: Section I
(Embedded Systems) provides core material on embedded systems. Selected illustrations of actual
applications are presented in Section VI (Embedded Applications). Sections II and III (System-on-Chip
Design, and Testing of Embedded Core-Based Integrated Circuits) offer material on recent advances in
system-on-chip design and testing of core-based ICs. Sections IV and V (Networked Embedded Systems,
and Sensor Networks) are suitable for a course on sensor networks.
The handbook is designed to cover a wide range of topics that comprise the field of embedded sys-
tems and applications. The material covered in this volume will be of interest to a wide spectrum of
professionals and researchers from industry and academia, as well as graduate students, from the fields of
electrical and computer engineering, computer science and software engineering, as well as mechatronic
engineering.
It is an indispensable companion for those who seek to learn more about embedded systems and
applications, and those who want to stay up to date with recent technical developments in the field. It is
also a comprehensive reference for university or professional development courses on embedded systems.
Organization
Embedded systems is a vast field encompassing numerous disciplines. Not every topic, however important,
can be covered in a book of reasonable volume without superficial treatment. Choices need to be made
with respect to the topics covered, balance between research material and reports on novel industrial
developments and technologies, balance between so-called core topics and new trends, and other aspects.
The time-to-market is another important factor in making those decisions, along with the availability
of qualified authors to cover the topics.
One of the main objectives of any handbook is to give a well-structured and cohesive description of
fundamentals of the area under treatment. It is hoped that the section Embedded Systems has achieved this
objective. Every effort was made to make sure that each contribution in this section contains introductory
material to assist beginners with the navigation through more advanced issues. This section does not
strive to replicate or replace university level material, but, rather, tries to address more advanced issues,
and recent research and technology developments.
To make this book timely and relevant to a broad range of professionals and researchers, the book
includes material reflecting state-of-the-art trends to cover topics such as design of ASIPs, SoC com-
munication architectures including NoC, design of heterogeneous SoC, as well as testing of core-based
integrated circuits. This material reports on new approaches, methods, technologies, and actual sys-
tems. The contributions come from the industry driving those developments, industry-afliated research
institutions, and academic establishments participating in major research initiatives.
Application domains have had a considerable impact on the evolution of embedded systems, in terms
of required methodologies and supporting tools, and resulting technologies. A good example is the accel-
erated evolution of the SoC design to meet demands for computing power posed by DSP, network and
multimedia processors. SoCs are slowly making inroads into the area of industrial automation to imple-
ment complex field-area intelligent devices which integrate the intelligent sensor/actuator functionality by
providing on-chip signal conversion, data and signal processing, and communication functions. There is
a growing tendency to network field-area intelligent devices around industrial communication networks.
Similar trends appear in the automotive electronic systems where the Electronic Control Units (ECUs)
are networked by means of safety-critical communication protocols such as FlexRay, for instance, for
the purpose of controlling vehicle functions such as electronic engine control, anti-lock braking system,
active suspension, etc. The design of this kind of networked embedded system (this also includes hard
real-time industrial control systems) is a challenge in itself due to the distributed nature of the processing
elements sharing a common communication medium, and the safety-critical requirements. With the auto-
motive industry increasingly keen on adopting mechatronic solutions, it was felt that exploring, in detail,
the design of in-vehicle electronic embedded systems would be of interest to the readers of this book.
The applications part of the book also touches the area of industrial automation (networked control
systems) where the issues are similar. In this case, the focus is on the design of web servers embedded in
the intelligent field-area devices, and the security issues arising from internetworking.
Sensor networks are another example of networked embedded systems, although the embedding
factor is not so evident as in other applications, particularly for wireless and self-organizing networks where
the nodes may be embedded in an ecosystem, battlefield, or chemical plant, for instance. The area of
wireless sensor networks has now reached relative maturity. Owing to its novelty and growing import-
ance, the area has been included in the book to give a comprehensive overview, and to present new
research results which are likely to have a tangible impact on further developments and technology.
The specifics of the design automation of integrated circuits have been deliberately omitted in this book
to keep the volume at a reasonable size and in view of the publication of another handbook which covers
these aspects in a comprehensive way: The Electronic Design Automation for Integrated Circuits Handbook,
CRC Press, FL, 2005, Editors: Luciano Lavagno, Grant Martin, and Lou Scheffer.
The aim of the Organization section is to provide highlights of the contents of the individual chapters
to assist readers with identifying material of interest, and to put topics discussed in a broader context.
Where appropriate, a brief explanation of the topic under treatment is provided, particularly for chapters
describing novel trends, and with novices in mind. The book is organized into six sections: Embed-
ded Systems, System-on-Chip Design, Testing of Embedded Core-Based Integrated Circuits, Networked
Embedded Systems, Sensor Networks, and Embedded Applications.
I Embedded Systems
This section provides a broad introduction to embedded systems. The presented material offers a com-
bination of fundamental and advanced topics, as well as novel results and approaches, to cover the area
fairly comprehensively. The presented topics include issues in real-time and embedded systems, design
and validation, design and verication languages, operating systems, timing and performance analysis,
power aware computing, and security.
Real-Time and Embedded Systems
This subsection provides a context for the material covered in the book. It gives an overview of real-time
and embedded systems and their networking to include issues, methods, trends, applications, etc.
The focus of the chapter Embedded Systems: Toward Networking of Embedded Systems is on network-
ing of embedded systems. It briefly discusses the rationale for the emergence of these kinds of systems,
their benets, types of systems, diversity of application domains and requirements arising from that, as
well as security issues. Subsequently, the chapter discusses the design methods for networked embedded
systems, which fall into the general category of system-level design. The methods overviewed focus on
two separate aspects, namely the network architecture design and the system-on-chip design. The design
issues and practices are illustrated by examples from the automotive application domain. After that, the
chapter introduces selected application domains for networked embedded systems, namely: industrial
and building automation control, and automotive control applications. The focus of the discussion is on
the networking aspects. The chapter gives an overview of the networks used in industrial applications,
including the industrial Ethernet and its standardization process; building automation control; and net-
works for automotive control and other applications from the automotive domain, but the emphasis
is on networks for safety critical solutions. Finally, general aspects of wireless sensor/actuator networks
are presented, and illustrated by an actual industrial implementation of the concept. At the end of the
chapter, a few paragraphs are dedicated to the security issues for networked embedded systems.
An authoritative introduction to real-time systems is provided in Real-Time in Embedded Systems. The
chapter covers extensively the areas of design and analysis, with some examples of analysis, as well as
tools; operating systems (an in-depth discussion of real-time embedded operating systems is presented in
the chapter Real-Time Embedded Operating Systems: Standards and Perspectives); scheduling (the chapter
Real-Time Embedded Operating Systems: The Scheduling and Resource Management Aspects presents an
authoritative description and analysis of real-time scheduling); communications to include descriptions of
selected fieldbus technologies and Ethernet for real-time communications; and component based design,
as well as testing and debugging. This is essential reading for anyone interested in the area of real-time
systems.
Design and Validation of Embedded Systems
The subsection Design and Validation of Embedded Systems contains material presenting design methodo-
logy for embedded systems and supporting tools, as well as selected software and hardware implementation
aspects. Models of Computation (MoC), which are essentially abstract representations of computing
systems, are used throughout to facilitate the design and validation stages of systems development, and
approaches to validation as well as available methods and tools are discussed. The verification methods,
together with an overview of verification languages, are presented in the subsection Design and Verification
Languages. In addition, the subsection presents novel research material including a framework used to
introduce different models of computation particularly suited to the design of heterogeneous multi-
processor SoC, and a mathematical model of embedded systems based on the theory of agents and
interactions.
A comprehensive introduction to the design methodology for embedded systems is presented in the
chapter Design of Embedded Systems. It gives an overview of the design issues and stages. Then, the
chapter presents, in quite some detail, the functional design, function/architecture and hardware/software
codesign, and hardware/software coverification and hardware simulation. Subsequently, the chapter dis-
cusses selected software and hardware implementation issues. While discussing different design stages and
approaches, the chapter also introduces and evaluates supporting tools.
An excellent introduction to the topic of models of computation, particularly for embedded systems, is
presented in the chapter Models of Embedded Computation. The chapter introduces the origin of MoC, and
the evolution from models of sequential and parallel computation to attempts to model heterogeneous
architectures. In the process, the chapter discusses, in relative detail, selected nonfunctional properties
such as power consumption, component interaction in heterogeneous systems, and time. It also presents a
new framework used to introduce four different models of computation, and shows how different time
abstractions can serve different purposes and needs. The framework is subsequently used to study the
coexistence of different computational models; specically the interfaces between two different MoCs and
the refinement of one MoC into another. This part of the chapter is particularly relevant to the material
on the design of heterogeneous multiprocessor SoC presented in the section System-on-Chip Design.
A comprehensive survey of selected models of computation is presented in the chapter Modeling
Formalisms for Embedded System Design. The surveyed formalisms include Finite State Machines (FSM),
Finite State Machines with Datapath (FSMD), Moore machine, Mealy machine, Codesign Finite State
Machines (CFSM), Program State Machines (PSM), Specification and Description Language (SDL),
Message Sequence Charts (MSC), Statecharts, Petri nets, synchronous/reactive models, discrete event
systems, dataflow models, etc. The presentation of individual models is augmented by numerous
examples.
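To make the contrast between two of the surveyed formalisms concrete, the sketch below uses a hypothetical edge-detector example (not taken from the chapter): a Mealy machine, whose output is a function of state and input, reacts at the instant of the 0-to-1 transition, while a Moore machine, whose output is a function of state alone, reports it one instant later.

```python
# Hypothetical edge-detector, sketched two ways (illustrative only).

def run_mealy(bits):
    """Mealy machine: output depends on the current state AND the input."""
    state, outputs = 0, []          # state = last bit seen
    for b in bits:
        outputs.append(1 if (state == 0 and b == 1) else 0)
        state = b
    return outputs

def run_moore(bits):
    """Moore machine: output depends on the current state only, so the
    0 -> 1 edge is reported one instant later than in the Mealy version."""
    state, outputs = "idle", []     # states: idle (last 0), edge, high
    for b in bits:
        outputs.append(1 if state == "edge" else 0)
        if state == "idle":
            state = "edge" if b == 1 else "idle"
        else:
            state = "high" if b == 1 else "idle"
    return outputs

if __name__ == "__main__":
    stream = [0, 0, 1, 1, 0, 1]
    print(run_mealy(stream))  # [0, 0, 1, 0, 0, 1]
    print(run_moore(stream))  # [0, 0, 0, 1, 0, 0]
```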
The chapter System Validation briefly discusses approaches to requirements capture, analysis and
validation, and surveys available methods and tools to include: descriptive formal methods such as
VDM, Z, B, RAISE (Rigorous Approach to Industrial Software Engineering), CASL (Common Algebraic
Specification Language), SCR (Software Cost Reduction), and EVES; deductive verifiers: HOL, Isabelle,
PVS, Larch, Nqthm, and Nuprl; and state exploration tools: SMV (Symbolic Model Verifier), Spin, COSPAN
(COordination SPecification Analysis), MEIJE, CADP, and Murphi. It also presents a mathematical model
of embedded systems based on the theory of agents and interactions. To underline the novelty of this form-
alism, classical theories of concurrency are surveyed to include process algebras, temporal logic, timed
automata, (Gurevich's) ASM (Abstract State Machine), and rewriting logic. As an illustration, the chapter
presents a specification of a simple scheduler.
Design and Verification Languages
This section gives a comprehensive overview of languages used to specify, model, verify, and program
embedded systems. Some of those languages embody different models of computation discussed in
the previous section. A brief overview of Architecture Description Languages (ADL) is presented in
Embedded Applications (Automotive Networks); the use of this class of languages, in the context of
describing in-car embedded electronic systems, is illustrated through the EAST-ADL language.
An authoritative introduction to a broad range of languages used in embedded systems is presen-
ted in the chapter Languages for Embedded Systems. The chapter surveys some of the most representative
and widely used languages. Software languages: assembly languages for complex instruction set computers
(CISC), reduced instruction set computers (RISC), digital signal processors (DSPs) and very-long instruc-
tion word processors (VLIWs), and for small (4- and 8-bit) microcontrollers; the C and C++ languages;
Java; and real-time operating systems. Hardware languages: Verilog and VHDL. Dataflow languages: Kahn
Process Networks and Synchronous Dataflow (SDF). Hybrid languages: Esterel, SDL, and SystemC. Each
group of languages is characterized for their specific application domains and illustrated with ample code
examples.
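The dataflow languages listed above lend themselves to static analysis; as a hedged sketch (the two-actor graph and its token rates are hypothetical, not from the chapter), in Synchronous Dataflow each actor produces and consumes a fixed number of tokens per firing, so the per-iteration firing counts and buffer bounds can be computed before run time:

```python
from fractions import Fraction
from math import lcm

# Minimal SDF sketch: actor A produces 2 tokens per firing on a channel
# from which actor B consumes 3. The balance equation 2*q_A == 3*q_B
# fixes the smallest integer repetition vector (q_A, q_B).

def repetition_vector(produce, consume):
    """Smallest integer firing counts (q_src, q_dst) balancing one channel."""
    q_src = Fraction(1)
    q_dst = q_src * produce / consume       # from produce*q_src == consume*q_dst
    scale = lcm(q_src.denominator, q_dst.denominator)
    return int(q_src * scale), int(q_dst * scale)

def simulate(produce, consume, q_src, q_dst):
    """Fire actors in a simple demand-driven order; return the tokens left
    after one iteration (should be 0) and the peak buffer fill."""
    tokens, peak = 0, 0
    fired_src = fired_dst = 0
    while fired_src < q_src or fired_dst < q_dst:
        if tokens >= consume and fired_dst < q_dst:
            tokens -= consume               # destination actor fires
            fired_dst += 1
        else:
            tokens += produce               # source actor fires
            fired_src += 1
            peak = max(peak, tokens)
    return tokens, peak

if __name__ == "__main__":
    qa, qb = repetition_vector(2, 3)
    print(qa, qb)                 # 3 firings of A, 2 of B per iteration
    print(simulate(2, 3, qa, qb)) # buffer returns to 0; peak fill is 4
```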
An in-depth introduction to synchronous languages is presented in The Synchronous Hypothesis and
Synchronous Languages. Before introducing the synchronous languages, the chapter discusses the concept
of synchronous hypothesis: the basic notion, mathematical models, and implementation issues. Sub-
sequently, it overviews the structural languages used for modeling and programming synchronous
applications. Imperative languages, Esterel and SyncCharts, provide constructs to deal with control-
dominated programs. Declarative languages, Lustre and Signal, are particularly suited for applications
based on intensive data computation and dataow organization. Future trends are also covered.
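The synchronous hypothesis can be given a small hedged illustration (a Lustre-flavored counter node recast in Python; the node and its streams are hypothetical, not taken from the chapter): at each logical instant the program reads all its inputs and computes all its outputs "instantaneously", and only then does logical time advance.

```python
# Illustrative sketch of a Lustre-style node:
#   count = 0 -> if reset then 0 else pre(count) + (1 if tick else 0)
# One output value per logical instant; pre(count) is the previous value.

def counter_node(ticks, resets):
    out, pre = [], 0
    for tick, reset in zip(ticks, resets):
        cur = 0 if reset else pre + (1 if tick else 0)
        out.append(cur)   # output of the current instant
        pre = cur         # becomes pre(count) at the next instant
    return out

if __name__ == "__main__":
    print(counter_node([1, 1, 0, 1, 1], [0, 0, 0, 1, 0]))  # [1, 2, 2, 0, 1]
```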
The chapter Introduction to UML and the Modeling of Embedded Systems gives an overview of the
use of UML (Unified Modeling Language) for modeling embedded systems. The chapter presents a
brief overview of UML and discusses UML features suited to represent the characteristics of embedded
systems. The UML constructs, the language use, and other issues are introduced through an example
of an automatic teller machine. The chapter also briefly discusses a standardized UML profile (a spe-
cification language instantiated from the UML language family) suitable for modeling of embedded
systems.
A comprehensive survey and overview of verification languages is presented in the chapter Verification
Languages. It describes languages for verification of hardware, software, and embedded systems. The focus
is on the support that a verification language provides for dynamic verification based on simulation,
as well as static verification based on formal techniques. Before discussing the languages, the chapter
provides some background on verification methods. This part introduces basics of simulation-based
verification, formal verification, and assertion-based verification. It also discusses selected logics that
form the basis of languages described in the chapter: propositional logic, first-order predicate logic,
temporal logics, and regular and ω-regular languages. The hardware verification languages (HVLs) covered
include: e, OpenVera, Sugar/PSL, and ForSpec. The languages for software verification overviewed include
programming languages: C/C++, and Java; and modeling languages: UML, SDL, and Alloy. Languages
for SoCs and embedded systems verification include system-level modeling languages: SystemC, SpecC,
and SystemVerilog. The chapter also surveys domain-specific verification efforts, such as those based on
Esterel and hybrid systems.
Operating Systems and Quasi-Static Scheduling
This subsection offers a comprehensive introduction to real-time and embedded operating systems to cover
fundamentals and selected advanced issues. To complement this material with new developments, it gives
an overview of the operating system interfaces specied by the POSIX 1003.1 international standard and
related to real-time programming and introduces a class of operating systems based on virtual machines.
The subsection also includes research material on quasi-static scheduling.
The chapter Real-Time Embedded Operating Systems: Standards and Perspectives provides a compre-
hensive introduction to the main features of real-time embedded operating systems. It overviews some
of the main design and architectural issues of operating systems: system architectures, process and
thread model, processor scheduling, interprocess synchronization and communication, and network sup-
port. The chapter presents a comprehensive overview of the operating system interfaces specied by
the POSIX 1003.1 international standard and related to real-time programming. It also gives a short
description of selected open-source real-time operating systems to include eCos, µClinux, RT-Linux and
RTAI, and RTEMS. The chapter also presents a fairly comprehensive introduction to a class of operating
systems based on virtual machines.
Task scheduling algorithms and resource management policies, put in the context of real-time
systems, are the main focus of the chapter Real-Time Embedded Operating Systems: The Schedul-
ing and Resource Management Aspects. The chapter discusses in detail periodic task handling to
include Timeline Scheduling (TS), Rate-Monotonic (RM) scheduling, the Earliest Deadline First (EDF)
algorithm, and approaches to handling tasks with deadlines less than their periods; and aperi-
odic task handling. Protocols for accessing shared resources discussed include the Priority Inherit-
ance Protocol (PIP) and the Priority Ceiling Protocol (PCP). Novel approaches for handling transient
overloads and execution overruns in soft real-time systems working in dynamic environments, which
provide efficient support for real-time multimedia systems, are also mentioned in the chapter.
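The EDF algorithm mentioned above can be sketched with a unit-time simulation (the task sets below are hypothetical, not drawn from the chapter); for periodic tasks with deadlines equal to periods, preemptive EDF meets all deadlines exactly when the total utilization sum(C_i/T_i) does not exceed 1:

```python
# Hedged sketch of preemptive EDF scheduling in discrete time.

def edf_schedule(tasks, horizon):
    """tasks: list of (wcet, period) with deadline == period.
    Simulate unit-time preemptive EDF; return True iff no deadline is missed."""
    jobs = []  # each job: [absolute_deadline, remaining_execution]
    for t in range(horizon):
        for wcet, period in tasks:
            if t % period == 0:
                jobs.append([t + period, wcet])      # job release
        ready = [j for j in jobs if j[1] > 0]
        if ready:
            min(ready, key=lambda j: j[0])[1] -= 1   # earliest deadline runs
        if any(j[0] == t + 1 and j[1] > 0 for j in jobs):
            return False                             # deadline miss
    return True

if __name__ == "__main__":
    print(edf_schedule([(1, 4), (2, 6), (3, 8)], 48))  # U ~ 0.96 -> schedulable
    print(edf_schedule([(2, 4), (2, 6), (3, 8)], 48))  # U ~ 1.21 -> miss
```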
The chapter Quasi-Static Scheduling of Concurrent Specifications presents methods for efficient
synthesis of uniprocessor software, with the aim of improving the speed of the scheduled design. The proposed
approach starts from a specification represented in terms of concurrent communicating processes, derives
an intermediate representation based on Petri nets or Boolean Dataflow Graphs, and finally attempts
to obtain a sequential schedule to be implemented on a processor. The potential benefits result from
the replacement of explicit communication among processes by data assignment, and from the reduced
number of context switches owing to the reduction in the number of processes.
Timing and Performance Analysis
Many embedded systems, particularly hard real-time systems, impose strict restrictions on the execution
time of tasks which are required to be completed within certain time bounds. For this class of systems,
schedulability analysis requires the upper bounds for the execution times of all tasks to be known in
order to verify whether the system meets its timing requirements. The chapter Determining Bounds on
Execution Times presents the architecture of the aiT timing-analysis tool and the approach to timing analysis
implemented in the tool. In the process, the chapter discusses cache-behavior prediction, pipeline analysis,
path analysis using integer linear programming, and other issues. The use of this approach is put in the
context of upper bounds determination. In addition, the chapter gives a brief overview of other approaches
to timing analysis.
The validation of nonfunctional requirements of selected implementation aspects such as deadlines,
throughputs, buffer space, power consumption, etc., comes under performance analysis. The chapter
Performance Analysis of Distributed Embedded Systems discusses issues behind performance analysis and its
role in the design process. It also surveys a few selected approaches to performance analysis for distributed
embedded systems to include simulation-based methods, holistic scheduling analysis, and compositional
methods. Subsequently, the chapter introduces the performance network approach, which is, as stated by
the authors, influenced by the worst-case analysis of communication networks. The presented approach allows one to
obtain upper and lower bounds on quantities such as end-to-end delay and buffer space; it also covers
all possible corner cases independent of their probability.
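Since the performance network approach is, as the authors state, influenced by the worst-case analysis of communication networks, the flavor of such bounds can be sketched with a classical network-calculus result (an illustrative computation, not the chapter's actual framework): for token-bucket arrivals alpha(t) = b + r*t served with a rate-latency curve beta(t) = R*max(0, t - T), the delay bound is T + b/R and the backlog bound is b + r*T, provided r <= R.

```python
# Hedged illustration of classical network-calculus bounds (not the
# chapter's framework): token-bucket arrival curve vs. rate-latency server.
#   delay   <= T + b/R   (maximum horizontal deviation between the curves)
#   backlog <= b + r*T   (maximum vertical deviation), for r <= R.

def delay_bound(b, r, R, T):
    assert r <= R, "stability requires arrival rate <= service rate"
    return T + b / R

def backlog_bound(b, r, R, T):
    assert r <= R, "stability requires arrival rate <= service rate"
    return b + r * T

if __name__ == "__main__":
    # Hypothetical flow: burst 4 kbit, rate 1 kbit/ms, server 2 kbit/ms, latency 3 ms
    print(delay_bound(4, 1, 2, 3))    # 5.0 ms end-to-end delay bound
    print(backlog_bound(4, 1, 2, 3))  # 7 kbit buffer bound
```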
Power Aware Computing
Embedded nodes, or devices, are frequently battery powered. The growing power dissipation, with
the increase in density of integrated circuits and clock frequency, has a direct impact on the cost of
packaging and cooling, as well as reliability and lifetime. These and other factors make the design
for low power consumption a high priority for embedded systems. The chapter Power Aware Embed-
ded Computing presents a survey of design techniques and methodologies aimed at reducing static and
dynamic power dissipation. The chapter discusses energy and power modeling to include instruction
level and function level power models, micro-architectural power models, memory and bus models, and
battery models. Subsequently, the chapter discusses system/application level optimizations which explore
different task implementations exhibiting different power/energy versus quality-of-service characterist-
ics. Energy-efficient processing subsystems: voltage and frequency scaling, dynamic resource scaling, and
processor core selection, are also overviewed in the chapter. Finally, the chapter discusses energy-efficient
memory subsystems: cache hierarchy tuning, novel horizontal and vertical cache partitioning schemes,
dynamic scaling of memory elements, software-controlled memories, scratch-pad memories, improving
access patterns to on-chip memory, special-purpose memory subsystems for media streaming, code
compression, and interconnect optimizations.
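The rationale behind voltage and frequency scaling can be sketched with the textbook CMOS dynamic-power model (the capacitance, voltages, and cycle counts below are illustrative assumptions, not the chapter's data): dynamic power grows as C_eff * V^2 * f, while a task of N cycles runs for N/f seconds, so scaling V and f together cuts task energy roughly quadratically in V at the price of a linear slowdown.

```python
# Hedged sketch of why voltage/frequency scaling saves energy
# (textbook CMOS model, hypothetical numbers).

def dynamic_power(c_eff, volt, freq):
    """Dynamic power in watts: P ~ C_eff * V^2 * f."""
    return c_eff * volt ** 2 * freq

def task_energy(c_eff, volt, freq, cycles):
    """Energy for a task of `cycles` cycles: power times execution time N/f."""
    return dynamic_power(c_eff, volt, freq) * (cycles / freq)

if __name__ == "__main__":
    C, N = 1e-9, 2e8                    # hypothetical: 1 nF switched, 200M cycles
    full = task_energy(C, 1.2, 1e9, N)  # 1.2 V at 1 GHz
    half = task_energy(C, 0.6, 5e8, N)  # half voltage, half frequency
    print(full, half, full / half)      # energy drops ~4x; runtime doubles
```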
Security in Embedded Systems
There is a growing trend for networking of embedded systems. Representative examples of such systems
can be found in the automotive, train, and industrial automation domains. Many of those systems are required
to be connected to other networks to include LAN, WAN, and the Internet. For instance, there is a
growing demand for remote access to process data at the factory floor. This, however, exposes systems
to potential security attacks, which may compromise their integrity and cause damage. The limited
resources of embedded systems pose a considerable challenge for the implementation of effective security
policies which, in general, are resource demanding.
embedded systems is presented in the chapter Design Issues in Secure Embedded Systems. The chapter
outlines security requirements in computing systems, classifies abilities of attackers, and discusses security
implementation levels. Security constraints discussed in the context of embedded systems design include energy
considerations, processing power limitations, flexibility and availability requirements, and cost of imple-
mentation. Subsequently, the chapter presents the main issues in the design of secure embedded systems.
It also covers, in detail, attacks and countermeasures of cryptographic algorithm implementations in
embedded systems.
II System-on-Chip Design
Multi-Processor Systems-on-Chip (MPSoC), which combine the advantages of parallel processing with
the high integration levels of SoCs, emerged as a viable solution to meet the demand for computational
power required by applications such as network and media processors. The design of MPSoCs typically
involves integration of heterogeneous hardware and software IP components. However, the support for
reuse of hardware and software IP components is limited, thus potentially making the design process
labor-intensive, error-prone, and expensive. Selected component-based design methodologies for the
integration of heterogeneous hardware and software IP components are presented in this section together
with other issues such as design of ASIPs, communication architectures to include NoC, and platform-
based design, to mention a few. Those topics are presented in eight chapters introducing the SoC concept
and design issues; design of ASIPs; SoC communication architectures; principles and guidelines for
the NoC design; platform-based design principles; converter synthesis for incompatible protocols; a
component-based design automation approach for multiprocessor SoC platforms; an interface-centric
approach to the design and programming of embedded multiprocessors; and an STMicroelectronics-
developed exploration multiprocessor SoC platform.
A comprehensive introduction to the SoC concept, in general, and design issues is provided in the
chapter System-on-Chip and Network-on-Chip Design. The chapter discusses the basics of SoCs, IP cores, and virtual components; introduces the concept of architectural platforms and surveys selected industry offerings; and provides a comprehensive overview of the SoC design process.
A retargetable framework for ASIP design is presented in A Novel Methodology for the Design of
Application-Specific Instruction-Set Processors. The framework, which is based on machine descriptions in the LISA language, allows for automatic generation of software development tools, including an HLL C-compiler, assembler, linker, simulator, and graphical debugger frontend. In addition, synthesizable
hardware description language code can be derived for architecture implementation. The chapter also gives an overview of various machine description languages in the context of their suitability for the design of ASIPs; discusses the ASIP design flow; and describes the LISA language.
On-chip communication architectures are presented in the chapter State-of-the-Art SoC Communication Architectures. The chapter offers an in-depth description and analysis of the three most relevant architectures, from both industrial and research viewpoints: the ARM-developed AMBA (Advanced Microcontroller Bus Architecture) and its new interconnect schemes, namely Multi-Layer AHB and AMBA AXI; the IBM-developed CoreConnect; and the STMicroelectronics-developed STBus. In addition, the chapter surveys other architectures such as Wishbone, Sonics SiliconBackplane Micronetwork, Peripheral Interconnect Bus (PI-Bus), Avalon, and CoreFrame. The chapter also offers an analysis of selected architectures and extends the discussion of on-chip interconnects to NoC.
Basic principles and guidelines for NoC design are introduced in Network-on-Chip Design for Gigascale Systems-on-Chip. The chapter discusses the rationale for the design paradigm shift of SoC communication architectures from shared busses to NoCs, and briefly surveys related work. Subsequently, it presents details of NoC building blocks, including the switch, the network interface, and switch-to-switch links. In discussing design guidelines, the chapter uses a case study of a real NoC architecture (Xpipes), which employs some of the most advanced concepts in NoC design. It also discusses the issue of heterogeneous NoC design, and the effects of mapping the communication requirements of an application onto a domain-specific NoC.
An authoritative discussion of the platform-based design (PBD) concept is provided in the chapter
Platform-Based Design for Embedded Systems. The chapter introduces PBD principles and outlines the interplay between micro-architecture platforms and the Application Program Interface (API), or programmer model, which is a unique abstract representation of the architecture platform via the software layer. The chapter also introduces three applications of PBD: network platforms for communication protocol design, fault-tolerant platforms for the design of safety-critical applications, and analog platforms for mixed-signal integrated circuit design.
An approach to the synthesis of interface converters for incompatible protocols in component-based design automation is presented in Interface Specification and Converter Synthesis. The chapter surveys several approaches for synthesizing converters, illustrated by simple examples. It also introduces more advanced frameworks based on abstract algebraic solutions that guarantee converter correctness.
The chapter Hardware/Software Interface Design for SoC presents a component-based design automation approach for MPSoC platforms. It briefly surveys basic concepts of MPSoC design and discusses some related platform- and component-based approaches. It provides a comprehensive overview of hardware/software IP integration issues, including bus-based and core-based approaches, integrating software IP, communication synthesis (the concept is presented in detail in Interface Specification and Converter Synthesis), and IP derivation. The focal point of the chapter is a new component-based design methodology and design environment for the integration of heterogeneous hardware and software IP components. The presented methodology, which adopts the automatic communication synthesis approach and uses a high-level API, generates both hardware and software wrappers, as well as a dedicated operating system for programmable components. The IP integration capabilities of the approach and the accompanying software tools are illustrated by redesigning a part of a VDSL modem.
The chapter Design and Programming of Embedded Multiprocessors: An Interface-Centric Approach
presents a design methodology for implementing media processing applications as MPSoCs centered
around the Task Transaction Level (TTL) interface. The TTL interface can be used to build
executable specifications; it also provides a platform interface for implementing applications as communicating hardware and software tasks on a platform infrastructure. The chapter introduces the TTL interface in the context of its requirements, and discusses the mapping technology that supports structured design and programming of embedded multiprocessor systems. The chapter also presents two case studies of implementations of the TTL interface on different architectures: a multi-DSP
architecture, using an MP3 decoder application to evaluate this implementation; and a smart-imaging
multiprocessor.
The STMicroelectronics-developed StepNP™ flexible MPSoC platform and its key architectural components are described in A MultiProcessor SoC Platform and Tools for Communications Applications. The
platform was developed with the aim of exploring tool and architectural issues in a range of high-speed communications applications, particularly packet processing applications used in network infrastructure SoCs. Subsequently, the chapter reviews the MultiFlex modeling and analysis tools developed to support the StepNP platform. The MultiFlex environment supports two parallel programming models: a distributed system object component (DSOC) message-passing model and a symmetrical multiprocessing (SMP) model using shared memory. It maps these models onto the StepNP MPSoC platform. The use of the platform and supporting environment is illustrated by two examples mapping IPv4 packet forwarding and traffic management applications onto the StepNP platform. Detailed results are presented and discussed for a range of architectural parameters.
III Testing of Embedded Core-Based Integrated Circuits
The ever-increasing circuit densities and operating frequencies, as well as the use of SoC designs, have resulted in an enormous test data volume for today's embedded core-based integrated circuits. According to the Semiconductor Industry Association, in the International Technology Roadmap for Semiconductors (ITRS), 2001 Edition, the density of ICs can reach 2 billion transistors per square centimeter, and 16 billion transistors per chip are likely by 2014. On that basis, according to some estimates (A. Khoche and J. Rivoir, I/O bandwidth bottleneck for test: is it real? Test Resource Partitioning Workshop, 2002), the test data volume for ICs in 2014 is likely to increase 150-fold relative to 1999. Other problems
include the growing disparity between the performance of the design and that of the automatic test equipment, which makes at-speed testing, particularly of high-speed circuits, a challenge and results in increasing yield loss; the high cost of manually developed functional tests; and the growing cost of high-speed and high-pin-count testers. This section contains two chapters introducing new techniques addressing some of the issues
indicated above.
The chapter Modular Testing and Built-In Self-Test of Embedded Cores in System-on-Chip Integrated
Circuits presents a survey of techniques that have been proposed in the literature for reducing test time
and test data volume. The techniques surveyed rely on modular testing of embedded cores and built-in
self-test (BIST). The material on modular testing of embedded cores in a system-on-a-chip describes wrapper design and optimization, test access mechanism (TAM) design and optimization, test scheduling, integrated TAM optimization and test scheduling, and modular testing of mixed-signal SoCs. In addition, the chapter reviews a recent deterministic BIST approach in which a reconfigurable interconnection network (RIN) is placed between the outputs of the linear-feedback shift register (LFSR) and the inputs of the scan chains in the circuit under test. The RIN, which consists only of multiplexer switches, replaces the phase shifter that is typically used in pseudo-random BIST to reduce correlation between the test data bits that are fed into the scan chains. The proposed approach does not require any circuit redesign and has minimal impact on circuit performance.
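The LFSR-based pattern source described above is easiest to picture with a small model. The following sketch (a generic Fibonacci LFSR; the 4-bit width and tap positions are illustrative and not taken from the chapter) shows how such a register cycles through pseudo-random test states:

```python
def lfsr_sequence(seed, taps, width, n):
    """Generate n successive states of a Fibonacci LFSR.

    seed  -- nonzero initial register state
    taps  -- bit positions XORed together to form the feedback bit
    width -- register width in bits
    n     -- number of states to emit
    """
    state = seed
    out = []
    for _ in range(n):
        out.append(state)
        # Feedback bit: XOR of the tapped bit positions.
        fb = 0
        for t in taps:
            fb ^= (state >> t) & 1
        # Shift left, inject feedback at bit 0, truncate to width.
        state = ((state << 1) | fb) & ((1 << width) - 1)
    return out

# A maximal-length 4-bit LFSR (taps at bits 3 and 2) visits all
# 15 nonzero states before repeating.
states = lfsr_sequence(seed=0b1000, taps=(3, 2), width=4, n=16)
```

A maximal-length configuration visits every nonzero state exactly once per period, which is what makes the LFSR a compact on-chip source of pseudo-random scan-chain stimuli.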
Hardware-based self-testing (BIST) techniques have limitations due to performance, area, and design time overhead, as well as problems caused by the application of nonfunctional patterns (which may result in higher power consumption during testing, over-testing, yield loss problems, etc.). Embedded software-based self-testing has the potential to alleviate the problems caused by using external testers, as well as structural BIST problems. It utilizes on-chip programmable resources (such as embedded microprocessors and DSPs) for on-chip test generation, test delivery, signal acquisition, response analysis, and even diagnosis. The chapter Embedded Software-Based Self-Testing for SoC Design discusses processor self-test methods targeting stuck-at faults and delay faults; presents a brief description of a processor self-diagnosis method; presents methods for self-testing of buses and global
interconnects as well as other nonprogrammable IP cores on a SoC; describes instruction-level design-for-testability (DfT) methods based on the insertion of test instructions to increase fault coverage and reduce test application time and test program size; and outlines DSP-based self-test for analog/mixed-signal components.
IV Networked Embedded Systems
Networked embedded systems (NES) are essentially spatially distributed embedded nodes (implemented on a board or, in the future, on a single chip) interconnected by means of a wireline and/or wireless communication infrastructure and protocols, interacting with the environment (via sensor/actuator elements) and with each other, and possibly with a master node performing control and coordination functions in order to achieve certain goal(s). An example of a networked embedded system is an in-vehicle embedded network comprising a collection of ECUs networked by means of safety-critical communication protocols, such as FlexRay or TTP/C, for the purpose of controlling vehicle functions such as electronic engine control, anti-lock braking, active suspension, etc. (for details of automotive applications see the last section of the book).
An excellent introduction to NES is presented in the chapter Design Issues in Networked Embedded Systems. This chapter outlines some of the most representative characteristics of NES, and surveys potential
applications. It also explains design issues for large-scale distributed NES, such as environment interaction, life expectancy of nodes, communication protocols, reconfigurability, security, energy constraints, operating systems, etc. Design methodologies and tools are discussed as well.
The topic of middleware for NES is addressed in Middleware Design and Implementation for Networked
Embedded Systems. This chapter discusses the role of middleware in NES and the challenges in its design and implementation, such as remote communication, location independence, reuse of existing infrastructure, providing real-time assurances, providing a robust DOC middleware, reducing the middleware footprint, and support for simulation environments. The focal points of the chapter are the sections describing the design and implementation of nORB (a small-footprint real-time object request broker tailored to specific embedded sensor/actuator applications), and the rationale behind the adopted approach to addressing the NES design and implementation challenges.
V Sensor Networks
Distributed (wireless) sensor networks are a relatively new and exciting proposition for collecting sensory data in a variety of environments. The design of this kind of network poses particular challenges due to limited computational power and memory size, bandwidth restrictions, power consumption restrictions if battery powered, communication requirements, and unattended modes of operation in the case of inaccessible and/or hostile environments, to mention a few. This section provides a fairly comprehensive discussion of the design issues related to, in particular, self-organizing wireless networks. It introduces fundamental concepts behind sensor networks; discusses architectures, energy-efficient Medium Access Control (MAC), time synchronization, distributed localization, routing, distributed signal processing, and security; and surveys selected software solutions.
A general introduction to the area of wireless sensor networks is provided in Introduction to Wireless
Sensor Networks. A comprehensive overview of the topic is provided in Issues and Solutions in Wireless
Sensor Networks, which introduces fundamental concepts, selected application areas, design challenges,
and other relevant issues.
The chapter Architectures for Wireless Sensor Networks provides an excellent introduction to various
aspects of the architecture of wireless sensor networks. It includes the description of a sensor node
architecture and its elements: sensor platform, processing unit, communication interface, and power
source. In addition, it presents a mathematical model of power consumption by a node, to account for
energy consumption by radio, processor, and sensor elements. The chapter also discusses architectures
of wireless sensor networks developed following the protocol stack approach and the EYES project approach. In the context of the EYES project approach, which consists of only two key system abstraction layers, namely the sensor and networking layer and the distributed services layer, the chapter discusses the distributed services required to support applications for wireless sensor networks and the approaches adopted by various projects.
Energy efficiency is one of the main issues in developing MAC protocols for wireless sensor networks. This is largely due to unattended operation and battery-based power supply, and the need for collaboration that results from the limited capabilities of individual nodes. Energy-Efficient Medium Access Control offers a comprehensive overview of the issues involved in the design of MAC protocols. It contains a discussion of MAC requirements for wireless sensor networks, such as the hardware characteristics of the node, communication patterns, and others. It surveys 20 medium access protocols specially designed for sensor networks and optimized for energy efficiency. It also discusses the qualitative merits of different organizations: contention-based, slotted, and TDMA-based protocols. In addition, the chapter provides a simulation-based comparison of the performance and energy efficiency of four MAC protocols: Low Power Listening, S-MAC, T-MAC, and L-MAC.
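As a rough illustration of why radio duty cycling dominates these protocol designs, consider a back-of-the-envelope average-power model (the power figures below are illustrative assumptions, not numbers from the chapter):

```python
def average_power(duty_cycle, p_active_mw=20.0, p_sleep_mw=0.02):
    """Average radio power draw for a given listen duty cycle.

    duty_cycle  -- fraction of time the radio is awake (0..1)
    p_active_mw -- draw while listening/receiving (assumed figure)
    p_sleep_mw  -- draw while sleeping (assumed figure)
    """
    return duty_cycle * p_active_mw + (1.0 - duty_cycle) * p_sleep_mw

always_on = average_power(1.0)     # 20.0 mW
duty_cycled = average_power(0.01)  # 0.2198 mW, roughly a 90x saving
```

Under these assumed figures, listening only 1% of the time cuts average power by nearly two orders of magnitude, which is the basic trade all the surveyed protocols manage against latency and throughput.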
Knowledge of time at a sensor node may be essential for the correct operation of the system. The Time Division Multiple Access (TDMA) scheme (adopted in the TTP/C and FlexRay protocols, for instance; see the section on automotive applications) requires the nodes to be synchronized. Time synchronization issues in sensor networks are discussed in Overview of Time Synchronization Issues in Sensor Networks. The chapter introduces the basics of time synchronization for sensor networks. It also describes design challenges and requirements in developing time synchronization protocols, such as the need to be robust, energy aware, able to operate correctly in the absence of time servers (server-less), and light-weight, and to offer a tunable service. The chapter also overviews factors influencing time synchronization, such as temperature, phase noise, frequency noise, asymmetric delays, and clock glitches. Subsequently, different types of timing techniques are discussed: the Network Time Protocol (NTP), the Timing-sync Protocol for Sensor Networks (TPSN), Reference-Broadcast Synchronization (RBS), and the Time-Diffusion Synchronization Protocol (TDP).
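The effect of asymmetric delays mentioned above is easiest to see in the classic two-way time-transfer calculation on which NTP- and TPSN-style protocols are built. A minimal sketch (the tick values are hypothetical; the symmetric-delay assumption is exactly what asymmetric links violate):

```python
def estimate_offset_delay(t1, t2, t3, t4):
    """Two-way time transfer, as used by NTP- and TPSN-style protocols.

    t1 -- request sent (client clock)
    t2 -- request received (server clock)
    t3 -- reply sent (server clock)
    t4 -- reply received (client clock)

    Assumes symmetric propagation delay; an asymmetric link biases
    the offset estimate by half the delay difference.
    """
    offset = ((t2 - t1) - (t4 - t3)) / 2.0
    delay = ((t2 - t1) + (t4 - t3)) / 2.0
    return offset, delay

# Server clock 5 ticks ahead of the client, one-way delay 2 ticks.
offset, delay = estimate_offset_delay(t1=10, t2=17, t3=18, t4=15)
```

With the example values the client recovers an offset of 5 ticks and a round-trip delay of 4 ticks (2 each way), but only because the two directions were equally slow.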
Knowledge of the location of nodes is essential for the base station to process information from sensors and to arrive at valid and meaningful results. Localization issues in ad hoc wireless sensor networks are discussed in Distributed Localization Algorithms. The focus of the presentation is on three distributed localization algorithms for large-scale ad hoc sensor networks that meet the basic requirements of self-organization, robustness, and energy efficiency: ad hoc positioning by Niculescu and Nath, N-hop multilateration by Savvides et al., and robust positioning by Savarese et al. The selected algorithms are evaluated by simulation.
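The flavor of these algorithms can be conveyed by the basic multilateration step they build on: estimating a node's position from ranges to anchors with known positions. A minimal least-squares sketch (2-D, noise-free ranges; the function name and setup are illustrative, not from the chapter):

```python
import math

def multilaterate(anchors, distances):
    """Least-squares 2-D position from ranges to >= 3 known anchors.

    Linearizes the range equations by subtracting the last anchor's
    equation from the others, then solves the 2x2 normal equations
    with Cramer's rule (no matrix library needed on a small node).
    """
    (xn, yn), dn = anchors[-1], distances[-1]
    rows = []
    for (x, y), d in zip(anchors[:-1], distances[:-1]):
        a = 2.0 * (xn - x)
        b = 2.0 * (yn - y)
        c = d * d - dn * dn - x * x + xn * xn - y * y + yn * yn
        rows.append((a, b, c))
    # Normal equations (A^T A) p = A^T c for the unknown position p.
    saa = sum(a * a for a, b, c in rows)
    sab = sum(a * b for a, b, c in rows)
    sbb = sum(b * b for a, b, c in rows)
    sac = sum(a * c for a, b, c in rows)
    sbc = sum(b * c for a, b, c in rows)
    det = saa * sbb - sab * sab
    return ((sac * sbb - sbc * sab) / det,
            (saa * sbc - sab * sac) / det)

# Node at (3, 4), three anchors, exact ranges.
anchors = [(0.0, 0.0), (10.0, 0.0), (0.0, 10.0)]
ranges = [math.hypot(3.0 - x, 4.0 - y) for x, y in anchors]
position = multilaterate(anchors, ranges)
```

Real deployments add noisy range estimates and anchors whose own positions are only estimated, which is where the robustness and iterative refinement of the three surveyed algorithms come in.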
In order to forward information from a sensor node to the base station or to another node for processing, the node requires routing information. The chapter Routing in Sensor Networks provides a comprehensive survey of routing protocols used in sensor networks. The presentation is divided into flat routing protocols: Sequential Assignment Routing (SAR), directed diffusion, the minimum cost forwarding approach, the Integer Linear Program (ILP)-based routing approach, Sensor Protocols for Information via Negotiation (SPIN), geographic routing protocols, the parametric probabilistic routing protocol, and Min-MinMax; and cluster-based routing protocols: Low Energy Adaptive Clustering Hierarchy (LEACH), Threshold sensitive Energy Efficient sensor Network protocol (TEEN), and a two-level clustering algorithm.
Due to their limited resources, sensor nodes frequently provide incomplete information on the objects of their observation. Thus, the complete information has to be reconstructed from data obtained from many nodes, which frequently provide redundant data. Distributed data fusion is one of the major challenges in sensor networks. The chapter Distributed Signal Processing in Sensor Networks introduces a novel mathematical model for distributed information fusion, which focuses on solving a benchmark signal processing problem (spectrum estimation) using sensor networks.
With the deployment of sensor networks in areas such as the battlefield or the factory floor, security becomes of paramount importance, and a challenge. Existing solutions are impractical due to the limited capabilities (processing power, available memory, and available energy) of sensor nodes. The chapter
Sensor Network Security gives an introduction to selected security challenges specific to wireless sensor networks: denial of service and routing security, energy-efficient confidentiality and integrity, authenticated broadcast, alternative approaches to key management, and secure data aggregation. Subsequently, it discusses in detail some of the proposed approaches and solutions: the SNEP and µTESLA protocols for confidentiality and integrity of data, and the LEAP protocol and probabilistic key management for key management, to mention a few.
The chapter Software Development for Large-Scale Wireless Sensor Networks presents basic concepts related to software development for wireless sensor networks, as well as selected software solutions. The solutions include: TinyOS, a component-based operating system, and related software packages; Maté, a byte-code interpreter; and TinyDB, a query processing system for extracting information from a network of TinyOS sensor nodes. SensorWare, a software framework for wireless sensor networks, provides querying, dissemination, and fusion of sensor data, as well as coordination of actuators. MiLAN (Middleware Linking Applications and Networks), a middleware concept, aims to exploit the information redundancy provided by sensor nodes. EnviroTrack, a TinyOS-based application, provides a convenient way to program sensor network applications that track activities in their physical environment. SeNeTs, a middleware architecture for wireless sensor networks, is designed to support the pre-deployment phase. The chapter also discusses software solutions for the simulation, emulation, and testing of large-scale sensor networks: TinyOS SIMulator (TOSSIM), a simulator based on the TinyOS framework; EmStar, a software environment for developing and deploying applications for sensor networks consisting of 32-bit embedded Microserver platforms; and SeNeTs, a test and validation environment.
VI Embedded Applications
The last section in the book, Embedded Applications, focuses on selected applications of embedded systems. It covers the automotive field, industrial automation, and intelligent sensors. The aim of this section is to introduce examples of actual embedded applications in fast-evolving areas which, for various reasons, have not received proper coverage in other publications, particularly in the automotive area.
Automotive Networks
The automotive industry is aggressively adopting mechatronic solutions to replace or duplicate existing
mechanical/hydraulic systems. The embedded electronic systems together with dedicated communication
networks and protocols play pivotal roles in this transition. This subsection contains three chapters that offer a comprehensive overview of the area by presenting topics such as networks and protocols, operating systems and other middleware, scheduling, safety and fault tolerance, and actual development tools used by the automotive industry.
This section begins with a contribution entitled Design and Validation Process of In-Vehicle Embedded Electronic Systems, which provides a comprehensive introduction to the use of embedded systems in automobiles, their design and validation methods, and tools. The chapter identifies and describes a number of specific application domains for in-vehicle embedded systems, such as power train, chassis, body, and telematics and HMI. It then outlines some of the main standards used in the automotive industry to ensure interoperability between components developed by different vendors; this includes networks and protocols, as well as operating systems. The surveyed networks and protocols include (for details of networks and protocols see The Industrial Communication Technology Handbook, CRC Press, 2005, Richard Zurawski, editor) Controller Area Network (CAN), Vehicle Area Network (VAN), J1850, TTP/C (Time-Triggered Protocol), FlexRay, Local Interconnect Network (LIN), Media Oriented System Transport (MOST), and IDB-1394. This material is followed by a brief introduction to OSEK/VDX (Offene Systeme und deren Schnittstellen für die Elektronik im Kraftfahrzeug), a multitasking operating system that has become a standard for automotive applications in Europe. The chapter introduces a new language, EAST-ADL, which offers support for an unambiguous description of in-vehicle embedded electronic
systems at each level of their development. The discussion of the design and validation process and related issues is facilitated by a comprehensive case study drawn from an actual PSA Peugeot-Citroën application. This case study is essential reading for those interested in the development of this kind of embedded system.
The planned adoption of X-by-wire technologies in automotive applications has pushed the automotive industry into the realm of safety-critical systems. There is a substantial body of literature on safety-critical issues and fault tolerance, particularly as applied to components and systems. Less has been published on safety-relevant communication services and fault-tolerant communication systems as mandated by X-by-wire technologies in automotive applications. This is largely due to the novelty of fast-evolving concepts and solutions, pursued mostly by industrial consortia. These two topics are presented in detail in Fault-Tolerant Services for Safe In-Car Embedded Systems. The material on safety-relevant
communication services discusses some of the main services and functionalities that the communication system should provide to facilitate the design of fault-tolerant automotive applications. This includes services supporting reliable communication, such as robustness against electromagnetic interference (EMI), time-triggered transmission, global time, atomic broadcast, and the avoidance of "babbling idiots." Also discussed are higher-level services that provide fault-tolerant mechanisms belonging conceptually to layers above the MAC layer in the OSI reference model, namely the group membership service, management of node redundancy, support for functioning modes, etc. The chapter also discusses fault-tolerant communication protocols, including TTP/C, FlexRay, and variants of CAN (TTCAN, RedCAN, and CANcentrate).
The Volcano concept for the design and implementation of in-vehicle networks using the standardized CAN and LIN communication protocols is presented in the chapter Volcano: Enabling Correctness by Design. This chapter provides an in-depth description of the Volcano approach and a suite of software tools, developed by Volcano Communications Technologies AG, which supports requirements capture, model-based design, automatic code generation, and system-level validation capabilities. This is an example of an actual development environment widely used by the automotive industry.
Industrial Automation
The current trend toward flexible and distributed control and automation has accelerated the migration of intelligence and control functions to the field devices, particularly sensors and actuators. The increased processing capabilities of those devices were instrumental in the emergence of a trend toward networking of field devices around industrial data networks, thus making access to any device from any place in the plant, or even globally, technically feasible. The benefits are numerous, including increased flexibility, improved system performance, and ease of system installation, upgrade, and maintenance. Embedded web servers are increasingly used in industrial automation to provide a Human-Machine Interface (HMI), which allows for web-based configuration, control, and monitoring of devices and industrial processes.
An introduction to the design of embedded web servers is presented in the chapter Embedded Web Servers in Distributed Control Systems. The focus of this chapter is on Field Device Web Servers (FDWS). The chapter provides a comprehensive overview of the context in which embedded web servers are usually implemented, as well as the structure of an FDWS application, with a presentation of its component packages and the mutual relationship between the content of the packages and the architecture of a typical embedded site. All this is discussed in the context of an actual FDWS implementation and application deployed at one of the Alstom (France) sites.
Remote access to field devices may lead to many security challenges. Embedded web servers are typically run on processors with limited memory and processing power. These restrictions necessitate the deployment of lightweight security mechanisms. Vendor-tailored versions of standard security protocol suites such as Secure Sockets Layer (SSL) and IP Security Protocol (IPSec) may still not be suitable due to their excessive demand for resources. In applications restricted to the Hypertext Transfer Protocol (HTTP), Digest Access Authentication (DAA), a security extension to HTTP, offers an alternative and viable solution. These issues are discussed in the chapter HTTP Digest Authentication for Embedded Web
Servers. This chapter overviews the mechanisms and services, as well as potential applications, of HTTP Digest Authentication. It also surveys selected embedded web server implementations for their support of DAA. These include Apache 2.0.42, Allegro RomPager 4.05, and GoAhead 2.1.2.
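The DAA mechanism referred to above boils down to a challenge-response computation over MD5 hashes, as specified in RFC 2617. A minimal sketch of the response computation (the basic variant without the qop extension; the credentials and nonce below are hypothetical):

```python
import hashlib

def digest_response(username, realm, password, method, uri, nonce):
    """RFC 2617 Digest response, basic variant (MD5, no qop/cnonce).

    HA1      = MD5(username ":" realm ":" password)
    HA2      = MD5(method ":" uri)
    response = MD5(HA1 ":" nonce ":" HA2)
    """
    def md5_hex(s):
        return hashlib.md5(s.encode("utf-8")).hexdigest()

    ha1 = md5_hex(f"{username}:{realm}:{password}")
    ha2 = md5_hex(f"{method}:{uri}")
    return md5_hex(f"{ha1}:{nonce}:{ha2}")

# Hypothetical field-device credentials (not from the chapter).
resp = digest_response("device-admin", "fdws", "secret",
                       "GET", "/status", "0123456789abcdef")
```

Because only hashes cross the wire, the password itself is never transmitted, and the computation needs nothing beyond an MD5 routine, which is what makes DAA attractive on resource-constrained web servers.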
Intelligent Sensors
Advances in the design of embedded systems, the availability of tools, and falling fabrication costs have allowed for cost-effective migration of intelligence and control functions to the field devices, particularly sensors and actuators. Intelligent sensors combine computing, communication, and sensing functions. The trend toward increased functional complexity of those devices necessitates the use of formal description techniques and supporting tools throughout the design and implementation process. The chapter Intelligent Sensors: Analysis and Design tackles some of those issues. It reviews some of the main characteristics of a generic intelligent sensor formal model; subsequently, it discusses an implementation of the model using the CAP language, which was developed specifically for the design of intelligent sensors. A brief introduction to the language is also provided. The whole development process is illustrated using the example of a simple distance-measuring system comprising an ultrasonic transmitter and two receivers.
Locating Topics
To assist readers with locating material, a complete table of contents is presented at the front of the book.
Each chapter begins with its own table of contents. Two indexes are provided at the end of the book: the
index of authors contributing to the book, together with the titles of their contributions, and a detailed
subject index.
Richard Zurawski
Acknowledgments
My gratitude goes to Luciano Lavagno, Grant Martin, and Alberto Sangiovanni-Vincentelli who have
provided advice and support while preparing this book. This book would never have had a chance to
take off without their assistance. Andreas Willig helped with identifying some authors for the section on
Sensor Networks. Also, I would like to thank the members of the International Advisory Board for their
help with the organization of the book and selection of authors. I have received tremendous cooperation
from all contributing authors. I would like to thank all of them for that. I would like to express gratitude
to my publisher Nora Konopka, and other Taylor and Francis staff involved in the book production,
particularly Jessica Vakili, Elizabeth Spangenberger, and Gail Renard. My love goes to my wife who
tolerated the countless hours I spent on preparing this book.
About the Editor
Dr. Richard Zurawski is president of ISA Group, San Francisco and Santa Clara, CA, involved in providing
solutions to Fortune 1000 companies. Prior to that, he held various executive positions with San Francisco
Bay area based companies. Dr. Zurawski is a cofounder of the Institute for Societal Automation, Santa
Clara, a research and consulting organization.
Dr. Zurawski has close to thirty years of academic and industrial experience, including a regular
professorial appointment at the Institute of Industrial Sciences, University of Tokyo, and a full-time
R&D advisor position with Kawasaki Electric Corp., Tokyo. He provided consulting services to the Kawasaki Electric, Ricoh, and Toshiba Corporations, Japan, and participated in the 1990s in a number of Japanese Intelligent Manufacturing Systems programs.
Dr. Zurawski has served as editor at large for IEEE Transactions on Industrial Informatics, and associate
editor for IEEE Transactions on Industrial Electronics; he also served as associate editor for Real-Time
Systems: The International Journal of Time-Critical Computing Systems, Kluwer Academic Publishers. He
was a guest editor of four special sections in IEEE Transactions on Industrial Electronics and a guest editor
of a special issue of the Proceedings of the IEEE dedicated to industrial communication systems. In 1998,
he was invited by IEEE Spectrum to contribute material on Java technology to Technology 1999: Analysis
and Forecast Issues. Dr. Zurawski is series editor for The Industrial Information Technology Series, Taylor
and Francis Group, Boca Raton, FL.
Dr. Zurawski has served as a vice president of the Institute of Electrical and Electronics Engineers
(IEEE) Industrial Electronics Society (IES), and was on the steering committee of the ASME/IEEE Journal
of Microelectromechanical Systems. In 1996, he received the Anthony J. Hornfeck Service Award from the
IEEE Industrial Electronics Society.
Dr. Zurawski has served as a general, program, and track chair for a number of IEEE conferences and
workshops, and has published extensively on various aspects of formal methods in the design of real-time,
embedded, and industrial systems, MEMS, parallel and distributed programming and systems, as well as
control and robotics. He is the editor of The Industrial Information Technology Handbook (2004), and The
Industrial Communication Technology Handbook (2005), both published by Taylor and Francis Group.
Dr. Richard Zurawski received his M.Sc. in informatics and automation from the University of Mining and
Metallurgy, Krakow, Poland, and his Ph.D. in computer science from La Trobe University, Melbourne, Australia.
Contributors
Parham Aarabi
Department of Electrical and
Computer Engineering
University of Toronto
Ontario, Canada
José L. Ayala
Dpto. Ingenieria Electronica
E.T.S.I. Telecomunicacion
Ciudad Universitaria s/n
Madrid, Spain
João Paulo Barros
Universidade Nova de Lisboa
Faculdade de Ciências e
Tecnologia
Dep. Eng. Electrotécnica
Caparica, Portugal
Ali Alphan Bayazit
Princeton University
Princeton, New Jersey
Luca Benini
Dipartimento Elettronica
Informatica Sistemistica
University of Bologna
Bologna, Italy
Essaid Bensoudane
Advanced System Technology
STMicroelectronics
Ontario, Canada
Ivan Cibrario Bertolotti
IEIIT National Research
Council
Turin, Italy
Davide Bertozzi
Dipartimento Elettronica
Informatica Sistemistica
University of Bologna
Bologna, Italy
Jan Blumenthal
Institute of Applied
Microelectronics and
Computer Science
Dept. of Electrical
Engineering and
Information
Technology
University of Rostock
Rostock, Germany
Gunnar Braun
CoWare Inc.
Aachen, Germany
Giorgio C. Buttazzo
Dip. di Informatica e
Sistemistica
University of Pavia
Pavia, Italy
Luca P. Carloni
EECS Department
University of California at
Berkeley
Berkeley, California
Wander O. Cesário
SLS Group
TIMA Laboratory
Grenoble, France
Krishnendu Chakrabarty
Department of Electrical and
Computer Engineering
Duke University
Durham, North Carolina
S. Chatterjea
Faculty of Electrical Engineering,
Mathematics, and Computer
Science
University of Twente
Enschede
The Netherlands
Kwang-Ting (Tim) Cheng
Department of Electrical and
Computer Engineering
University of California
Santa Barbara, California
Anikó Costa
Universidade Nova de Lisboa,
Faculdade de Ciências e
Tecnologia
Dep. Eng. Electrotécnica
Caparica, Portugal
Mario Crevatin
Corporate Research
ABB Switzerland Ltd
Baden-Dättwil, Switzerland
Fernando De Bernardinis
EECS Department
University of California at
Berkeley
Berkeley, California
Erwin de Kock
Philips Research
Eindhoven, The Netherlands
Giovanni De Micheli
Gates Computer Science
Stanford University
Stanford, California
Robert de Simone
INRIA
Sophia-Antipolis, France
Eric Dekneuvel
University of Nice Sophia
Antipolis
Biot, France
S. Dulman
Faculty of Electrical Engineering,
Mathematics, and Computer
Science
University of Twente
Enschede
The Netherlands
Stephen A. Edwards
Department of Computer Science
Columbia University
New York, New York
Gerben Essink
Philips Research
Eindhoven, The Netherlands
A. G. Fragopoulos
Department of Electrical and
Computer Engineering
University of Patras
Patras, Greece
Shashidhar Gandham
The Department of Computer
Science
The University of Texas at Dallas
Richardson, Texas
Christopher Gill
Department of Computer Science
and Engineering
Washington University
St. Louis, Missouri
Frank Golatowski
Institute of Applied
Microelectronics and
Computer Science
Dept. of Electrical Engineering
and Information Technology
University of Rostock
Rostock, Germany
Luís Gomes
Universidade Nova de Lisboa
Faculdade de Ciências e
Tecnologia
Dep. Eng. Electrotécnica
Caparica, Portugal
Aarti Gupta
NEC Laboratories America
Princeton, New Jersey
Rajesh Gupta
Department of Computer Science
and Engineering
University of California at
San Diego
San Diego, California
Sumit Gupta
Tallwood Venture Capital
Palo Alto, California
Marc Haase
Institute of Applied
Microelectronics and
Computer Science
Dept. of Electrical Engineering
and Information Technology
University of Rostock
Rostock, Germany
Gertjan Halkes
Faculty of Electrical Engineering,
Mathematics, and Computer
Science
Delft University of Technology
Delft, The Netherlands
Matthias Handy
Institute of Applied
Microelectronics and
Computer Science
Dept. of Electrical Engineering
and Information
Technology
University of Rostock
Rostock, Germany
Hans Hansson
Department of Computer Science
and Engineering
Mälardalen University
Västerås, Sweden
P. Havinga
Faculty of Electrical Engineering,
Mathematics, and Computer
Science
University of Twente
Enschede
The Netherlands
Øystein Haugen
Department of Informatics
University of Oslo
Oslo, Norway
Tomas Henriksson
Philips Research
Eindhoven, The Netherlands
Andreas Hoffmann
CoWare Inc.
Aachen, Germany
T. Hoffmeijer
Faculty of Electrical Engineering,
Mathematics, and Computer
Science
University of Twente
Enschede
The Netherlands
J. Hurink
Faculty of Electrical Engineering,
Mathematics, and Computer
Science
University of Twente
Enschede
The Netherlands
Margarida F. Jacome
Department of Electrical and
Computer Engineering
University of Texas at Austin
Austin, Texas
Omid S. Jahromi
Bioscrypt Inc.
Markham, Ontario, Canada
Axel Jantsch
Department for Microelectronics
and Information Technology
Royal Institute of Technology
Kista, Sweden
A. A. Jerraya
SLS Group
TIMA Laboratory
Grenoble, France
J. V. Kapitonova
Glushkov Institute of Cybernetics
National Academy of Science of
Ukraine
Kiev, Ukraine
Alex Kondratyev
Cadence Berkeley Labs
Berkeley, California
Wido Kruijtzer
Philips Research
Eindhoven, The Netherlands
Koen Langendoen
Faculty of Electrical Engineering,
Mathematics, and Computer
Science
Delft University of Technology
Delft, The Netherlands
Michel Langevin
Advanced System Technology
STMicroelectronics
Ontario, Canada
Luciano Lavagno
Cadence Berkeley Laboratories
Berkeley, California; and
Dipartimento di Elettronica
Politecnico di Torino, Italy
A. A. Letichevsky
Glushkov Institute of Cybernetics
National Academy of Science
of Ukraine
Kiev, Ukraine
Marisa López-Vallejo
Dpto. Ingenieria Electronica
E.T.S.I. Telecomunicacion
Ciudad Universitaria s/n
Madrid, Spain
Damien Lyonnard
Advanced System Technology
STMicroelectronics
Ontario, Canada
Yogesh Mahajan
Princeton University
Princeton, New Jersey
Grant Martin
Tensilica Inc.
Santa Clara, California
Birger Møller-Pedersen
Department of Informatics
University of Oslo
Oslo, Norway
Ravi Musunuri
The Department of Computer
Science
The University of Texas at Dallas
Richardson, Texas
Nicolas Navet
Institut National Polytechnique
de Lorraine
Nancy, France
Gabriela Nicolescu
Ecole Polytechnique
de Montreal
Montreal, Quebec
Canada
Achim Nohl
CoWare Inc.
Aachen, Germany
Mikael Nolin
Department of Computer Science
and Engineering
Mälardalen University
Västerås, Sweden
Thomas Nolte
Department of Computer Science
and Engineering
Mälardalen University
Västerås, Sweden
Claudio Passerone
Dipartimento di Elettronica
Politecnico di Torino
Turin, Italy
Roberto Passerone
Cadence Design Systems, Inc.
Berkeley Cadence Labs
Berkeley, California
Hiren D. Patel
Electrical and Computer
Engineering
Virginia Tech
Blacksburg, Virginia
Maulin D. Patel
The Department of Computer
Science
The University of Texas at Dallas
Richardson, Texas
Pierre G. Paulin
Advanced System Technology
STMicroelectronics
Ontario, Canada
Chuck Pilkington
Advanced System Technology
STMicroelectronics
Ontario, Canada
Claudio Pinello
EECS Department
University of California at
Berkeley
Berkeley, California
Dumitru Potop-Butucaru
IRISA
Rennes, France
Antal Rajnák
Advanced Engineering Labs
Volcano Communications
Technologies AG
Tägerwilen, Switzerland
Anand Ramachandran
Department of Electrical and
Computer Engineering
University of Texas at Austin
Austin, Texas
Niels Reijers
Faculty of Electrical Engineering,
Mathematics, and Computer
Science
Delft University of Technology
Delft, The Netherlands
Alberto L.
Sangiovanni-Vincentelli
EECS Department
University of California at
Berkeley
Berkeley, California
Udit Saxena
Microsoft Corporation
Seattle, Washington
Guenter Schaefer
Institute of Telecommunication
Systems
Technische Universität Berlin
Berlin, Germany
D. N. Serpanos
Department of Electrical and
Computer Engineering
University of Patras
Patras, Greece
Marco Sgroi
EECS Department
University of California at
Berkeley
Berkeley, California
Sandeep K. Shukla
Electrical and Computer
Engineering
Virginia Tech
Blacksburg, Virginia
Françoise Simonot-Lion
Institut National Polytechnique
de Lorraine
Nancy, France
YeQiong Song
Université Henri Poincaré
Nancy, France
Weilian Su
Broadband and Wireless
Networking Laboratory
School of Electrical and Computer
Engineering
Georgia Institute of Technology
Atlanta, Georgia
Venkita Subramonian
Department of Computer Science
and Engineering
Washington University
St. Louis, Missouri
Jacek Szymanski
ALSTOM Transport
Centre Meudon La Forêt
Meudon La Forêt, France
Jean-Pierre Talpin
IRISA
Rennes, France
Lothar Thiele
Department Information
Technology and Electrical
Engineering
Computer Engineering and
Networks Laboratory
Swiss Federal Institute of
Technology
Zurich, Switzerland
Pieter van der Wolf
Philips Research
Eindhoven, The Netherlands
V. A. Volkov
Glushkov Institute of
Cybernetics
National Academy of Science
of Ukraine
Kiev, Ukraine
Thomas P. von Hoff
ABB Switzerland Ltd
Corporate Research
Baden-Dättwil, Switzerland
A. G. Voyiatzis
Department of Electrical and
Computer Engineering
University of Patras
Patras, Greece
Flávio R. Wagner
UFRGS Instituto de
Informática
Porto Alegre, Brazil
Ernesto Wandeler
Department Information
Technology and Electrical
Engineering
Computer Engineering and
Networks Laboratory
Swiss Federal Institute of
Technology
Zurich, Switzerland
Yosinori Watanabe
Cadence Berkeley Labs
Berkeley, California
Thomas Weigert
Global Software Group
Motorola
Schaumburg, Illinois
Reinhard Wilhelm
University of Saarland
Saarbruecken, Germany
Richard Zurawski
ISA Group
San Francisco, California
Contents
SECTION I Embedded Systems
Real-Time and Embedded Systems
1 Embedded Systems: Toward Networking of Embedded Systems
Luciano Lavagno and Richard Zurawski . . . . . . . . . . . . . 1-1
2 Real-Time in Embedded Systems Hans Hansson, Mikael Nolin, and
Thomas Nolte . . . . . . . . . . . . . . . . . . . . . . . . 2-1
Design and Validation of Embedded Systems
3 Design of Embedded Systems Luciano Lavagno and
Claudio Passerone . . . . . . . . . . . . . . . . . . . . . . 3-1
4 Models of Embedded Computation Axel Jantsch . . . . . . . . . 4-1
5 Modeling Formalisms for Embedded System Design Luís Gomes, João
Paulo Barros, and Anikó Costa . . . . . . . . . . . . . . . . . 5-1
6 System Validation J.V. Kapitonova, A.A. Letichevsky, V.A. Volkov,
and Thomas Weigert . . . . . . . . . . . . . . . . . . . . . 6-1
Design and Verification Languages
7 Languages for Embedded Systems Stephen A. Edwards . . . . . . 7-1
8 The Synchronous Hypothesis and Synchronous Languages
Dumitru Potop-Butucaru, Robert de Simone, and Jean-Pierre Talpin . 8-1
9 Introduction to UML and the Modeling of Embedded Systems
Øystein Haugen, Birger Møller-Pedersen, and Thomas Weigert . . . 9-1
10 Verification Languages Aarti Gupta, Ali Alphan Bayazit, and
Yogesh Mahajan . . . . . . . . . . . . . . . . . . . . . . . 10-1
Operating Systems and Quasi-Static Scheduling
11 Real-Time Embedded Operating Systems: Standards and Perspectives
Ivan Cibrario Bertolotti . . . . . . . . . . . . . . . . . . . . 11-1
12 Real-Time Operating Systems: The Scheduling and Resource
Management Aspects Giorgio C. Buttazzo . . . . . . . . . . . 12-1
13 Quasi-Static Scheduling of Concurrent Specifications
Alex Kondratyev, Luciano Lavagno, Claudio Passerone, and
Yosinori Watanabe . . . . . . . . . . . . . . . . . . . . . . 13-1
Timing and Performance Analysis
14 Determining Bounds on Execution Times Reinhard Wilhelm . . . 14-1
15 Performance Analysis of Distributed Embedded Systems
Lothar Thiele and Ernesto Wandeler . . . . . . . . . . . . . . . 15-1
Power Aware Computing
16 Power Aware Embedded Computing Margarida F. Jacome and
Anand Ramachandran . . . . . . . . . . . . . . . . . . . . . 16-1
Security in Embedded Systems
17 Design Issues in Secure Embedded Systems A.G. Voyiatzis,
A.G. Fragopoulos, and D.N. Serpanos . . . . . . . . . . . . . . 17-1
SECTION II System-on-Chip Design
18 System-on-Chip and Network-on-Chip Design Grant Martin . . . 18-1
19 A Novel Methodology for the Design of Application-Specific
Instruction-Set Processors Andreas Hoffmann, Achim Nohl, and
Gunnar Braun . . . . . . . . . . . . . . . . . . . . . . . . 19-1
20 State-of-the-Art SoC Communication Architectures José L. Ayala,
Marisa López-Vallejo, Davide Bertozzi, and Luca Benini . . . . . . 20-1
21 Network-on-Chip Design for Gigascale Systems-on-Chip
Davide Bertozzi, Luca Benini, and Giovanni De Micheli . . . . . . 21-1
22 Platform-Based Design for Embedded Systems Luca P. Carloni,
Fernando De Bernardinis, Claudio Pinello,
Alberto L. Sangiovanni-Vincentelli, and Marco Sgroi . . . . . . . 22-1
23 Interface Specication and Converter Synthesis Roberto Passerone . 23-1
24 Hardware/Software Interface Design for SoC Wander O. Cesário,
Flávio R. Wagner, and A.A. Jerraya . . . . . . . . . . . . . . . 24-1
25 Design and Programming of Embedded Multiprocessors: An
Interface-Centric Approach Pieter van der Wolf, Erwin de Kock,
Tomas Henriksson, Wido Kruijtzer, and Gerben Essink . . . . . . 25-1
26 A Multiprocessor SoC Platform and Tools for Communications
Applications Pierre G. Paulin, Chuck Pilkington, Michel Langevin,
Essaid Bensoudane, Damien Lyonnard, and Gabriela Nicolescu . . . 26-1
SECTION III Testing of Embedded Core-Based Integrated
Circuits
27 Modular Testing and Built-In Self-Test of Embedded Cores in
System-on-Chip Integrated Circuits Krishnendu Chakrabarty . . . 27-1
28 Embedded Software-Based Self-Testing for SoC Design
Kwang-Ting (Tim) Cheng . . . . . . . . . . . . . . . . . . . 28-1
SECTION IV Networked Embedded Systems
29 Design Issues for Networked Embedded Systems Sumit Gupta,
Hiren D. Patel, Sandeep K. Shukla, and Rajesh Gupta . . . . . . . 29-1
30 Middleware Design and Implementation for Networked Embedded
Systems Venkita Subramonian and Christopher Gill . . . . . . . 30-1
SECTION V Sensor Networks
31 Introduction to Wireless Sensor Networks S. Dulman, S. Chatterjea,
and P. Havinga . . . . . . . . . . . . . . . . . . . . . . . . 31-1
32 Issues and Solutions in Wireless Sensor Networks Ravi Musunuri,
Shashidhar Gandham, and Maulin D. Patel . . . . . . . . . . . 32-1
33 Architectures for Wireless Sensor Networks S. Dulman,
S. Chatterjea, T. Hoffmeijer, P. Havinga, and J. Hurink . . . . . . 33-1
34 Energy-Efficient Medium Access Control Koen Langendoen and
Gertjan Halkes . . . . . . . . . . . . . . . . . . . . . . . . 34-1
35 Overview of Time Synchronization Issues in Sensor Networks
Weilian Su . . . . . . . . . . . . . . . . . . . . . . . . . . 35-1
36 Distributed Localization Algorithms Koen Langendoen and
Niels Reijers . . . . . . . . . . . . . . . . . . . . . . . . . 36-1
37 Routing in Sensor Networks Shashidhar Gandham, Ravi Musunuri,
and Udit Saxena . . . . . . . . . . . . . . . . . . . . . . . 37-1
38 Distributed Signal Processing in Sensor Networks Omid S. Jahromi
and Parham Aarabi . . . . . . . . . . . . . . . . . . . . . . 38-1
39 Sensor Network Security Guenter Schaefer . . . . . . . . . . . 39-1
40 Software Development for Large-Scale Wireless Sensor Networks
Jan Blumenthal, Frank Golatowski, Marc Haase, and
Matthias Handy . . . . . . . . . . . . . . . . . . . . . . . 40-1
SECTION VI Embedded Applications
Automotive Networks
41 Design and Validation Process of In-Vehicle Embedded Electronic
Systems Françoise Simonot-Lion and YeQiong Song . . . . . . . 41-1
42 Fault-Tolerant Services for Safe In-Car Embedded Systems
Nicolas Navet and Françoise Simonot-Lion . . . . . . . . . . . . 42-1
43 Volcano – Enabling Correctness by Design Antal Rajnák . . . . . 43-1
Industrial Automation
44 Embedded Web Servers in Distributed Control Systems
Jacek Szymanski . . . . . . . . . . . . . . . . . . . . . . . 44-1
45 HTTP Digest Authentication for Embedded Web Servers
Mario Crevatin and Thomas P. von Hoff . . . . . . . . . . . . . 45-1
Intelligent Sensors
46 Intelligent Sensors: Analysis and Design Eric Dekneuvel . . . . . 46-1
I
Embedded Systems
Real-Time and
Embedded Systems
1 Embedded Systems: Toward Networking of Embedded Systems
Luciano Lavagno and Richard Zurawski
2 Real-Time in Embedded Systems
Hans Hansson, Mikael Nolin, and Thomas Nolte
1
Embedded Systems:
Toward Networking
of Embedded Systems
Luciano Lavagno
Cadence Berkeley Laboratories and
Politecnico di Torino
Richard Zurawski
ISA Group
1.1 Networking of Embedded Systems . . . . . . . . . . . . . . . . . . . . . 1-1
1.2 Design Methods for Networked Embedded
Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-3
1.3 Networked Embedded Systems . . . . . . . . . . . . . . . . . . . . . . . . . 1-5
Networked Embedded Systems in Industrial Automation •
Networked Embedded Systems in Building Automation •
Automotive Networked Embedded Systems • Sensor Networks
1.4 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-14
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1-14
1.1 Networking of Embedded Systems
The last two decades have witnessed a remarkable evolution of embedded systems: from systems assembled
from discrete components on printed circuit boards (as many still are) to systems assembled from
Intellectual Property (IP) components dropped onto the silicon of a system on a chip. Systems on
a chip offer the potential to embed complex functionalities and to meet the demanding performance
requirements of applications such as DSPs, network processors, and multimedia processors. Another phase in this
evolution, already in progress, is the emergence of distributed embedded systems, frequently termed
networked embedded systems, where the word "networked" signifies the importance of the networking
infrastructure and communication protocol. A networked embedded system is a collection of spatially and
functionally distributed embedded nodes, interconnected by means of a wireline or wireless communication
infrastructure and protocols, interacting with the environment (via sensor/actuator elements) and with each
other, and possibly including a master node that performs control and coordination functions, so as to coordinate
computing and communication in order to achieve certain goal(s). Networked embedded systems
appear in a variety of application domains, such as automotive, train, aircraft, office building, and
industrial environments, primarily for monitoring and control, as well as for environment monitoring and, in the future,
environment control.
There have been various reasons for the emergence of networked embedded systems, influenced largely
by their application domains. The benefits of using distributed systems, and the evolutionary need to replace
point-to-point wiring in these systems with a single bus, are among the most important.
Advances in embedded system design, tool availability, and the falling fabrication costs of
semiconductor devices and systems have allowed intelligence to be infused into field devices such as
sensors and actuators. The controllers used with these devices typically provide on-chip signal conversion,
data processing, and communication functions. The increased functionality, processing, and communication
capabilities of controllers have been largely instrumental in the emergence of a widespread trend toward
networking of field devices around specialized networks, frequently referred to as field area networks.
Field area networks, or fieldbuses [1] (a fieldbus is, in general, a digital, two-way, multidrop
communication link), are networks connecting field devices such as sensors
and actuators with field controllers (for instance, Programmable Logic Controllers [PLCs] in industrial
automation, or Electronic Control Units [ECUs] in automotive applications), as well as man-machine
interfaces, for instance, dashboard displays in cars.
In general, the benefits of using these specialized networks are numerous, including increased flexibility
attained through the combination of embedded hardware and software, improved system performance, and
ease of system installation, upgrade, and maintenance. In automotive and aircraft applications,
for instance, they allow mechanical, hydraulic, and pneumatic systems to be replaced by mechatronic
systems, in which mechanical or hydraulic components are typically confined to the end-effectors, to
mention just two of their application areas.
Unlike Local Area Networks (LANs), due to the nature of the communication requirements imposed by
their applications, field area networks tend to have low data rates and small data packets, and
typically require real-time capabilities that mandate determinism of data transfer. However, data rates
above 10 Mbit/sec, typical of LANs, have already become commonplace in field area networks.
The specialized networks tend to support various communication media such as twisted-pair cables,
fiber optic channels, power line communication, radio frequency channels, and infrared connections.
Based on the physical media employed, they can, in general, be divided into three main
groups: wireline-based networks using media such as twisted-pair cables, fiber optic channels
(in hazardous environments like chemical and petrochemical plants), and power lines (in building
automation); wireless networks supporting radio frequency channels and infrared connections; and
hybrid networks composed of wireline and wireless segments.
Although the use of wireline-based field area networks is dominant, wireless technology offers a
range of incentives in a number of application areas. In industrial automation, for instance, wireless device
(sensor/actuator) networks can support the mobile operation required by mobile robots, and the
monitoring and control of equipment in hazardous and difficult-to-access environments. In a wireless
sensor/actuator network, stations may interact with each other on a peer-to-peer basis and with a base
station. The base station may have its transceiver attached to a cable of a (wireline) field area network,
giving rise to a hybrid wireless-wireline system [2]. A separate category is wireless sensor networks,
mainly envisaged for monitoring purposes, which are discussed in detail in this book.
The variety of application domains imposes different functional and nonfunctional requirements on the
operation of networked embedded systems. Most of them are required to operate in a reactive way; for
instance, systems used for control purposes. With that comes the requirement for real-time operation, in
which systems must respond within a predefined period of time, mandated by the dynamics of
the process under control. A response, in general, may be periodic, to control a specific physical quantity by
regulating dedicated end-effector(s); aperiodic, arising from unscheduled events such as the out-of-bounds
state of a physical parameter or any other kind of abnormal condition; or sporadic, with no period
but with a known minimum time between consecutive occurrences. Broadly speaking, systems that can
tolerate a delay in response are called soft real-time systems; in contrast, hard real-time systems require
deterministic responses to avoid changes in the system dynamics that could adversely affect the process
under control and, as a result, lead to economic losses or cause injury to human
operators. Representative examples of systems imposing hard real-time requirements on their operation
are fly-by-wire in aircraft control and steer-by-wire in automotive applications, to mention a few.
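These activation patterns can be made concrete with a small sketch. The following is an illustrative aside (not code from the handbook): a sporadic task is characterized by a minimum inter-arrival time that a runtime guard can enforce, while a periodic task is released at fixed multiples of its period.

```python
def accept_sporadic_arrivals(arrival_times, min_interarrival):
    """Filter event timestamps so that accepted activations of a
    sporadic task are separated by at least min_interarrival."""
    accepted, last = [], None
    for t in sorted(arrival_times):
        if last is None or t - last >= min_interarrival:
            accepted.append(t)
            last = t
    return accepted

def periodic_releases(period, horizon):
    """Release instants of a periodic task up to (and including) horizon."""
    return [k * period for k in range(int(horizon // period) + 1)]

# Events at t = 3 and t = 4 arrive too soon after t = 0 and are rejected:
print(accept_sporadic_arrivals([0, 3, 4, 10], min_interarrival=5))  # [0, 10]
print(periodic_releases(period=10, horizon=30))  # [0, 10, 20, 30]
```

The sporadic model matters precisely because the known minimum inter-arrival time is what makes worst-case analysis of otherwise unpredictable events possible.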
The need to guarantee a deterministic response mandates using appropriate scheduling schemes, which
are frequently implemented in application-domain-specific real-time operating systems or custom-designed
bare-bones real-time executives. Most of those issues (real-time scheduling and real-time operating
systems) are discussed in this book in a number of chapters.
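As an example of such a scheme, the classic Liu-Layland utilization bound for rate-monotonic scheduling gives a quick sufficient (but not necessary) test that a set of independent periodic tasks will always meet its deadlines. The sketch below is a generic illustration, not taken from the handbook:

```python
def rm_schedulable(tasks):
    """Sufficient rate-monotonic schedulability test.
    tasks: list of (worst_case_execution_time, period) pairs,
    with deadlines assumed equal to periods."""
    n = len(tasks)
    utilization = sum(c / t for c, t in tasks)
    # Liu-Layland bound: n * (2^(1/n) - 1), approaching ln 2 ~ 0.693
    return utilization <= n * (2 ** (1.0 / n) - 1)

print(rm_schedulable([(1, 4), (1, 8)]))  # True  (U = 0.375 <= ~0.828)
print(rm_schedulable([(3, 4), (2, 8)]))  # False (U = 1.0 exceeds the bound)
```

A task set that fails this test is not necessarily unschedulable; an exact verdict requires a response-time analysis of each task.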
The networked embedded systems used in safety-critical applications such as fly-by-wire and steer-by-
wire require a high level of dependability to ensure that a system failure does not lead to a state in which
human life, property, or the environment are endangered. The dependability issue is critical for technology
deployment; various solutions are discussed in this chapter in the context of automotive applications. One
of the main bottlenecks in the development of safety-critical systems is the software development process.
This issue is briefly discussed in this chapter in the context of the automotive application domain.
As opposed to applications mandating hard real-time operation, such as the majority of industrial
automation controls or safety-critical automotive control applications, building automation control
systems, for instance, seldom need hard real-time communication; their timing requirements are
much more relaxed. Building automation systems tend to have a hierarchical network structure and
typically implement all seven layers of the ISO/OSI reference model [3]. In the field area networks
employed in industrial automation, by contrast, there is little need for routing functionality and
end-to-end control. Therefore, typically only layers 1 (physical layer), 2 (data link layer, implicitly
including the medium access control layer), and 7 (application layer, which also covers the user layer) are used
in those networks.
This diversity of requirements imposed by different application domains (soft/hard real-time, safety
criticality, network topology, etc.) has necessitated different solutions, using different protocols based on
different operation principles. This has resulted in a plethora of networks developed for different application
domains. Some of those networks are overviewed in one of the subsequent sections.
With the growing trend toward networking of embedded systems and their internetworking with LANs,
Wide Area Networks (WANs), and the Internet (for instance, there is a growing demand for remote access to
process data on the factory floor), many of those systems may become exposed to potential security attacks,
which may compromise their integrity and cause damage as a result. The limited resources of embedded
nodes pose a considerable challenge for the implementation of effective security policies, which, in general,
are resource demanding. These restrictions necessitate the deployment of lightweight security mechanisms.
Vendor-tailored versions of standard security protocol suites, such as Secure Sockets Layer (SSL) and
IP Security Protocol (IPSec), may still not be suitable due to their excessive demand for resources. Potential
security solutions for these kinds of systems depend heavily on the specific device or system protected, the
application domain, and the extent of internetworking and its architecture. (The details of potential security
measures are presented in this book in two separate chapters.)
1.2 Design Methods for Networked Embedded Systems
Design methods for networked embedded systems fall into the general category of system-level design.
They involve two separate aspects, which will be discussed briefly. The first aspect is network architecture
design, in which communication protocols, interfaces, drivers, and computation nodes are selected and
assembled. The second aspect is system-on-chip design, in which the best hardware/software partition
is selected and an existing platform is customized, or a new chip is created, for the implementation
of a computation or communication node. Both aspects share several similarities, but so far they have
generally been addressed using ad hoc methodologies and tools, since attempts to create a unified electronic
system-level design methodology have so far failed.
When one considers the complete networked system, including several digital and analog parts, many
more trade-offs can be made at the global level. However, it also means that the interaction between the
digital portion of the design activity and the rest is much more complicated, especially in terms of the tools,
formats, and standards with which one must interoperate and interface.
In the case of network architecture design, tools such as OpNet and NS are used to identify communication
bottlenecks, investigate the effect of parameters such as the channel bit error rate, and analyze the
impact of the choice of coding, medium access, and error correction mechanisms on the overall system performance.
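A flavor of the kind of calculation such tools automate: with an independent bit error rate p, the probability that an n-bit frame arrives intact is (1 - p)^n, which already shows why long frames suffer disproportionately on noisy channels. The sketch below is an illustrative aside, not tied to any particular simulator:

```python
def frame_success_probability(bit_error_rate, frame_bits):
    """Probability a frame is delivered without any bit error,
    assuming independent (memoryless) bit errors."""
    return (1.0 - bit_error_rate) ** frame_bits

# At a bit error rate of 1e-3, a 1000-bit frame gets through only ~37%
# of the time, while a 100-bit frame survives ~90% of the time:
print(round(frame_success_probability(1e-3, 1000), 3))  # 0.368
print(round(frame_success_probability(1e-3, 100), 3))   # 0.905
```

Real channel models add correlated (bursty) errors, coding gain, and retransmissions, which is precisely what the simulators above are used to explore.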
For wireless networks, tools such as Matlab and Simulink are also used, in order to analyze the impact of
detailed channel models, thanks to their ability to model both digital and analog components, as well as
physical elements, at a high level of abstraction. In all cases, the analysis is essentially functional; that is, it
takes into account only in a very limited manner effects such as power consumption, computation time,
and cost. This is the main limitation that will need to be addressed in the future if one wants to model and
design, in an optimal manner, low-power networked embedded systems such as those envisioned
for wireless sensor network applications.
At the system-on-chip architecture level, the first decision to be made is whether to use a platform
instance or to design an Application-Specific Integrated Circuit (ASIC) from scratch. The first option builds
on the availability of large libraries of IP, in the form of processors, memories, and peripherals, from
major silicon vendors. These IP libraries are guaranteed to work together, and hence constitute what is
termed a platform. A platform is a set of components, together with usage rules that ensure their correct
and seamless interoperation. Platforms are used to speed up time-to-market by ensuring rapid implementation
of complex architectures. Processors (and the software executing on them) provide the flexibility to adapt to
different applications and customizations (e.g., localization and adherence to regional standards), while
hardware IPs provide efficient implementations of commonly used functions. Configurable processors can
be adapted to the requirements of specific applications and, via instruction extensions, offer considerable
performance and power advantages over fixed instruction-set architectures.
Thus, a platform is a single abstract model that hides the details of a set of different possible implementations
as clusters of lower-level components. The platform, for example, a family of microprocessors,
peripherals, and bus protocols, allows developers of application designs to operate without detailed knowledge
of the implementation (e.g., the pipelining of the processor or the internal implementation of the
UART). At the same time, it allows platform implementors to share design and fabrication costs among a
broad range of potential users, broader than if each design were a one-of-a-kind type.
Design methods that exploit the notion of platform generally start from a functional specification, which is then mapped onto an architecture (a platform instance) in order to derive performance information and explore the design space. Full exploitation of the notion of platform results in better reuse, by decoupling independent aspects that would otherwise tie, for example, a given functional specification to low-level implementation details. The guiding principle of separation of concerns distinguishes between:
1. Computation and communication. This separation is important because refinement of computation is generally done by hand, or by compilation and scheduling, while communication makes use of patterns.
2. Application and platform implementation, because they are often defined and designed independently by different groups or companies.
3. Behavior and performance, which should be kept separate because performance information can represent either nonfunctional requirements (e.g., the maximum response time of an embedded controller) or the result of an implementation choice (e.g., the worst-case execution time of a task). Nonfunctional constraint verification can be performed traditionally, by simulation and prototyping, or with static formal checks, such as schedulability analysis.
Tool support for system-on-chip architectural design is, so far, mostly limited to simulation and interface generation. The first category includes tools such as NC-SystemC from Cadence, ConvergenSC from CoWare, and SystemStudio from Synopsys. Simulators at the system-on-chip level provide abstractions for the main architectural components (processors, memories, busses, and hardware blocks) and permit quick instantiation of complete platform instances from template skeletons. Interface synthesis can take various forms, from the automated instantiation of templates offered by N2C from CoWare, to the automated generation of consistent files for software and hardware offered by Beach Solutions.
A key aspect of design problems in this space is compatibility with respect to specifications at the interface level (bus and networking standards), the instruction-set architecture level, and the Application Programming Interface (API) level. Assertion-based verification techniques can be used to ease the problem of verifying compliance with a digital protocol standard (e.g., for a bus).
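As an illustration of the assertion style, the following C sketch monitors a hypothetical two-signal request/acknowledge bus handshake at runtime; the monitor, the signal names, and the two rules are invented for illustration and do not correspond to any particular bus standard.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical handshake rules: (1) an acknowledge is only legal while a
 * request is outstanding; (2) a request must not be withdrawn before it
 * has been acknowledged. One call per sampled bus cycle. */
typedef struct {
    bool req_pending;   /* a request has been issued and not yet acked */
} HandshakeMonitor;

void monitor_init(HandshakeMonitor *m) { m->req_pending = false; }

void monitor_step(HandshakeMonitor *m, bool req, bool ack)
{
    /* Rule 1: ack without an outstanding request is a violation. */
    if (ack)
        assert(m->req_pending && "protocol violation: ack without request");

    if (!m->req_pending && req && !ack)
        m->req_pending = true;           /* new request issued */
    else if (m->req_pending && ack)
        m->req_pending = false;          /* transaction completed */
    else if (m->req_pending && !req)
        assert(!"protocol violation: request dropped before ack"); /* Rule 2 */
    /* Simplification: req held high after an ack counts as a new request. */
}
```

In a simulation flow, a monitor like this would be stepped alongside the bus model; a failed assertion pinpoints the first cycle at which the trace deviates from the protocol.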
Let us consider an example of a design flow in the automotive domain, which can be regarded as a paradigm for any networked embedded system. Automotive electronic design starts, usually 5 to 10 years before the actual introduction of a product, when a car manufacturer defines the specifications for its future line of vehicles.
It is now accepted practice to use the notion of platform in this domain as well, so that the electronic portion (as well as the mechanical one, which is outside the scope of this discussion) is modularized and componentized, enabling sharing across different models. An ECU generally includes a microcontroller (8, 16, or 32 bit), memory (SRAM, DRAM, and Flash), some ASIC or FPGA for interfacing, one or more in-vehicle network interfaces (e.g., CAN [Controller Area Network] or FlexRay), and several sensor and actuator interfaces (analog/digital and digital/analog converters, pulse-width modulators, power transistors, display drivers, and so on).
The system-level design activity is performed by a relatively small team of architects, who know the domain well (mechanics, electronics, and business), define the specifications for the electronic component suppliers, and interface with the teams that specify the mechanical portions (body and engine). These teams essentially rely on past experience to perform their job, and currently have serious problems forecasting the state of electronics ten years in advance.
Control algorithms are defined in the next design phase, when the first engine models (generally described using Simulink, Matlab, and StateFlow) become available as a specification for both the electronic design and the engine design. An important aspect of the overall flow is that these models are not frozen until much later, and hence both algorithm design and (often) ECU software design must cope with their changes. Another characteristic is that they are parametric models, sometimes reused across multiple engine generations and classes, whose exact parameter values will be determined only when prototypes or actual products become available. Thus, control algorithms must consider both the allowable ranges and combinations of values for these parameters, and the capability to measure their values, directly or indirectly, from the behavior of the engine and vehicle. Finally, algorithms are often distributed over a network of cooperating ECUs, thus deadlines and constraints generally span a number of electronic modules.
While control design progresses, ECU hardware design can start, because rough computational and memory requirements, as well as interfacing standards, sensors, and actuators, are already known. At the end of both control design and hardware design, software implementation can start. As mentioned earlier, most of the software running on modern ECUs is automatically generated (model-based design).
In the hardware implementation phase, the electronic subsystem supplier can use off-the-shelf components (such as memories), Application-Specific Standard Products (ASSPs) (such as microcontrollers and standard bus interfaces), and even ASICs and FPGAs (typically for sensor and actuator signal conditioning and conversion).
The final phase, called system integration, is generally performed by the car manufacturer again. It can be an extremely lengthy and expensive phase, because it requires the use of expensive detailed models of the controlled system (e.g., the engine, modeled with DSP-based multiprocessors) or even of actual car prototypes. The goal of integration is to ensure smooth subsystem communication (e.g., checking that there are no duplicate module identifiers and that there is enough bandwidth on every in-vehicle bus). Simulation support in this domain is provided by companies such as Vast and Axys (now part of ARM), who sell both fast instruction-set simulators for the most commonly used processors in the networked embedded system domain, and network simulation models exploiting either proprietary simulation engines, for example, in the case of Virtio, or standard simulators (HDL [Hardware Description Language] or SystemC).
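The two integration checks just mentioned — unique module identifiers and sufficient bus bandwidth — can be sketched in a few lines of C. The message description, field names, and the numbers used below are illustrative assumptions, not taken from any real bus database.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical description of one periodic message on an in-vehicle bus. */
typedef struct {
    unsigned id;        /* message identifier (must be unique on the bus) */
    unsigned bits;      /* frame size in bits, including overhead */
    double   period_s;  /* transmission period in seconds */
} BusMessage;

/* Integration check 1: no two messages share an identifier. */
bool ids_unique(const BusMessage *msgs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (msgs[i].id == msgs[j].id)
                return false;
    return true;
}

/* Integration check 2: fraction of the bus bandwidth the set consumes. */
double bus_utilization(const BusMessage *msgs, size_t n, double bitrate)
{
    double u = 0.0;
    for (size_t i = 0; i < n; i++)
        u += (double)msgs[i].bits / msgs[i].period_s / bitrate;
    return u;
}
```

A real integration flow would of course add schedulability analysis on top of the raw utilization figure; this sketch only captures the two sanity checks named in the text.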
1.3 Networked Embedded Systems
1.3.1 Networked Embedded Systems in Industrial Automation
Although the origins of field area networks can be traced back as far as the end of the 1960s in the nuclear instrumentation domain (the CAMAC network [4]) and the beginning of the 1970s in avionics and aerospace
applications (the MIL-STD-1553 bus [5]), it was the industrial automation area that brought the main thrust of developments. The need for integration of heterogeneous systems, difficult at that time due to the lack of standards, resulted in two major initiatives which have had a lasting impact on the integration concepts and protocol stack architecture of field area networks. These initiatives were the TOP (Technical and Office Protocol) [6] and MAP (Manufacturing Automation Protocol) [7] projects. The two projects exposed some pitfalls of full seven-layer stack implementations (ISO/OSI model [3]). As a result, typically only layer 1 (physical layer), layer 2 (data link layer, implicitly including the medium access control layer), and layer 7 (application layer, which also covers the user layer) are used in field area networks [8]; this is also prescribed by the international fieldbus standard, IEC 61158 [9]. In IEC 61158, the functions of layers 3 and 4 are recommended to be placed in either layer 2 or layer 7: the network and transport layers are not required in a single-segment network typical of process and industrial automation (the situation is different in building automation, for instance, where routing functionality and end-to-end control may be needed owing to a hierarchical network structure); the functions of layers 5 and 6 are always covered in layer 7.
The evolution of fieldbus technology, which began well over two decades ago, has resulted in a multitude of solutions reflecting the competing commercial interests of their developers and standardization bodies, both national and international: IEC [10], ISO [11], ISA [12], CENELEC [13], and CEN [14]. This is also reflected in IEC 61158 (adopted in 2000), which accommodates all national standards and user-organization-championed fieldbus systems. Subsequently, implementation guidelines were compiled into communication profiles, IEC 61784-1 [15]. Those communication profiles identify seven main systems (or communication profile families), known by brand names: Foundation Fieldbus (H1, HSE, H2), used in process and factory automation; ControlNet and EtherNet/IP, both used in factory automation; PROFIBUS (DP, PA), used in factory and process automation, respectively; PROFInet, used in factory automation; P-Net (RS 485, RS 232), used in factory automation and shipbuilding; WorldFIP, used in factory automation; INTERBUS, INTERBUS TCP/IP, and INTERBUS Subset, used in factory automation; and Swiftnet transport and Swiftnet full stack, used by aircraft manufacturers. The listed application areas are the dominant ones.
Ethernet, the backbone technology for office networks, is increasingly being adopted for communication in factories and plants at the fieldbus level. The random, native CSMA/CD arbitration mechanism is being replaced by other solutions allowing for the deterministic behavior required in real-time communication to support soft and hard real-time deadlines, time synchronization of activities (required to control drives, for instance), and the exchange of small data records characteristic of monitoring and control actions. The emerging Real-Time Ethernet (RTE), Ethernet augmented with real-time extensions and under standardization by the IEC/SC65C committee, is a fieldbus technology which incorporates Ethernet for the lower two layers of the OSI model. There are already a number of implementations, which use one of three different approaches to meet real-time requirements. The first approach retains the TCP/UDP/IP protocol suite unchanged (subject to nondeterministic delays); all real-time modifications
are enforced in the top layer. Implementations in this category include Modbus/TCP [16] (defined by Schneider Electric and supported by Modbus-IDA [17]), EtherNet/IP [18] (defined by Rockwell and supported by the Open DeviceNet Vendor Association (ODVA) [19] and ControlNet International [20]), P-Net (on IP) [21] (proposed by the Danish P-Net national committee), and Vnet/IP [22] (developed by Yokogawa, Japan). In the second approach, the TCP/UDP/IP protocol suite is bypassed and the Ethernet functionality is accessed directly: RTE protocols use their own protocol stack in addition to the standard IP protocol stack. The implementations in this category include Ethernet Powerlink (EPL) [23] (defined by Bernecker + Rainer [B&R], and now supported by the Ethernet Powerlink Standardisation Group [24]), TCnet (Time-critical Control Network) [25] (a proposal from Toshiba), EPA (Ethernet for Plant Automation) [26] (a Chinese proposal), and PROFINET CBA (Component-Based Automation) [27] (defined by several manufacturers including Siemens, and supported by PROFIBUS International [28]). Finally, in the third approach, the Ethernet mechanism and infrastructure themselves are modified. The implementations include SERCOS III [29] (under development by SERCOS), EtherCAT [30] (defined by Beckhoff and supported by the EtherCAT Technology Group [31]), and PROFINET IO [32] (defined by several manufacturers including Siemens, and supported by PROFIBUS International).
The use of standard components such as protocol stacks, Ethernet controllers, bridges, etc., helps mitigate ownership and maintenance costs. The direct support for Internet technologies allows for vertical integration of the various levels of the industrial enterprise hierarchy, including seamless integration between the automation and business logistic levels to exchange jobs and production (process) data; transparent data interfaces for all stages of the plant life cycle; Internet- and web-enabled remote diagnostics and maintenance; and electronic orders and transactions. In the case of industrial automation, the advent and use of networking has allowed for horizontal and vertical integration of industrial enterprises.
1.3.2 Networked Embedded Systems in Building Automation
Another fast-growing application area for networked embedded systems is building automation [33]. Building automation systems aim at the control of the internal environment, as well as the immediate external environment, of a building or building complex. At present, the focus of research and technology development is on commercial buildings (office buildings, exhibition centers, shopping complexes, etc.). In the future, this will also include industrial buildings, which pose substantial challenges to the development of effective monitoring and control solutions. The main services offered by building automation systems typically include: climate control, including heating, ventilation, and air conditioning; visual comfort, covering artificial lighting and control of daylight; safety services such as fire alarm and emergency sound systems; security protection; control of utilities such as power, gas, and water supply; and internal transportation systems such as lifts and escalators.
In terms of the quality-of-service requirements imposed on the field area networks, building automation systems differ considerably from their counterparts in industrial automation. There is seldom a need for hard real-time communication; the timing requirements are much more relaxed. Traffic volume in normal operation is low. Typical traffic is event driven, and mostly uses a peer-to-peer communication paradigm. Fault tolerance and network management are important aspects. As with industrial fieldbus systems, there are a number of bodies involved in the standardization of technologies for building automation, including the field area networks.
The communication architecture supporting automation systems embedded in buildings typically has three levels: the field, control, and management levels. The field level involves the operation of elements such as switches, motors, lighting cells, dry cells, etc. Peer-to-peer communication is perhaps most evident at that level; toggling a switch should activate one or more lighting cells, for instance. The control level is typically used to evaluate new control strategies for the lower level in response to changes in the environment: a reduction in daylight intensity, an external temperature change, etc. LonWorks [34], BACnet [35], and EIB/KNX [36-39] are open system networks which can be used at more than one level of the communication architecture. A roundup of LonWorks is provided in the following, as a representative example of the specialized field area networks used in building automation.
LonWorks (EIA-709), a trademark of Echelon Corp. [40], employs the LonTalk protocol, which implements all seven layers of the ISO/OSI reference model. The LonTalk protocol was published as a formal standard [41], and revised in 2002 [42].
In EIA-709, layer 2 supports various communication media such as twisted-pair cables (78 Kbit/sec [EIA-709.3] or 1.25 Mbit/sec), power line communication (4 Kbit/sec, EIA-709.2), radio frequency channels, infrared connections, and fiber optic channels (1.25 Mbit/sec), as well as IP connections based on the EIA-852 protocol standard [43] in order to tunnel EIA-709 data packets through IP (intranet, Internet) networks. A p-persistent CSMA bus arbitration scheme is used on twisted-pair cables. For other communication media, the EIA-709 protocol stack uses the arbitration scheme defined for that particular medium.
The EIA-709 layer 3 supports a variety of different addressing schemes and advanced routing capabilities. The entire routable address space of a LonTalk network is referred to as the domain (Figure 1.1). A domain is restricted to 255 subnets; a subnet allows for up to 127 nodes. The total number of addressable nodes in a domain can thus reach 32,385; up to 2^48 domains can be addressed. Domain gateways can be built between logical domains in order to allow for communication across domain boundaries. Groups can be formed in order to send a single data packet to a group of nodes using a multicast addressed message.
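The addressing arithmetic above (255 subnets of up to 127 nodes each) can be checked with a few lines of C. The packed 15-bit subnet/node layout used here is an illustrative assumption for working with logical addresses in software, not the LonTalk wire format.

```c
#include <assert.h>
#include <stdint.h>

/* Address-space limits as described for EIA-709 (LonTalk). */
enum {
    EIA709_MAX_SUBNETS_PER_DOMAIN = 255,
    EIA709_MAX_NODES_PER_SUBNET   = 127
};

/* Maximum number of addressable nodes in a single domain: 255 * 127. */
static inline uint32_t eia709_max_nodes_per_domain(void)
{
    return (uint32_t)EIA709_MAX_SUBNETS_PER_DOMAIN
         * (uint32_t)EIA709_MAX_NODES_PER_SUBNET;
}

/* Illustrative packing of a logical subnet/node address into 15 bits:
 * 8-bit subnet (1..255) and 7-bit node (1..127); 0 is treated as
 * reserved here. NOT the on-the-wire encoding. */
static inline uint16_t eia709_pack_addr(uint8_t subnet, uint8_t node)
{
    assert(subnet >= 1 && node >= 1 && node <= EIA709_MAX_NODES_PER_SUBNET);
    return (uint16_t)(((uint16_t)subnet << 7) | node);
}
```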
FIGURE 1.1 Addressing elements in EIA-709 networks: domains contain subnets, subnets contain nodes, and routers and domain gateways connect them. (From D. Loy, Fundamentals of LonWorks/EIA-709 networks: ANSI/EIA-709 protocol standard (LonTalk). In The Industrial Communication Technology Handbook, Zurawski, R. (Ed.), CRC Press, Boca Raton, FL, 2005. With permission.)
Routing is performed between different subnets only. An EIA-709 node can send a unicast addressed message to exactly one node using either the unique 48-bit node identification (Node ID) address or the logical subnet/node address. A multicast addressed message can be sent to a group of nodes (group address), to all nodes in a subnet, or to all nodes in the entire domain (broadcast address).
The EIA-709 layer 4 supports four types of services. The unacknowledged service transmits the data packet from the sender to the receiver. The unacknowledged repeated service transmits the same data packet a number of times; the number of repeats is programmable. The acknowledged service transmits the data packet and waits for an acknowledgment from the receiver; if the acknowledgment is not received by the transmitter, the same data packet is sent again, with a programmable number of retries. The request/response service sends a request message to the receiver; the receiver must respond with a response message, for instance, with statistics information. There is a provision for authentication of acknowledged transmissions, although it is not very efficient.
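The retry behavior of the acknowledged service can be sketched in plain C. The hook types (`send_fn`, `wait_ack_fn`) and the demo stubs stand in for a real transport layer and are hypothetical; the only point illustrated is the programmable-retry loop described above.

```c
#include <stdbool.h>

/* Hypothetical transport hooks; a real protocol stack supplies these. */
typedef bool (*send_fn)(const void *pkt, unsigned len);
typedef bool (*wait_ack_fn)(unsigned timeout_ms);

/* Acknowledged service sketch: send, wait for the acknowledgment, and
 * retransmit up to a programmable number of retries. */
bool acked_send(send_fn send, wait_ack_fn wait_ack,
                const void *pkt, unsigned len,
                unsigned retries, unsigned timeout_ms)
{
    for (unsigned attempt = 0; attempt <= retries; attempt++) {
        if (send(pkt, len) && wait_ack(timeout_ms))
            return true;        /* acknowledgment received */
    }
    return false;               /* all attempts exhausted */
}

/* Demo stubs: transmission always succeeds, the ack arrives on the
 * third attempt. Purely for illustration. */
static unsigned demo_attempts;
static bool demo_send(const void *p, unsigned l)
{
    (void)p; (void)l;
    demo_attempts++;
    return true;
}
static bool demo_ack(unsigned timeout_ms)
{
    (void)timeout_ms;
    return demo_attempts >= 3;
}
```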
Network nodes (which typically include a Neuron chip, RAM/Flash, a power source, a clock, a network transceiver, and an input/output interface connecting to sensors and actuators) can be based on Echelon's Neuron chip series, manufactured by Motorola, Toshiba, and Cypress; recently they can also be based on other platform-independent implementations such as the LoyTec LC3020 controller. Neuron chip-based controllers are programmed with Echelon's Neuron C language, which is a derivative of ANSI C. Other controllers such as the LC3020 are programmed with standard ANSI C. The basic element of Neuron C is the Network Variable (NV), which can be propagated over the network. For instance, the SNVT_temp variable represents temperature in degrees Celsius; SNVT stands for Standard Network Variable Type. Network nodes communicate with each other by exchanging NVs. Another way to communicate between nodes is by using explicit messages. Neuron C programs are used to schedule application events and to react to incoming data packets (receiving NVs) from the network interface. Depending on the network media and the network transceivers, a variety of network topologies are possible with LonWorks nodes, including bus, ring, star, and free topology.
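The propagate-on-change behavior of a network variable can be modeled in plain C (deliberately not Neuron C, which has dedicated keywords for this). The `propagate_fn` hook and the fixed-point temperature scaling are illustrative assumptions, not the SNVT_temp definition.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical send hook: called whenever the local NV value changes,
 * standing in for the stack's NV propagation over the network. */
typedef void (*propagate_fn)(short value);

typedef struct {
    short value;             /* e.g., a temperature in some fixed-point scale */
    propagate_fn propagate;  /* invoked on every change */
} NetworkVariable;

/* Update the NV; propagate only if the value actually changed.
 * Returns true when a propagation was triggered. */
bool nv_update(NetworkVariable *nv, short new_value)
{
    if (nv->value == new_value)
        return false;        /* unchanged value is not re-sent */
    nv->value = new_value;
    if (nv->propagate)
        nv->propagate(new_value);
    return true;
}

/* Demo hook recording the last propagated value, for illustration. */
static short demo_last;
static unsigned demo_count;
static void demo_propagate(short v) { demo_last = v; demo_count++; }
```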
As interoperability on all seven OSI layers does not guarantee interworkable products, the LonMark organization [44] has published interoperability guidelines for nodes that use the LonTalk protocol. A number of task groups within LonMark define functional profiles (subsets of all the possible protocol features) for analog input, analog output, temperature sensors, etc. The task groups focus on various types of applications such as home/utility, HVAC, lighting, etc.
LonBuilder and NodeBuilder are development and integration tools offered by Echelon. Both tools allow one to write Neuron C programs, compile and link them, and download the final application into the target node hardware. NodeBuilder supports debugging of one node at a time. LonBuilder, which supports simultaneous debugging of multiple nodes, has a built-in protocol analyzer and a network binder to create communication relationships between network nodes. Echelon's LNS (network operating system) provides tools that allow one to install, monitor, control, manage, and maintain control devices, and to transparently perform these services over any IP-based network, including the Internet.
1.3.3 Automotive Networked Embedded Systems
Similar trends appear in automotive electronic systems, where the ECUs are networked by means of one of the automotive-specific communication protocols for the purpose of controlling one of the vehicle functions: for instance, electronic engine control, antilock braking, active suspension, and telematics, to mention a few. In Reference 45, a number of functional domains have been identified for the deployment of automotive networked embedded systems. They include the powertrain domain, involving, in general, control of the engine and transmission; the chassis domain, involving control of suspension, steering, braking, etc.; the body domain, involving control of wipers, lights, doors, windows, seats, mirrors, etc.; the telematics domain, involving mostly the integration of wireless communications, vehicle monitoring systems, and vehicle location systems; and the multimedia and Human-Machine Interface (HMI) domains. The different domains impose varying constraints on the networked embedded systems in terms of performance, safety requirements, and Quality of Service (QoS). For instance, the powertrain and chassis domains mandate real-time control; typically, bounded delay is required, as well as fault-tolerant services.
There are a number of reasons for the interest of the automotive industry in adopting mechatronic solutions, known by the generic name x-by-wire, which aim to replace mechanical, hydraulic, and pneumatic systems with electrical/electronic systems. The main factors seem to be economic in nature, the improved reliability of components, and the increased functionality that can be achieved with a combination of embedded hardware and software. Steer-by-wire, brake-by-wire, and throttle-by-wire systems are representative examples. It seems, however, that certain safety-critical systems such as steer-by-wire and brake-by-wire will be complemented with traditional mechanical/hydraulic backups, for safety reasons.
The dependability of x-by-wire systems is one of the main requirements for, as well as constraints on, the adoption of systems of this kind. In this context, a safety-critical x-by-wire system has to ensure that a system failure does not lead to a state in which human life, property, or the environment are endangered, and that a single failure of one component does not lead to a failure of the whole x-by-wire system [46]. On the Safety Integrity Level (SIL) scale, x-by-wire systems are required to keep the probability of a failure of a safety-critical system below 10^-9 per hour per system. This figure corresponds to the SIL4 level. Another equally important requirement for x-by-wire systems is to observe the hard real-time constraints imposed by the system dynamics; the end-to-end response times must be bounded for safety-critical systems. A violation of this requirement may lead to performance degradation of the control system, and other consequences as a result.
instance, system(s) to control seats, door locks, internal lights, etc., are not. Different performance, safety,
and QoS requirements dictated by various in-car application domains necessitate adoption of different
solutions, which, in turn, gave rise to a signicant number of communication protocols for automotive
applications. Time-triggered protocols based on TDMA (Time Division Multiple Access) medium access
control technology are particularly well suitedfor the safety-critical solutions, as they provide deterministic
access to the medium. In this category, there are two protocols, which, in principle, meet the requirements
of x-by-wire applications, namely TTP/C [47] and FlexRay [48] (FlexRay can support a combination of both time-triggered and event-triggered transmissions). The following discussion focuses mostly on TTP/C and FlexRay.
TTP/C (the Time-Triggered Protocol) is a fault-tolerant time-triggered protocol, one of the two protocols in the Time-Triggered Architecture (TTA) [49]; the other is the low-cost fieldbus protocol TTP/A [50]. In TTA, the nodes are connected by two replicated communication channels, forming a cluster. A TTA network may have two different interconnection topologies, namely bus and star. In the bus configuration, each node is connected to two replicated passive buses via bus guardians. The bus guardians are independent units that prevent the associated nodes from transmitting outside predetermined time slots by blocking the transmission path; a good example is the case of a controller with a faulty clock oscillator which attempts to transmit continuously. In the star topology, the guardians are integrated into two replicated central star couplers. The guardians are required to be equipped with their own clocks, a distributed clock synchronization mechanism, and their own power supply. In addition, they should be located at a distance from the protected node to increase immunity to spatial proximity faults. To cope with internal physical faults, TTA partitions nodes into so-called Fault-Tolerant Units (FTUs), each of which is a collection of several stations performing the same computational functions. As each node is (statically) allocated a transmission slot in a TDMA round, the failure of any node, or the corruption of a frame, will not cause degradation of the service. In addition, data redundancy allows the correct data value to be ascertained by a voting process.
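A 2-out-of-3 majority voter of the kind such a voting process relies on can be sketched in a few lines of C. The triplex arrangement is an assumption chosen for illustration; TTP/C does not prescribe this exact function or FTU size.

```c
#include <stdbool.h>

/* 2-out-of-3 majority vote over replica values from an FTU: a single
 * faulty value is masked; with no majority, the fault hypothesis is
 * considered violated and the caller must handle it. */
bool vote3(int a, int b, int c, int *out)
{
    if (a == b || a == c) { *out = a; return true; }
    if (b == c)           { *out = b; return true; }
    return false;   /* no two replicas agree */
}
```

In a replicated design, the voted value would feed the application while the disagreeing replica is flagged for the membership/diagnosis machinery.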
TTP/C employs a synchronous TDMA medium access control scheme on replicated channels, which ensures fault-tolerant transmission with known delay and bounded jitter between the nodes of a cluster. The use of replicated channels and redundant transmission allows for the masking of a temporary fault on one of the channels. The payload section of the message frame contains up to 240 bytes of data protected by a 24-bit CRC checksum. In TTP/C, communication is organized into rounds. In a round, different slot sizes may be allocated to different stations; however, slots belonging to the same station are of the same size in successive rounds. Every node must send a message in every round. Another feature of TTP/C is fault-tolerant clock synchronization, which establishes a global time base without the need for a central time provider. Each node in the cluster contains the message schedule. Based on that information, a node computes the difference between the predetermined and actual arrival times of a correct message. These differences are averaged by a fault-tolerant algorithm, which allows for the adjustment of the local clock to keep it in synchrony with the clocks of the other nodes in the cluster. TTP/C provides a so-called membership service to inform every node about the state of every other node in the cluster; it is also used to implement the fault-tolerant clock synchronization mechanism. This service is based on a distributed agreement mechanism which identifies nodes with failed links. A node with a transmission fault is excluded from the membership until restarted with a proper state of the protocol. Another important feature of TTP/C is a clique avoidance algorithm to detect and eliminate the formation of cliques in case the fault hypothesis is violated. In general, fault-tolerant operation based on FTUs cannot be maintained if the fault hypothesis is violated. In such a situation, TTA activates a Never-Give-Up (NGU) strategy [46]. The NGU strategy, specific to the application, is initiated by TTP/C in combination with the application, with the aim of continuing operation in a degraded mode.
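The flavor of fault-tolerant averaging used for clock synchronization can be illustrated in C by discarding the single largest and smallest measured deviations (which may originate from faulty clocks) before averaging the rest. This is a sketch of the general technique, not TTP/C's exact algorithm or parameters.

```c
#include <stddef.h>

/* Fault-tolerant average over measured clock deviations: drop the one
 * extreme value at each end, average the remainder. The result is the
 * correction to apply to the local clock. */
double ft_average(const double *dev, size_t n)
{
    if (n <= 2)
        return 0.0;                      /* too few values to filter */
    double lo = dev[0], hi = dev[0], sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (dev[i] < lo) lo = dev[i];
        if (dev[i] > hi) hi = dev[i];
        sum += dev[i];
    }
    return (sum - lo - hi) / (double)(n - 2);
}
```

The point of discarding the extremes is that a single faulty clock can push its deviation arbitrarily far without dragging the computed correction with it.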
The TTA infrastructure and the TTP/A and TTP/C protocols have a long history, dating back to 1979 when the Maintainable Architecture for Real-Time Systems (MARS) project started at the Technical University of Berlin. Subsequently, the work was carried out at the Vienna University of Technology. The TTP/C protocol has been experimented with and considered for deployment for quite some time. However, to date, there have been no actual implementations of that protocol involving safety-critical systems in commercial automobiles or trucks. In 1995, a proof of concept, organized jointly by the Vienna University of Technology and DaimlerChrysler, demonstrated a car equipped with a brake-by-wire system based on the time-triggered protocol. The TTA design methodology, which distinguishes between node design and architecture design, is supported by a comprehensive set of integrated tools from TTTech. A range of development and prototyping hardware is available from TTTech as well. Austriamicrosystems offers an automotive-certified TTP-C2 communication controller (AS8202NF).
FIGURE 1.2 FlexRay communication cycle: the network communication time (a static segment of static slots, an optional dynamic segment of mini-slots, and an optional symbol window) followed by the network idle time. (From D. Millinger and R. Nossal, FlexRay Communication Technology. In The Industrial Communication Technology Handbook, Zurawski, R. (Ed.), CRC Press, Boca Raton, FL, 2005. With permission.)
FlexRay, which appears to be the frontrunner for future automotive safety-critical control applications, employs a modified TDMA medium access control scheme on a single or replicated channel. The payload section of a frame contains up to 254 bytes of data protected by a 24-bit CRC checksum. To cope with transient faults, FlexRay also allows for redundant data transmission over the same channel(s) with a time delay between transmissions. The FlexRay communication cycle comprises a network communication time and a network idle time (Figure 1.2). Two or more communication cycles can form an application cycle. The network communication time is a sequence of a static segment, a dynamic segment, and a symbol window. The static segment uses a TDMA MAC protocol and comprises static slots of fixed duration. Unlike in TTP/C, the static allocation of slots to a node (communication controller) applies to one channel only; the same slot may be used by another node on the other channel. Also, a node may possess several slots in a static segment. The dynamic segment uses an FTDMA (Flexible Time Division Multiple Access) MAC protocol, which allows for a priority- and demand-driven access pattern.
The dynamic segment comprises of so-called mini-slots with each node allocated a certain number of
mini-slots, whichdo not have to be consecutive. The mini-slots are of a xedlength, andmuchshorter than
static slots. As the length of a mini-slot is not sufcient to accommodate a frame (a mini-slot only denes
a potential start time of a transmission in the dynamic segment), it has to be enlarged to accommodate
transmission of a frame. This in turn reduces the number of mini-slots in the reminder of the dynamic
segment. A mini-slot remains silent if there is nothing to transmit. The nodes allocated mini-slots toward
the end of the dynamic segment are less likely to get transmission time. This in turn enforces a priority
scheme. The symbol window is a time slot of xed duration used for network management purposes. The
networkidle time is a protocol specic time window, inwhichnotrafc is scheduledonthe communication
channel. It is used by the communication controllers for the clock synchronization activity; in principle,
similar to the one described for TTP/C. If the dynamic segment and idle window are optional, the idle
time, and minimal static segment are mandatory parts of a communication cycle; minimum two static
slots (degraded static segment), or four static slots for fault-tolerant clock synchronization are required.
With all that, FlexRay allows for three congurations: pure static; mixed, with both static and dynamic
bandwidth ratio depends on the application; and pure dynamic, where all bandwidth is allocated to the
dynamic communication.
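The mini-slot mechanism of the dynamic segment can be illustrated with a small simulation. This is a conceptual sketch only: the slot counts and frame lengths below are invented for illustration and are not FlexRay parameters, and the model deliberately ignores details such as channel replication and the exact counter semantics of the protocol.

```python
# Illustrative sketch of FTDMA arbitration in a FlexRay-style dynamic
# segment. Each requesting node owns one mini-slot position (its
# priority); a frame transmission consumes extra mini-slots, shrinking
# the remainder of the segment, so nodes late in the order may be
# squeezed out. All numbers are hypothetical.

def dynamic_segment(requests, total_minislots):
    """requests: {minislot_index: frame_length_in_minislots} for nodes
    that want to transmit. Returns the mini-slot indices that actually
    got to send a frame within the segment."""
    sent = []
    slot = 0          # current mini-slot index (the priority order)
    consumed = 0      # mini-slots of segment time used up so far
    while consumed < total_minislots:
        length = requests.get(slot, 0)
        if length and consumed + length <= total_minislots:
            sent.append(slot)
            consumed += length    # the frame enlarges this slot
        else:
            consumed += 1         # silent (or too-late) mini-slot
        slot += 1
    return sent

# Nodes 0 and 2 send long frames; node 5, late in the order, is
# squeezed out of the 8-mini-slot segment: prints [0, 2]
print(dynamic_segment({0: 4, 2: 3, 5: 2}, total_minislots=8))
```

With an empty segment ahead of it, the same node 5 would transmit, which is exactly the demand-driven, priority-ordered behavior described above.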
FlexRay supports a range of network topologies, offering a maximum of scalability and considerable flexibility in the arrangement of embedded electronic architectures in automotive applications. The supported configurations include bus, active star, active cascaded stars, and active stars with bus extension. FlexRay also uses bus guardians in the same way as TTP/C.
Existing FlexRay communication controllers support communication bit rates of up to 10 Mbit/sec on two channels. The transceiver component of the communication controller also provides a set
of automotive network-specific services. Two major services are alarm handling and wakeup control. In addition to the alarm information received in a frame, an ECU also receives the alarm symbol from the communication controller. This redundancy can be used to validate critical signals; for instance, an air bag fire command. The wakeup service is required where electronic components have a sleep mode to reduce power consumption.
FlexRay is a joint effort of a consortium involving some of the leading car makers and technology providers, among them BMW, Bosch, DaimlerChrysler, General Motors, Motorola, Philips, and Volkswagen, as well as Hyundai Kia Motors as a premium associate member with voting rights. DECOMSYS offers Designer Pro, a comprehensive set of tools to support the development process of FlexRay-based applications. The FlexRay protocol specification version 2.0 was released in 2004. Controllers are currently available from Freescale, and in the future from NEC. The latest controller version, MFR4200, implements the protocol specification versions 1.0 and 1.1. Austriamicrosystems offers a high-speed automotive bus transceiver for FlexRay (AS8221). A special physical layer for FlexRay is provided by Philips; it supports the topologies described above and a data rate of 10 Mbit/sec on one channel. Two versions of the bus driver will be available.
Time-Triggered Controller Area Network (TTCAN) [51], which can support a combination of both time-triggered and event-triggered transmissions, utilizes the physical and data-link layers of the CAN protocol. Since this protocol, as standardized, does not provide the necessary dependability services, it is unlikely to play any role in fault-tolerant communication in automotive applications.
The TTP/C and FlexRay protocols belong to class D networks in the classification published by the Society of Automotive Engineers [52, 53]. Although the classification dates back to 1994, it is still a reasonable guideline for distinguishing protocols based on data transmission speed and the functions distributed over the network. The classification comprises four classes. Class A includes networks with a data rate less than 10 Kbit/sec. Some representative protocols are Local Interconnect Network (LIN) [54] and TTP/A [50]. Class A networks are employed largely to implement body domain functions. Class B networks operate within the range of 10 Kbit/sec to 125 Kbit/sec. Some representative protocols are J1850 [55], low-speed CAN [56], and VAN (Vehicle Area Network) [57]. Class C networks operate within the range of 125 Kbit/sec to 1 Mbit/sec. Examples of networks in this class are high-speed CAN [58] and J1939 [59]. Networks in this class are used for the control of the powertrain and chassis domains. High-speed CAN, although used in the control of the powertrain and chassis domains, is not suitable for safety-critical applications as it lacks the necessary fault-tolerant services. Class D networks (not formally defined as yet) include networks with a data rate over 1 Mbit/sec. Networks to support x-by-wire solutions fall into this class, including TTP/C and FlexRay. Also, MOST (Media Oriented System Transport) [60] and IDB-1394 [61], both for multimedia applications, belong to this class.
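The speed-based classification above maps directly onto a small helper function. The treatment of the exact boundary rates (e.g., whether 125 Kbit/sec counts as class B or C) is an assumption here, as the text gives only the ranges:

```python
def sae_network_class(bit_rate):
    """Map a network data rate (bit/sec) to its SAE class, following
    the 1994 classification described in the text:
    A < 10 Kbit/s, B = 10-125 Kbit/s, C = 125 Kbit/s - 1 Mbit/s,
    D > 1 Mbit/s. Boundary handling is our assumption."""
    if bit_rate < 10_000:
        return "A"       # e.g., LIN, TTP/A (body domain)
    if bit_rate <= 125_000:
        return "B"       # e.g., J1850, low-speed CAN, VAN
    if bit_rate <= 1_000_000:
        return "C"       # e.g., high-speed CAN, J1939 (powertrain/chassis)
    return "D"           # e.g., TTP/C, FlexRay, MOST, IDB-1394

print(sae_network_class(20_000))      # prints B
print(sae_network_class(10_000_000))  # prints D (FlexRay at 10 Mbit/sec)
```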
The cooperative development process of networked embedded automotive applications brings with it heterogeneity of software and hardware components. Even with the inevitable standardization of those components, interfaces, and even complete system architectures, the support for reuse of hardware and software components is limited, potentially making the design of networked embedded automotive applications labor-intensive, error-prone, and expensive. This necessitates the development of component-based design integration methodologies. An interesting approach is based on platform-based design [62], discussed in this book with a view to automotive applications. Some industry standardization initiatives include: OSEK/VDX with its OSEKTime OS (OSEK/VDX Time-Triggered Operating Systems) [63]; OSEK/VDX Communication [64], which specifies a communication layer that defines common software interfaces and common behavior for internal and external communications among application processes; OSEK/VDX FTCom (Fault-Tolerant Communication) [65], a proposal for a software layer to provide services that facilitate the development of fault-tolerant applications on top of time-triggered networks; HIS (Herstellerinitiative Software) [66], with a broad range of goals including standardization of software modules, specification of process maturity levels, development of software tests, development of software tools, etc.; and ASAM (Association for Standardization of Automation and Measuring Systems) [67], which develops, amongst other projects, a standardized XML-based format for data exchange between tools from different vendors.
One of the main bottlenecks in the development of safety-critical systems is the software development process. The automotive industry clearly needs a software development process model and supporting tools suitable for the development of safety-critical software. At present, there are two potential candidates. MISRA (Motor Industry Software Reliability Association) [68] has published recommended practices for safe automotive software; these practices, although automotive-specific, do not support x-by-wire. IEC 61508 [69] is an international standard for electrical, electronic, and programmable electronic safety-related systems; it is not automotive-specific, but is broadly accepted in other industries.
1.3.4 Sensor Networks
Another trend in the networking of field devices has emerged recently: sensor networks, which are another example of networked embedded systems. Here, the embedding factor is not as evident as in other applications; this is particularly true for wireless and self-organizing networks, where the nodes may be embedded in an ecosystem or a battlefield, for example.
Although potential applications in the projected areas are still under discussion, wireless sensor/actuator networks are in the deployment stage in the manufacturing industry. The use of wireless links with field devices, such as sensors and actuators, allows for flexible installation and maintenance, supports the mobile operation required in the case of mobile robots, and alleviates problems with cabling. To operate effectively in the industrial/factory floor environment, a wireless communication system has to guarantee high reliability, low and predictable delay of data transfer (typically, less than 10 msec for real-time applications), support for a high number of sensors/actuators (over 100 in a cell of a few meters radius), and low power consumption, among other requirements. In industrial environments, the characteristic wireless channel degradation artifacts can be compounded by the presence of electric motors or a variety of equipment causing electric discharge, which contributes to even greater levels of bit errors and packet losses. The problem can be partially alleviated either by designing robust and loss-tolerant applications and control algorithms, or by trying to improve the channel quality; both are subjects of extensive research and development.
In a wireless sensor/actuator network, stations may interact with each other on a peer-to-peer basis, and with the base station. To leverage low cost, small size, and low power consumption, standard Bluetooth (IEEE 802.15.1) 2.4 GHz radio transceivers [70, 71] may be used as the sensor/actuator communication hardware. To meet the requirements for high reliability, low and predictable delay of data transfer, and support for a high number of sensors/actuators, custom optimized communication protocols may be required, as commercially available solutions such as IEEE 802.15.1, IEEE 802.15.4 [72], and the IEEE 802.11 [73-75] variants may not fulfill all the requirements. The base station may have its transceiver attached to a cable of a fieldbus, giving rise to a hybrid wireless-wireline fieldbus system [2].
A representative example of this kind of system is a wireless sensor/actuator network developed by ABB and deployed in a manufacturing environment [76]. The system, known as WISA (wireless sensor/actuator), has been implemented in a manufacturing cell to network proximity switches, which are among the most widely used position sensors in automated factories, controlling the positions of a variety of equipment, including robotic arms, for instance. The sensor/actuator communication hardware is based on a standard Bluetooth 2.4 GHz radio transceiver and low-power electronics that handle the wireless communication link. The sensors communicate with a wireless base station via antennas mounted in the cell. For the base station, a specialized RF front end was developed to provide collision-free air access by allocating a fixed TDMA time slot to each sensor/actuator. Frequency Hopping (FH) was employed to counter both frequency-selective fading and interference effects, and operates in combination with Automatic Retransmission Requests (ARQs). The parameters of this TDMA/FH scheme were chosen to satisfy the requirements of up to 120 sensors/actuators per base station. Each wireless node has a response or cycle time of 2 msec, to make full use of the available radio band of 80 MHz width. The FH sequences are cell-specific and were chosen to have low cross-correlations to permit parallel operation of many cells on the same factory floor with low self-interference. The base station can handle up to 120 wireless
sensors/actuators and is connected to the control system via a (wireline) fieldbus. To increase capacity, a number of base stations can operate in the same area. WISA provides a wireless power supply to the sensors, based on magnetic coupling [77].
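The TDMA dimensioning quoted above (up to 120 devices per base station, each with a 2 msec cycle) implies a per-device slot on the order of tens of microseconds. The arithmetic below is an inference from the quoted figures, not a value taken from the WISA documentation:

```python
# Back-of-the-envelope check of the WISA TDMA dimensioning quoted in
# the text: up to 120 sensors/actuators per base station, each with a
# 2 msec response/cycle time. The derived per-device slot length is an
# inference, not a specification value.

SENSORS_PER_BASE = 120
CYCLE_TIME_S = 2e-3            # 2 msec cycle time per wireless node

slot_s = CYCLE_TIME_S / SENSORS_PER_BASE
print(f"per-device TDMA slot: {slot_s * 1e6:.1f} usec")  # ~16.7 usec
```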
1.4 Concluding Remarks
This chapter has presented an overview of trends in the networking of embedded systems, their design, and selected application domain-specific network technologies. Networked embedded systems appear in a variety of application domains, including automotive, train, aircraft, office building, and industrial automation. With the exception of building automation, the systems discussed in this chapter tend to be confined to a relatively small covered area and a limited number of nodes, as in the case of an industrial process, an automobile, or a truck. In building automation controls, networked embedded systems may take on truly large proportions in terms of area covered and number of nodes. For instance, in a LonTalk network, the total number of addressable nodes in a domain can reach 32,385, and up to 2^48 domains can be addressed.
Wireless sensor/actuator networks, as well as wireless-wireline hybrid networks, have started evolving from concept to actual implementation, and are poised to have a major impact on industrial, home, and building automation, for a start.
Networked embedded systems pose a multitude of challenges in their design, particularly for safety-critical applications, deployment, and maintenance. The majority of development environments and tools for specific networking technologies do not have firm foundations in computer science or software engineering models and practices, making the development process labor-intensive, error-prone, and expensive.
References
[1] R. Zurawski (Ed.), The Industrial Communication Systems, Special Issue. Proceedings of the IEEE, 93, June 2005.
[2] J.-D. Decotignie, P. Dallemagne, and A. El-Hoiydi, Architectures for the Interconnection of Wireless and Wireline Fieldbusses. In Proceedings of the 4th IFAC Conference on Fieldbus Systems and Their Applications 2001 (FET 2001), Nancy, France, 2001.
[3] H. Zimmermann, OSI Reference Model: The ISO Model of Architecture for Open System Interconnection. IEEE Transactions on Communications, 28, 425-432, 1980.
[4] Costrell, CAMAC Instrumentation System: Introduction and General Description. IEEE Transactions on Nuclear Science, 18, 3-8, 1971.
[5] C.-A. Gifford, A Military Standard for Multiplex Data Bus. In Proceedings of the IEEE 1974 National Aerospace and Electronics Conference, May 13-15, 1974, Dayton, OH, USA, pp. 85-88.
[6] N. Collins, Boeing Architecture and TOP (Technical and Office Protocol). In Networking: A Large Organization Perspective, April 1986, Melbourne, FL, USA, pp. 49-54.
[7] H.A. Schutz, The Role of MAP in Factory Integration. IEEE Transactions on Industrial Electronics, 35, 6-12, 1988.
[8] P. Pleinevaux and J.-D. Decotignie, Time Critical Communication Networks: Field Buses. IEEE Network, 2, 55-63, 1988.
[9] International Electrotechnical Commission, Digital data communications for measurement and control - Fieldbus for use in industrial control systems, Part 1: Introduction. IEC 61158-1, IEC, 2003.
[10] International Electrotechnical Commission (IEC). www.iec.ch.
[11] International Organization for Standardization (ISO). www.iso.org.
[12] Instrumentation Society of America (ISA). www.isa.org.
[13] Comité Européen de Normalisation Électrotechnique (CENELEC). www.cenelec.org.
[14] European Committee for Standardization (CEN). www.cenorm.be.
[15] International Electrotechnical Commission, Digital data communications for measurement and control - Part 1: Profile sets for continuous and discrete manufacturing relative to fieldbus use in industrial control systems, IEC 61784-1, IEC, 2003.
[16] International Electrotechnical Commission, Real Time Ethernet: Modbus-RTPS, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/341/NP, 2004.
[17] www.modbus-ida.org.
[18] International Electrotechnical Commission, Real Time Ethernet: EtherNet/IP with Time Synchronization, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/361/NP, IEC, 2004.
[19] www.odva.org.
[20] www.controlnet.org.
[21] International Electrotechnical Commission, Real Time Ethernet: P-NET on IP, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/360/NP, IEC, 2004.
[22] International Electrotechnical Commission, Real Time Ethernet: Vnet/IP, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/352/NP, IEC, 2004.
[23] International Electrotechnical Commission, Real Time Ethernet: EPL (ETHERNET Powerlink), Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/356a/NP, IEC, 2004.
[24] www.ethernet-powerlink.org.
[25] International Electrotechnical Commission, Real Time Ethernet: TCnet (Time-Critical Control Network), Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/353/NP, IEC, 2004.
[26] International Electrotechnical Commission, Real Time Ethernet: EPA (Ethernet for Plant Automation), Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/357/NP, IEC, 2004.
[27] J. Feld, PROFINET: Scalable Factory Communication for all Applications. In Proceedings of the 2004 IEEE International Workshop on Factory Communication Systems, September 22-24, 2004, Vienna, Austria, pp. 33-38.
[28] www.profibus.org.
[29] International Electrotechnical Commission, Real Time Ethernet: SERCOS III, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/358/NP, IEC, 2004.
[30] International Electrotechnical Commission, Real Time Ethernet: Control Automation Technology (ETHERCAT), Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/355/NP, IEC, 2004.
[31] www.ethercat.org.
[32] International Electrotechnical Commission, Real-Time Ethernet: PROFINET IO, Proposal for a Publicly Available Specification for Real-Time Ethernet, document IEC 65C/359/NP, IEC, 2004.
[33] Deborah Snoonian, Smart Buildings. IEEE Spectrum, 40, 18-23, 2003.
[34] D. Loy, D. Dietrich, and H. Schweinzer, Open Control Networks, Kluwer, Dordrecht, 2004.
[35] Steven T. Bushby, BACnet: A Standard Communication Infrastructure for Intelligent Buildings. Automation in Construction, 6, 529-540, 1997.
[36] ENV 13154-2, Data Communication for HVAC Applications - Field Net - Part 2: Protocols, 1998.
[37] EIA/CEA 776.5, CEBus-EIB Router Communications Protocol - The EIB Communications Protocol, 1999.
[38] EN 50090-X, Home and Building Electronic Systems (HBES), 1994-2004.
[39] Konnex Association, Diegem, Belgium. KNX Specications, V. 1.1, 2004.
[40] www.echelon.com.
[41] Control Network Protocol Specication, ANSI/EIA/CEA-709.1-A, 1999.
[42] Control Network Protocol Specication, EIA/CEA Std. 709.1, Rev. B, 2002.
[43] Tunneling Component Network Protocols Over Internet Protocol Channels, ANSI/EIA/CEA 852,
2002.
[44] www.lonmark.org.
[45] F. Simonot-Lion, In-Car Embedded Electronic Architectures: How to Ensure Their Safety. In Proceedings of the 5th IFAC International Conference on Fieldbus Systems and their Applications (FeT 2003), July 2003, Aveiro, Portugal.
[46] X-by-Wire Project, Brite-EuRam III Program, X-By-Wire: Safety Related Fault Tolerant Systems in Vehicles, Final report, 1998.
[47] TTTech Computertechnik GmbH, Time-Triggered Protocol TTP/C, High-Level Specification Document, Protocol Version 1.1, November 2003. www.tttech.com.
[48] FlexRay Consortium, FlexRay Communication System, Protocol Specification, Version 2.0, June 2004. www.flexray.com.
[49] H. Kopetz and G. Bauer, The Time Triggered Architecture. Proceedings of the IEEE, 91, 112-126, 2003.
[50] H. Kopetz et al., Specification of the TTP/A Protocol, University of Technology, Vienna, 2002.
[51] International Organization for Standardization, ISO 11898-4, Road Vehicles - Controller Area Network (CAN) - Part 4: Time-Triggered Communication, ISO, 2000.
[52] Society of Automotive Engineers, J2056/1 Class C Application Requirement Classifications. In SAE Handbook, SAE, 1994.
[53] Society of Automotive Engineers, J2056/2 Survey of Known Protocols. In SAE Handbook, Vol. 2, SAE, 1994.
[54] Antal Rajnak, The LIN Standard. In The Industrial Communication Technology Handbook, CRC Press, Boca Raton, FL, 2005.
[55] Society of Automotive Engineers, Class B Data Communications Network Interface - SAE J1850 Standard, rev. Nov. 1996, 1996.
[56] International Organization for Standardization, ISO 11519-2, Road Vehicles - Low Speed Serial Data Communication - Part 2: Low Speed Controller Area Network, ISO, 1994.
[57] International Organization for Standardization, ISO 11519-3, Road Vehicles - Low Speed Serial Data Communication - Part 3: Vehicle Area Network (VAN), ISO, 1994.
[58] International Organization for Standardization, ISO 11898, Road Vehicles - Interchange of Digital Information - Controller Area Network for High-Speed Communication, ISO, 1994.
[59] SAE J1939 Standards Collection. www.sae.org.
[60] MOST Cooperation, MOST Specification Revision 2.3, August 2004. www.mostnet.de.
[61] www.idbforum.org.
[62] K. Keutzer, S. Malik, A.R. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli, System Level Design: Orthogonalization of Concerns and Platform-Based Design. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19(12), 1523-1543, 2000.
[63] OSEK Consortium, OSEK/VDX Operating System, Version 2.2.2, July 2004. www.osek-vdx.org.
[64] OSEK Consortium, OSEK/VDX Communication, Version 3.0.3, July 2004. www.osek-vdx.org.
[65] OSEK Consortium, OSEK/VDX Fault-Tolerant Communication, Version 1.0, July 2001.
www.osek-vdx.org.
[66] www.automotive-his.de.
[67] www.asam.de.
[68] www.misra.org.uk.
[69] International Electrotechnical Commission, IEC 61508:2000, Parts 1-7, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related Systems, 2000.
[70] Bluetooth Consortium, Specication of the Bluetooth System, 1999. www.bluetooth.org.
[71] Bluetooth Special Interest Group, Specication of the Bluetooth System, Version 1.1, December 1999.
[72] LAN/MAN Standards Committee, IEEE Standard for Information Technology - Telecommunications and Information Exchange between Systems - Local and Metropolitan Area Networks - Specific Requirements - Part 15.4: Wireless Medium Access Control (MAC) and Physical Layer
(PHY) Specifications for Low Rate Wireless Personal Area Networks (LR-WPANs), IEEE Computer Society, Washington, 2003.
[73] LAN/MAN Standards Committee of the IEEE Computer Society, IEEE Standard for Information Technology - Telecommunications and Information Exchange between Systems - Local and Metropolitan Networks - Specific Requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications: Higher Speed Physical Layer (PHY) Extension in the 2.4 GHz Band, 1999.
[74] LAN/MAN Standards Committee of the IEEE Computer Society, Information Technology - Telecommunications and Information Exchange between Systems - Local and Metropolitan Area Networks - Specific Requirements - Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, 1999.
[75] Institute of Electrical and Electronics Engineers, Part 11: Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, Amendment 4: Further Higher Data Rate Extension in the 2.4 GHz Band, ANSI/IEEE Std 802.11, June 2003.
[76] Christoffer Apneseth, Dacfey Dzung, Snorre Kjesbu, Guntram Scheible, and Wolfgang Zimmermann, Introducing Wireless Proximity Switches. ABB Review, 4, 42-49, 2002. www.abb.com/review.
[77] Dacfey Dzung, Christoffer Apneseth, and Jan Endresen, A Wireless Sensor/Actuator Communication System for Real-Time Factory Applications. IEEE Transactions on Industrial Electronics (submitted).
2 Real-Time in Embedded Systems
Hans Hansson,
Mikael Nolin,
and Thomas Nolte
Mälardalen University
2.1 Introduction
2.2 Design of RTSs
    Reference Architecture • Models of Interaction • Execution Strategies • Component-Based Design • Tools for Design of RTSs
2.3 Real-Time Operating Systems
    Typical Properties of RTOSs • Mechanisms for Real-Time • Commercial RTOSs
2.4 Real-Time Scheduling
    Introduction to Scheduling • Offline Schedulers • Online Schedulers
2.5 Real-Time Communications
    Communication Techniques • Fieldbuses • Ethernet for Real-Time Communication • Wireless Communication
2.6 Analysis of RTSs
    Timing Properties • Methods for Timing Analysis • Example of Analysis • Trends and Tools
2.7 Component-Based Design of RTS
    Timing Properties and CBD • Real-Time Operating Systems • Real-Time Scheduling
2.8 Testing and Debugging of RTSs
2.9 Summary
References
In this chapter we will provide an introduction to issues, techniques, and trends in real-time systems (RTSs). We will specifically discuss the design of RTSs, real-time operating systems (RTOSs), real-time scheduling, real-time communication, real-time analysis, as well as testing and debugging of RTSs. For each of these areas, state-of-the-art tools and standards are presented.
2.1 Introduction
Consider the airbag in the steering wheel of your car. After the detection of a crash (and only then), it should inflate just in time to softly catch your head and prevent it from hitting the steering wheel; not too early, since this would make the airbag deflate before it can catch you; nor too late, since the exploding
airbag could then injure you by blowing up in your face and/or catch you too late to prevent your head
from banging into the steering wheel.
The computer-controlled airbag system is an example of a RTS. But RTSs come in many different flavors, including vehicles, telecommunication systems, industrial automation systems, household appliances, etc. There is no commonly agreed upon definition of what a RTS is, but the following characterization is (almost) universally accepted:

• RTSs are computer systems that physically interact with the real world.
• RTSs have requirements on the timing of these interactions.

Typically, the real-world interactions are via sensors and actuators, rather than the keyboard and screen of your standard PC.
Real-time requirements typically express that an interaction should occur within a specified timing bound. It should be noted that this is quite different from requiring the interaction to be as fast as possible.
Essentially all RTSs are embedded in products, and the vast majority of embedded computer systems are RTSs. RTSs are the dominating application of computer technology, as more than 99% of the manufactured processors (more than 8 billion in 2000 [1]) are used in embedded systems.
Returning to the airbag system, we note that, in addition to being a RTS, it is a safety-critical system, that is, a system that, owing to severe risks of damage, has strict Quality of Service (QoS) requirements, including requirements on the functional behavior, robustness, reliability, and timeliness.
A typical strict timing property could be that a certain response to an interaction must always occur within some prescribed time; for example, the charge in the airbag must detonate between 10 and 20 msec from the detection of a crash. Violating this must be avoided at any cost, since it would lead to something unacceptable, such as having to spend a couple of months in hospital. A system that is designed to meet strict timing requirements is often referred to as a hard RTS. In contrast, systems for which occasional timing failures are acceptable, possibly because they will not lead to anything terrible, are termed soft RTSs.
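The 10 to 20 msec airbag window is a simple example of a hard timing constraint. A monitor for it is essentially a one-liner; the function name and the millisecond interface below are ours, purely for illustration:

```python
def airbag_timing_ok(detonation_delay_ms):
    """Check the hard real-time window from the text: the airbag
    charge must detonate between 10 and 20 msec after crash
    detection. Outside this window the requirement is violated."""
    return 10.0 <= detonation_delay_ms <= 20.0

print(airbag_timing_ok(15.0))   # prints True: inside the window
print(airbag_timing_ok(25.0))   # prints False: too late, a hard deadline miss
```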
An illustrative comparison between hard and soft RTSs that highlights the difference between the extremes is shown in Table 2.1. A typical hard RTS could in this context be an engine control system, which must operate with μsec precision, and which will severely damage the engine if timing requirements fail by more than a few msec. A typical soft RTS could be a banking system, for which timing is important, but where there are no strict deadlines and some variations in timing are acceptable.
Unfortunately, it is impossible to build real systems that satisfy hard real-time requirements, since, owing to the imperfection of hardware (and designers), any system may break. The best that can be achieved is a system that, with very high probability, provides the intended behavior during a finite interval of time. However, on the conceptual level hard real-time makes sense, since it implies a certain amount of rigor in the way the system is designed; for example, it implies an obligation to prove that the strict timing requirements are met.
TABLE 2.1 Typical Characteristics of Hard- and Soft-RTSs [2]
Characteristic Hard real-time Soft real-time
Timing requirements Hard Soft
Pacing Environment Computer
Peak-load performance Predictable Degraded
Error detection System User
Safety Critical Noncritical
Redundancy Active Standby
Time granularity Millisecond Second
Data files Small Large
Data integrity Short term Long term
Since the early 1980s a substantial research effort has provided a sound theoretical foundation (e.g., [3, 4]) and many practically useful results for the design of hard RTSs. Most notably, hard RTS scheduling has evolved into a mature discipline, using abstract, but realistic, models of tasks executing on single-CPU, multiprocessor, or distributed computer systems, together with associated methods for timing analysis. Such schedulability analyses, for example, the well-known rate-monotonic analysis [5-7], have also found significant use in some industrial segments.
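A classic ingredient of rate-monotonic analysis is the Liu and Layland utilization bound, U <= n(2^(1/n) - 1), a sufficient (but not necessary) schedulability test. The task set below is illustrative, not taken from the text:

```python
# Sufficient (not necessary) rate-monotonic schedulability test using
# the Liu-Layland utilization bound U <= n * (2^(1/n) - 1). The task
# parameters are invented for illustration.

def rm_utilization_test(tasks):
    """tasks: list of (worst_case_execution_time, period) pairs with
    deadlines equal to periods. Returns (utilization, bound,
    schedulable_by_bound)."""
    n = len(tasks)
    u = sum(c / t for c, t in tasks)          # total CPU utilization
    bound = n * (2 ** (1 / n) - 1)            # Liu-Layland bound
    return u, bound, u <= bound

tasks = [(1, 4), (1, 5), (2, 10)]             # (C_i, T_i) in msec
u, bound, ok = rm_utilization_test(tasks)
print(f"U = {u:.3f}, bound = {bound:.3f}, schedulable: {ok}")
```

Here U = 0.65 falls below the n = 3 bound of about 0.780, so the set is schedulable under rate-monotonic priorities; a set that fails the bound may still be schedulable, but needs exact response-time analysis.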
However, hard real-time scheduling is not the cure for all RTSs. Its main weakness is that it is based on analysis of the worst possible scenario. For safety-critical systems this is of course a must, but for other systems, where general customer satisfaction is the main criterion, it may be too costly to design the system for a worst-case scenario that may not occur during the system's lifetime.
If we look at the other end of the spectrum, we find the best-effort approach, which is still the dominating approach in industry. The essence of this approach is to implement the system using some best practice, and then use measurements, testing, and tuning to make sure that the system is of sufficient quality. On the one hand, such a system will hopefully satisfy some soft real-time requirements; the weakness being that we do not know which. On the other hand, compared with the hard real-time approach, the system can be better optimized for the available resources. A further difference is that hard RTS methods are essentially applicable to static configurations only, whereas it is less problematic to handle dynamic task creation, etc., in best-effort systems.
Having identified the weaknesses of the hard real-time and best-effort approaches, major efforts are
now being put into more flexible techniques for soft RTSs. These techniques provide analyzability (like hard
real-time), together with flexibility and resource efficiency (like best-effort). The basis for the flexible
techniques is often quantified QoS characteristics. These are typically related to nonfunctional aspects,
such as timeliness, robustness, dependability, and performance. To provide a specified QoS, some sort of
resource management is needed. Such QoS management is handled either by the application, by the
operating system (OS), by some middleware, or by a mix of the above. The QoS management is often a
flexible online mechanism that dynamically adapts the resource allocation to balance between conflicting
QoS requirements.
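As a rough illustration of such online QoS management, the sketch below adapts a task's period when measured CPU utilization drifts away from its budget. The structure, field names, thresholds, and the doubling/halving policy are illustrative assumptions, not taken from any particular middleware or OS:

```c
/* Hypothetical online QoS adaptation: when measured utilization exceeds
 * the budget, degrade a task's rate (longer period); when there is ample
 * slack, restore it. All names and thresholds are illustrative. */
typedef struct {
    unsigned period_ms;      /* current period */
    unsigned min_period_ms;  /* best QoS */
    unsigned max_period_ms;  /* worst acceptable QoS */
} qos_task;

void qos_adapt(qos_task *t, double measured_util, double budget)
{
    if (measured_util > budget && t->period_ms < t->max_period_ms)
        t->period_ms *= 2;   /* degrade: halve the rate */
    else if (measured_util < 0.5 * budget && t->period_ms > t->min_period_ms)
        t->period_ms /= 2;   /* restore: double the rate */
}
```

A real QoS manager would of course balance several tasks against each other rather than adapt each in isolation.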
2.2 Design of RTSs
The main issue in designing RTSs is timeliness, that is, ensuring that the system performs its operations at the
proper points in time. Not considering timeliness at the design phase will make it virtually impossible to analyze
and predict the timing behavior of the RTS. This section presents some important architectural issues for
embedded RTSs, together with some supporting commercial tools.
2.2.1 Reference Architecture
A generic system architecture for a RTS is depicted in Figure 2.1. This architecture is a model of any
computer-based system interacting with an external environment via sensors and actuators.
Since our focus is on the RTS we will look more into different organizations of that part of the generic
architecture in Figure 2.1. The simplest RTS is a single processor, but in many cases the RTS is a distributed
computer system consisting of a set of processors interconnected by a communications network. There
could be several reasons for making an RTS distributed, including:
The physical distribution of the application.
The computational requirements that may not be conveniently provided by a single CPU.
The need for redundancy to meet availability, reliability, or other safety requirements.
To reduce the cabling in the system.
Figure 2.2 shows an example of a distributed RTS. In a modern car, like the one depicted in the figure,
there are some 20 to 100 computer nodes (which in the automotive industry are called Electronic Control
FIGURE 2.1 A generic RTS architecture.
FIGURE 2.2 Network infrastructure of Volvo XC90.
Units [ECUs]) interconnected with one or more communication networks. The initial motivation for
this type of electronic architecture in cars was the need to reduce the amount of cabling. However, the
electronic architecture has also led to other significant improvements, including substantial pollution
reduction and new safety mechanisms, such as computer controlled Electronic Stabilization Programs
(ESPs). The current development is toward making the most safety-critical vehicle functions, such as
braking and steering, completely computer controlled. This is done by removing the mechanical connections
(e.g., between the steering wheel and front wheels, and between the brake pedal and brakes), replacing them
with computers and computer networks. Meeting the stringent safety requirements for such functions
will require careful introduction of redundancy mechanisms in hardware and communication, as well
as software, that is, a safety-critical system architecture is needed (an example of such an architecture is
TTA [8]).
2.2.2 Models of Interaction
In Section 2.2.1 we presented the physical organization of a RTS, but for an application programmer this
is not the most important aspect of the system architecture. Actually, from an application programmer's
perspective the system architecture is given more by the execution paradigm (execution strategy) and the
interaction model used in the system. In this section we describe what an interaction model is and how it
affects the real-time properties of a system, and in Section 2.2.3 we discuss the execution strategies used
in RTSs.
A model of interaction describes the rules by which components interact with each other (in this
section we will use the term component to denote any type of software unit, such as a task or a
module). The interaction model can govern both control flow and data flow between system components.
One of the most important design decisions, for all types of systems, is which interaction
models to use (sadly, however, this decision is often implicit and hidden in the system's architectural
description).
When designing RTSs, attention should be paid to the timing properties of the interaction models
chosen. Some models have a more predictable and robust behavior with respect to timing than others.
Examples of the more predictable models that are commonly used in RTS design are
pipes-and-filters, publisher-subscriber, and blackboard.
At the other end of the spectrum of interaction models, there are models that increase the (timing)
unpredictability of the system. These models should, if possible, be avoided when designing RTSs. The
two most notable, and commonly used, are client-server and message boxes.
2.2.2.1 Pipes-and-Filters
In this model, both data and control ow is specied using input and output ports of components.
A component becomes eligible for execution when data has arrived on its input ports and when the
component nishes execution it produces output on its output ports.
This model ts well for many types of control programs, and control laws are easily mapped to this
interaction model. Hence, it has gained widespread use in the real-time community. The real-time
properties of this model are also quite nice. Since both data and control ows unidirectionally through a
series of components, the order of execution and end-to-end timing delay usually becomes predictable.
The model alsoprovides a highdegree of decoupling intime; that is, components canoftenexecute without
having to worry about timing delays caused by other components. Hence, it is usually straightforward to
specify the compound timing behavior of set of components.
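A minimal sketch of the pipes-and-filters idea, with illustrative port and filter names: a stage runs only when fresh data is present on its input port, and its output port feeds the next stage in the chain.

```c
#include <stdbool.h>

/* A port carries one value plus a freshness flag that drives control flow. */
typedef struct { int value; bool fresh; } port;

static void write_port(port *p, int v) { p->value = v; p->fresh = true; }

/* One filter stage: consumes its input, produces output on its output
 * port, and returns true only if it actually executed. */
static bool scale_filter(port *in, port *out, int gain)
{
    if (!in->fresh)
        return false;              /* not eligible: no new input */
    in->fresh = false;             /* consume the input */
    write_port(out, in->value * gain);
    return true;
}
```

Chaining such stages (sensor port into filter, filter into actuator port) makes the end-to-end delay the sum of the stage delays, which is what gives the model its predictable timing.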
2.2.2.2 Publisher-Subscriber
The publisher-subscriber model is similar to the pipes-and-filters model, but it usually decouples data
and control flow. That is, a subscriber can usually choose different forms of triggering for its execution.
If the subscriber chooses to be triggered on each new published value, the publisher-subscriber model
takes on the form of the pipes-and-filters model. However, a subscriber could instead choose to
ignore the timing of the published values and simply use the latest published value. Also, in the
publisher-subscriber model, the publisher is not necessarily aware of the identity, or even the existence,
of its subscribers. This provides a higher degree of decoupling of components.
Similar to the pipes-and-filters model, the publisher-subscriber model provides good timing properties.
However, a prerequisite for analysis of systems using this model is that subscriber components make
explicit the values they subscribe to (this is not mandated by the model itself). When using
the publisher-subscriber model for embedded systems, it is the norm that subscription information is
available (this information is used, for instance, to decide the values that are to be published over a
communications network, and to decide the receiving nodes of those values).
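The decoupling of data and control flow can be sketched as follows (topic and subscriber names are illustrative): an event-triggered subscriber is invoked on every publication, while a sampling subscriber simply reads the latest value whenever it happens to run.

```c
#include <stddef.h>

typedef void (*subscriber_cb)(int value);

typedef struct {
    int latest;                /* data flow: last published value */
    subscriber_cb on_publish;  /* control flow: NULL if nobody is triggered */
} topic;

static int last_seen;                        /* demo subscriber state */
static void log_subscriber(int v) { last_seen = v; }

void publish(topic *t, int v)
{
    t->latest = v;             /* always retain the latest value */
    if (t->on_publish)
        t->on_publish(v);      /* trigger an event-driven subscriber */
}

/* A sampling subscriber ignores publication timing and reads on demand. */
int sample_latest(const topic *t) { return t->latest; }
```

Note that the publisher calls neither subscriber by name; it only touches the topic, which is what gives the model its decoupling.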
2.2.2.3 Blackboard
The blackboard model allows variables to be published on a globally available blackboard area.
Thus, it resembles the use of global variables. The model allows any component to read or write
values to variables in the blackboard. Hence, the software engineering qualities of the blackboard
model are questionable. Nevertheless, it is a model that is commonly used, and in some situations it
provides a pragmatic solution to problems that are difficult to address with more stringent interaction
models.
Software engineering aspects aside, the blackboard model does not introduce any extra elements of
unpredictable timing. On the other hand, the flexibility of the model does not help engineers to achieve
predictable systems. Since the model does not address control flow, components can execute relatively
undisturbed and decoupled from other components.
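A blackboard can be sketched as a global table of named entries that any component may read or write; the entry names, table size, and API below are illustrative.

```c
#include <string.h>

#define BB_ENTRIES 8

/* Globally shared blackboard: any component may read or write any entry. */
static struct { char name[16]; double value; int used; } blackboard[BB_ENTRIES];

int bb_write(const char *name, double value)
{
    for (int i = 0; i < BB_ENTRIES; i++) {        /* update existing entry */
        if (blackboard[i].used && strcmp(blackboard[i].name, name) == 0) {
            blackboard[i].value = value;
            return 0;
        }
    }
    for (int i = 0; i < BB_ENTRIES; i++) {        /* or claim a free slot */
        if (!blackboard[i].used) {
            strncpy(blackboard[i].name, name, sizeof blackboard[i].name - 1);
            blackboard[i].used = 1;
            blackboard[i].value = value;
            return 0;
        }
    }
    return -1;  /* blackboard full */
}

int bb_read(const char *name, double *out)
{
    for (int i = 0; i < BB_ENTRIES; i++) {
        if (blackboard[i].used && strcmp(blackboard[i].name, name) == 0) {
            *out = blackboard[i].value;
            return 0;
        }
    }
    return -1;  /* no such entry */
}
```

The sketch deliberately imposes no control flow: nothing is triggered by a write, which is exactly why the model neither adds timing unpredictability nor helps enforce predictability.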
2.2.2.4 Client-Server
In the client-server model, a client asynchronously invokes a service of a server. The service invocation
passes the control flow (plus any input data) to the server, and control stays at the server until it has
completed the service. When the server is done, the control flow (and any return data) is returned to the
client, which in turn resumes execution.
The client-server model has inherently unpredictable timing. Since services are invoked asynchronously,
it is very difficult to a priori assess the load on the server for a certain service invocation. Thus, it is difficult
to estimate the delay of the service invocation and, in turn, difficult to estimate the response time of
the client. This matter is further complicated by the fact that most components often behave both
as clients and as servers (a server often uses other servers to implement its own services), leading to very
complex and unanalyzable control flow paths.
2.2.2.5 Message Boxes
A component can have a set of message boxes, and components communicate by posting messages in each
other's message boxes. Messages are typically handled in First In First Out (FIFO) order, or in priority
order (where the sender specifies a priority). Message passing does not change the flow of control for
the sender. A component that tries to receive a message from an empty message box, however, blocks on
that message box until a message arrives (often the receiver can specify a timeout to prevent indefinite
blocking).
From a sender's point of view, the message-box model has problems similar to those of the client-server model.
The data sent by the sender (and the action that the sender expects the receiver to perform) may be delayed
in an unpredictable way when the receiver is highly loaded. Also, the asynchronous nature of the message
passing makes it difficult to foresee the load of a receiver at any particular moment.
Furthermore, from the receiver's point of view, the reading of message boxes is unpredictable in the
sense that the receiver may or may not block on the message box. Also, since message boxes often are of
limited size, there is a risk that a highly loaded receiver loses some messages. Lost messages are another
source of unpredictability.
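The mechanics can be sketched as a fixed-capacity FIFO box per component. To keep the sketch single-threaded, receiving from an empty box returns an error instead of blocking; in a real RTOS the receiver would block, possibly with a timeout. Capacity and names are illustrative.

```c
#define BOX_CAP 4

/* A fixed-capacity FIFO message box (ring buffer). */
typedef struct {
    int buf[BOX_CAP];
    int head, count;
} msgbox;

int box_send(msgbox *b, int msg)
{
    if (b->count == BOX_CAP)
        return -1;                         /* full: the message is lost */
    b->buf[(b->head + b->count) % BOX_CAP] = msg;
    b->count++;
    return 0;                              /* sender never blocks */
}

int box_recv(msgbox *b, int *msg)
{
    if (b->count == 0)
        return -1;                         /* a real receiver would block here */
    *msg = b->buf[b->head];
    b->head = (b->head + 1) % BOX_CAP;
    b->count--;
    return 0;
}
```

The two error paths correspond directly to the unpredictability discussed above: the full-box path loses messages under receiver overload, and the empty-box path is where a real receiver blocks for an unknown time.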
2.2.3 Execution Strategies
There are two main execution paradigms for RTSs: time-triggered and event-triggered. On one hand, when
using time-triggered execution, activities occur at predefined instances of time; for example, a specific
sensor value is read exactly every 10 msec and, 2 msec later, the corresponding actuator receives an
updated control parameter. In an event-triggered execution, on the other hand, actions are triggered by
event occurrences; for example, when the toxic fluid in a tank reaches a certain level, an alarm will go
off. It should be noted that the same functionality can typically be implemented in both paradigms;
for example, a time-triggered implementation of the above alarm would be to periodically read the level-
measuring sensor and activate the alarm when the read level exceeds the maximum allowed. If alarms
are rare, the time-triggered version will have a much higher computational overhead than the event-
triggered one. On the other hand, the periodic sensor readings will facilitate detection of a malfunctioning
sensor.
Time-triggered executions are used in many safety-critical systems with high dependability require-
ments (such as avionic control systems), whereas the majority of other systems are event-triggered.
Dependability can also be guaranteed in the event-triggered paradigm, but owing to the observability
provided by the exact timing of time-triggered executions, most experts argue for using the time-triggered
paradigm in ultra-dependable systems. The main argument against time-triggered execution is its lack of flexibility and the
requirement of pre-runtime schedule generation (which is a nontrivial and possibly time-consuming task).
Time-triggered systems are mostly implemented by simple proprietary table-driven dispatchers [9]
(see Section 2.4.2 for a discussion on table-driven execution), but complete commercial systems including
design tools are also available [10, 11]. For the event-triggered paradigm a large number of commercial
tools and OSs are available (examples are given in Section 2.3.3). There are also examples of systems
integrating the two execution paradigms, thereby aiming at getting the best of both worlds: time-triggered
dependability and event-triggered exibility. One example is the Basement system [12] and its associated
real-time kernel Rubus [13].
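A table-driven dispatcher of the kind mentioned above can be sketched as follows: an offline-generated table lists which task runs in each minor slot of the schedule, and the runtime dispatcher merely indexes the table. The tasks, slot length, and table contents are illustrative.

```c
typedef void (*task_fn)(void);

static void sample_sensor(void)  { /* read inputs */ }
static void run_control(void)    { /* compute control law */ }
static void drive_actuator(void) { /* write outputs */ }

/* One hyperperiod of four slots (e.g., 10 msec each), generated offline
 * by a configuration tool so that all timing requirements are met. */
static task_fn schedule_table[] = {
    sample_sensor, run_control, drive_actuator, run_control
};
#define SLOTS (sizeof schedule_table / sizeof schedule_table[0])

/* Called from a periodic timer interrupt, once per slot. */
task_fn dispatch(unsigned tick)
{
    return schedule_table[tick % SLOTS];
}
```

All arbitration intelligence lives in the offline tool that filled the table; the runtime part is trivially simple and therefore very predictable.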
Since computations in time-triggered systems are statically allocated both in space (to a specific
processor) and in time, some sort of configuration tool is often used. This tool assumes that the
computations are packaged into schedulable units (corresponding to tasks or threads in an event-triggered
system). Typically, for example, in Basement, computations are control-flow based, in the sense that
they are defined by sequences of schedulable units, each unit performing a computation based on its
inputs and producing outputs to the next unit in the sequence. The system is configured by defining the
sequences and their timing requirements. The configuration tool will then automatically (if possible)
generate a schedule that guarantees that all timing requirements are met.¹
Event-triggered systems typically have richer and more complex Application Programming Interfaces
(APIs), defined by the OS and middleware used; these will be elaborated on in Section 2.3.
2.2.4 Component-Based Design
Component-Based Design (CBD) of software systems is an interesting approach for software engineering
in general, and for engineering of RTSs in particular. In CBD, a software component is used to encapsulate
some functionality. That functionality is only accessed through the interface of the component. A system
is composed by assembling a set of components and connecting their interfaces.
The reason CBD could prove extra useful for RTSs is the possibility of extending components with introspective
interfaces. An introspective interface does not provide any functionality per se; rather, the interface
can be used to retrieve information about extra-functional properties of the component. Extra-functional
properties can include attributes such as memory consumption, execution times, task periods, etc. For
RTSs, timing properties are of course of particular interest.
Unlike the functional interfaces of components, the introspective interfaces can be available offline,
that is, during the component assembly phase. This way, the timing attributes of the system components
can be obtained at design time, and tools to analyze the timing behavior of the system can be used. If the
introspective interfaces are also available online, they could be used in, for instance, admission control
algorithms. An admission controller could query new components for their timing behavior and resource
consumption before deciding to accept a new component into the system.
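The admission-control use of introspective interfaces can be sketched as follows: each component exposes its timing attributes, and the controller admits a new component only if the resulting total utilization stays within a bound. The property record, fixed-point arithmetic, and bound are illustrative assumptions.

```c
/* What a component's introspective interface might report. */
typedef struct {
    unsigned wcet_us;    /* worst-case execution time */
    unsigned period_us;  /* activation period */
} rt_properties;

/* Admission test: accept the candidate only if the total utilization of
 * existing components plus the candidate stays at or below the bound
 * (expressed in per mille to avoid floating point). */
int admit(const rt_properties *existing, int n,
          const rt_properties *candidate, unsigned bound_permille)
{
    unsigned long u = 0;
    for (int i = 0; i < n; i++)
        u += 1000UL * existing[i].wcet_us / existing[i].period_us;
    u += 1000UL * candidate->wcet_us / candidate->period_us;
    return u <= bound_permille;
}
```

At design time the same attributes would instead feed an offline schedulability analysis of the whole assembly.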
Unfortunately, many industry-standard software techniques are based on the client-server or the
message-box models of interaction, which we deemed, in Section 2.2.2, unfit for RTSs. This is especially
true for the most commonly used component models. For instance, the Corba Component Model
(CCM) [14], Microsoft's COM [15] and .NET [16] models, and Java Beans [17] all have the client-server
model as their core model. Also, none of these component technologies allows the specification of extra-
functional properties through introspective interfaces. Hence, from the real-time perspective, the biggest
advantage of CBD is void for these technologies.
However, there are numerous research projects addressing CBD for real-time and embedded systems
(e.g., [18–21]). These projects address the issues left behind by the existing commercial technologies,
such as timing predictability (using suitable computational models), support for offline analysis of
¹ This scheduling problem is theoretically intractable, so the configuration tool has to rely on heuristics that work
well in practice, but that are not guaranteed to find a solution in all cases where one exists.
component assemblies, and better support for resource-constrained systems. Often, these projects strive
to remove the considerable runtime flexibility provided by existing technologies. This runtime flexibility
is judged to be the foremost contributor to unpredictability (the flexibility also adds to the runtime
complexity and prevents the use of CBD in resource-constrained systems).
2.2.5 Tools for Design of RTSs
In industry, the term real-time system is highly overloaded, and can mean anything from interactive
systems to superfast systems or embedded systems. Consequently, it is not easy to judge which tools are
suitable for developing RTSs (as we define real-time in this chapter).
For instance, UML [22] is commonly used for software design. However, UML's focus is mainly
on client-server solutions, and it has proven inapt for RTS design. As a consequence, UML-based
tools that extend UML with constructs suitable for real-time programs have emerged. The two best-
known products are Rational's Rose RealTime [23] and i-Logix's Rhapsody [24]. These tools provide
UML support with the extension of real-time profiles. While giving real-time engineers access to suitable
abstractions and computational models, these tools do not provide means to describe timing
properties or requirements in a formal way; thus they do not allow automatic verification of timing
requirements.
TeleLogic provides programming and design support using the language SDL [25]. SDL was originally
developed as a specification language for the telecom industry, and is as such highly suitable for describing
complex reactive systems. However, its fundamental model of computation is the message-box model,
which has an inherently unpredictable timing behavior. Nevertheless, for soft embedded RTSs, SDL can give
very time- and space-efficient implementations.
For more resource-constrained hard RTSs, design tools are provided by, for example, Arcticus
Systems [13], TTTech [10], and Vector [26]. These tools are instrumental during both system design
and implementation, and also provide some timing analysis techniques that allow timing verification of
the system (or parts of the system). However, these tools are based on proprietary formats and processes,
and have as such reached a limited customer base (mainly within the automotive industry).
In the near future, UML2 will become an adopted standard [27]. UML2 has support for computational
models suitable for RTSs. This support comes mainly in the form of ports that can have protocols
associated with them. Ports are either provided or required, hence allowing type-matching of connections
between components. UML2 also includes many of the concepts from Rose RealTime, Rhapsody, and
SDL. Other future design techniques that are expected to have an impact on the design of RTSs include
the EAST/EEA Architecture Description Language (EAST-ADL) [28]. The EAST-ADL is developed by the
automotive industry and is a description language that will cover the complete development cycle of
distributed, resource-constrained, safety-critical RTSs. Tools to support development with EAST-ADL
(which is a UML2-compliant language) are expected to be provided by automotive tool vendors such as
ETAS [29], Vector [30], and Siemens [31].
2.3 Real-Time Operating Systems
An RTOS provides services for resource access and resource sharing, very similar to a general-purpose
OS. An RTOS, however, provides additional services suited for real-time development and also supports
the development process for embedded systems. Using a general-purpose OS when developing RTSs has
several drawbacks:
High resource utilization, for example, large RAM and ROM footprints and high internal
CPU demand.
Difficult access to hardware and devices in a timely manner, for example, no application-level
control over interrupts.
Lack of services to allow timing-sensitive interactions between different processes.
2.3.1 Typical Properties of RTOSs
The state of practice in RTOSs is reflected in Reference 32. Not all OSs are RTOSs. An RTOS is typically multi-
threaded and preemptible, there has to be a notion of thread priority, predictable thread synchronization
has to be supported, priority inheritance should be supported, and the OS behavior should be known [33].
This means that the interrupt latency, the worst-case execution time (WCET) of system calls, and the maximum
time during which interrupts are masked must be known. A commercial RTOS is usually marketed as the
runtime component of an embedded development platform.
As a general rule of thumb one can say that RTOSs are:
Suitable for resource-constrained environments. RTSs typically operate in such environments. Most
RTOSs can be configured pre-runtime (e.g., at compile time) to include only a subset of the total
functionality. Thus, the application developer can choose to leave out unused portions of the RTOS
in order to save resources. RTOSs typically store much of their configuration in ROM. This is done
mainly for two purposes: (1) to minimize the use of expensive RAM memory and (2) to minimize the
risk that critical data is overwritten by an erroneous application.
Giving the application programmer easy access to hardware features. These include interrupts and
devices. Most often, RTOSs give the application programmer means to install Interrupt Service
Routines at compile time and/or at runtime. This means that the RTOS leaves all
interrupt handling to the application programmer, allowing fast, efficient, and predictable handling
of interrupts. In general-purpose OSs, memory-mapped devices are usually protected from direct
access using the MMU (Memory Management Unit) of the CPU, hence forcing all device accesses
to go through the OS. RTOSs typically do not protect such devices, but allow the application to
directly manipulate them. This gives faster and more efficient access to the devices. (However,
this efficiency comes at the price of an increased risk of erroneous use of the device.)
Providing services that allow implementation of timing-sensitive code. An RTOS typically has many
mechanisms to control the relative timing between different processes in the system. Most notably,
an RTOS has a real-time process scheduler whose function is to make sure that the processes
execute in the way the application programmer intended them to. We will elaborate more on the
issues of scheduling in Section 2.4. An RTOS also provides mechanisms to control the processes'
relative performance when accessing shared resources. This can, for instance, be done by priority
queues instead of the plain FIFO queues used in general-purpose OSs. Typically, an RTOS supports
one or more real-time resource locking protocols, such as priority inheritance or priority ceiling
(Section 2.3.2 discusses resource locking protocols further).
Tailored to fit the embedded systems development process. RTSs are usually constructed in a host
environment that is different from the target environment, so-called cross-platform development.
Also, it is typical that the whole memory image, including both the RTOS and one or more applications,
is created on the host platform and downloaded to the target platform. Hence, most RTOSs
are delivered as source code modules or precompiled libraries that are statically linked with the
applications at compile time.
2.3.2 Mechanisms for Real-Time
One of the most important functions of an RTOS is to arbitrate access to shared resources in such a way
that the timing behavior of the system becomes predictable. The two most obvious resources that the RTOS
manages access to are:
The CPU, that is, the RTOS should allow processes to execute in a predictable manner.
Shared memory areas, that is, the RTOS should resolve contention to shared memory in a way
that gives predictable timing.
The CPU access is arbitrated with a real-time scheduling policy. Section 2.4 will, in more depth,
describe real-time scheduling policies. Examples of scheduling policies that can be used in RTSs
are priority scheduling, deadline scheduling, or rate scheduling. Some of these policies directly use
timing attributes (like deadline) of the tasks to perform scheduling decisions, whereas other policies
use scheduling parameters (like priority, rate, or bandwidth) that indirectly affect the timing of the
tasks.
A special form of scheduling, which is also very useful for RTSs, is table-driven (static) scheduling.
Table-driven scheduling is described further in Section 2.4.2. To summarize, in table-driven scheduling
all arbitration decisions have been made offline, and the RTOS scheduler just follows a simple table. This
gives very good timing predictability, albeit at the expense of system flexibility.
The most important aspect of a real-time scheduling policy is that it should provide means to a priori
analyze the timing behavior of the system, hence giving a predictable timing behavior of the system.
Scheduling in general-purpose OSs normally emphasizes properties such as fairness, throughput, and
guaranteed progress; these properties may be adequate in their own respect; however, they are usually in
conflict with the requirement that an RTOS should provide timing predictability.
Shared resources (such as memory areas, semaphores, and mutexes) are also arbitrated by the RTOS.
When a task locks a shared resource, it will block all other tasks that subsequently try to lock the resource.
In order to achieve predictable blocking times, special real-time resource locking protocols have been
proposed ([34, 35] provide more details about the protocols).
2.3.2.1 Priority Inheritance Protocol
The priority inheritance protocol (PIP) makes a low-priority task inherit the priority of any higher-priority
task that becomes blocked on a resource locked by the lower-priority task.
This is a simple and straightforward method to lower the blocking time. However, it is computationally
intractable to calculate the worst-case blocking time (which may be infinite, since the protocol does not prevent
deadlocks). Hence, for hard RTSs, or when timing performance needs to be calculated a priori, the PIP is
not adequate.
2.3.2.2 Priority Ceiling Inheritance Protocol
The priority ceiling protocol (PCP) associates, to each resource, a ceiling value that is equal to the highest
priority of any task that may lock the resource. By clever use of the ceiling values of each resource, the
RTOS scheduler will manipulate task priorities to avoid the problems of PIP.
PCP guarantees freedom from deadlocks, and the worst-case blocking is relatively easy to calculate.
However, the computational complexity of keeping track of ceiling values and task priorities gives PCP
high runtime overhead.
2.3.2.3 Immediate Ceiling Priority Inheritance Protocol
The immediate inheritance protocol (IIP) also associates, to each resource, a ceiling value that is equal to
the highest priority of any task that may lock the resource. However, unlike PCP, in IIP a task is
immediately assigned the ceiling priority of the resource it is locking.
IIP has the same real-time properties as PCP (including the same worst-case blocking time).² However,
IIP is significantly easier to implement. It is, in fact, for single-node systems, easier to implement than
any other resource locking protocol (including non-real-time protocols). In IIP no actual locks need to be
implemented; it is enough for the RTOS to adjust the priority of the task that locks or releases a resource.
IIP has other operational benefits; notably, it paves the way for letting multiple tasks use the same stack
area. OSs based on IIP can be used to build systems with extremely small footprints [36, 37].
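The simplicity of IIP on a single CPU can be sketched in a few lines: "locking" is nothing more than raising the current task's priority to the resource ceiling, and "unlocking" restores the saved priority. Task and resource representations are illustrative.

```c
typedef struct { int priority; } task;
typedef struct { int ceiling; } resource;  /* highest priority of any user */

/* Lock under IIP: raise the task to the resource ceiling immediately.
 * Returns the priority to restore on unlock. Higher number = higher
 * priority in this sketch. No lock structure or wait queue is needed. */
int iip_lock(task *t, const resource *r)
{
    int saved = t->priority;
    if (r->ceiling > t->priority)
        t->priority = r->ceiling;
    return saved;
}

void iip_unlock(task *t, int saved_priority)
{
    t->priority = saved_priority;
}
```

Because a task holding a resource runs at the ceiling priority, no task that could also lock the resource can even start executing meanwhile, which is why no explicit lock is required on a single processor.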
2.3.3 Commercial RTOSs
There is an abundance of commercial RTOSs. Most of them provide adequate mechanisms to enable
development of RTSs. Some examples are Tornado/VxWorks [38], LYNX [39], OSE [40], QNX [41],
RT-Linux [42], and ThreadX [43]. However, the major problem with these OSs is the rich set of
² The average blocking time will, however, be higher in IIP than in PCP.
primitives provided. These systems provide both primitives that are suitable for RTSs and primitives
that are unfit for RTSs (or that should be used with great care). For instance, they usually provide
multiple resource locking protocols, some of which are suitable and some of which are not suitable for
real-time use.
This richness becomes a problem when these OSs are used by inexperienced engineers, when projects
are large, or when project management does not provide clear design guidelines or rules.
In these situations, it is very easy to use primitives that will contribute to the timing unpredictability of the
developed system. Rather, an RTOS should help engineers and project managers by providing only
mechanisms that help in designing predictable systems. However, there is an obvious conflict between
the desire/need of RTOS manufacturers to provide rich interfaces and the stringency needed by designers
of RTSs.
There is a smaller set of RTOSs that have been designed to resolve these problems, and at the same time
allow extremely lightweight implementations of predictable RTSs. The driving idea is to provide a small
set of primitives that guides the engineers toward a good design of their system. Typical examples are the
research RTOS Asterix [36] and the commercial RTOS SSX5 [37]. These systems provide a simplified task
model, in which tasks cannot suspend themselves (e.g., there is no sleep() primitive) and tasks are restarted
from their entry point on each invocation. The only resource locking protocol supported is IIP, and
the scheduling policy is fixed-priority scheduling. These limitations make it possible to build an RTOS
that is able to run, for example, ten tasks using less than 200 bytes of RAM, while at the same time giving
predictable timing behavior [44]. Other commercial systems that follow a similar principle of reducing
the degrees of freedom, and hence promote stringent design of predictable RTSs, include Arcticus Systems'
Rubus OS [13].
Many of the commercial RTOSs provide standard APIs. The most important RTOS standards are
RT-POSIX [45], OSEK [46], and APEX [47]. Here we will only deal with POSIX since it is the most widely
adopted RTOS standard, but those interested in automotive and avionic systems should take a closer look
at OSEK and APEX, respectively.
The POSIX standard is based on Unix, and its goal is portability of applications at the source code
level. The basic POSIX services include task and thread management, file system management, input
and output, and event notification via signals. The POSIX real-time interface defines services facilitating
concurrent programming and providing predictable timing behavior. Concurrent programming is
supported by synchronization and communication mechanisms that allow predictability. Predictable
timing behavior is supported by preemptive fixed-priority scheduling, time management with high
resolution, and virtual memory management. Several restricted subsets of the standard, intended for
different types of systems, have been defined, as well as specific language bindings, for example, for
Ada [48].
2.4 Real-Time Scheduling
Traditionally, real-time schedulers are divided into offline and online schedulers. Offline schedulers make
all scheduling decisions before the system is executed. At runtime a simple dispatcher is used to activate
tasks according to the offline-generated schedule. Online schedulers, on the other hand, decide during
execution, based on various parameters, which task should execute at any given time.
As a multitude of different schedulers have been developed in the research community, in this section we
focus on highlighting the main categories of schedulers that are readily available in existing RTOSs.
2.4.1 Introduction to Scheduling
An RTS consists of a set of real-time programs, each of which in turn consists of a set of tasks. These tasks are
sequential pieces of code, executing on a platform with limited resources. The tasks have different timing
2006 by Taylor & Francis Group, LLC
2-12 Embedded Systems Handbook
properties, for example, execution times, periods, and deadlines. Several tasks can be allocated to a single
processor. The scheduler decides, at each moment, which task to execute.
An RTS can be preemptive or nonpreemptive. In a preemptive system, tasks can preempt each other,
letting the task with the highest priority execute. In a nonpreemptive system, a task that has been allowed
to start will execute until its completion.
Tasks can be categorized as periodic, sporadic, or aperiodic. Periodic tasks execute with a
specified time (the period) between task releases. Aperiodic tasks have no information saying when the task is
to be released; usually, aperiodic tasks are triggered by interrupts. Similarly, sporadic tasks have no period, but
in contrast with aperiodic tasks, sporadic tasks have a known minimum time between releases. Typically, tasks
that perform measurements are periodic, collecting some value(s) every nth time unit. A sporadic task
typically reacts to an event/interrupt that is known to have a minimum interarrival time, for example,
an alarm or the emergency shutdown of a production robot. The minimum interarrival time can be
constrained by physical laws, or it can be enforced by some hardware mechanism. If we do not know the
minimum time between two consecutive events, we must classify the event-handling task as aperiodic.
A real-time scheduler schedules the real-time tasks sharing the same resource (e.g., a CPU or a network
link). The goal of the scheduler is to make sure that the timing requirements of these tasks are satisfied. The
scheduler decides, based on the task timing properties, which task is to execute or to use the resource.
2.4.2 Offline Schedulers
Offline schedulers, or table-driven schedulers, work as follows: the scheduler creates a schedule (the table)
before the system is started (offline). At runtime, a dispatcher follows the schedule, and makes sure that
tasks only execute in their predetermined time slots (according to the schedule). Offline schedules
are commonly used to implement the time-triggered execution paradigm (described in Section 2.2.3).
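The runtime half of this scheme is little more than a table lookup. The sketch below illustrates the idea; the schedule contents, slot boundaries, and task names are all hypothetical, and a real dispatcher would be driven by a timer interrupt rather than called with a tick count:

```python
# Minimal table-driven dispatcher sketch (hypothetical schedule).
# The offline tool emits (start_tick, task) pairs for one major cycle;
# at runtime the dispatcher only looks up which slot the clock is in.

SCHEDULE = [(0, "sample"), (2, "control"), (5, "actuate"), (8, "log")]
MAJOR_CYCLE = 10  # length of the table in ticks; the table then repeats

def task_for_tick(tick):
    """Return the task whose slot covers the given clock tick."""
    t = tick % MAJOR_CYCLE        # the schedule repeats every major cycle
    current = None
    for start, task in SCHEDULE:  # entries are sorted by start tick
        if start <= t:
            current = task        # last slot started before (or at) t
    return current
```

Because the table is fixed, the dispatcher's behavior is fully determined by the clock, which is exactly the determinism the text attributes to table-driven scheduling.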
By creating a schedule offline, complex timing constraints can be handled in a way that would be
difficult to do online. The schedule that is created will be used at runtime; therefore, the online behavior
of table-driven schedulers is very deterministic. Because of this determinism, table-driven schedulers are
the more common choice in applications with very high safety-critical demands. However, since the
schedule is created offline, flexibility is very limited, in the sense that as soon as the system changes
(owing to, e.g., added functionality or a change of hardware), a new schedule has to be created and given
to the dispatcher. Creating new schedules is nontrivial and sometimes very time consuming.
There also exist combinations of the predictable table-driven schedulers and the more flexible priority-based
schedulers, and there exist methods to convert one policy to another [13, 49, 50].
2.4.3 Online Schedulers
Scheduling policies that make their scheduling decisions during runtime are classified as online schedulers.
These schedulers make their scheduling decisions based on some task properties, for example, task priority.
Schedulers that base their scheduling decisions on task priorities are also called priority-based schedulers.
2.4.3.1 Priority-Based Schedulers
Using priority-based schedulers, flexibility is increased (compared with table-driven schedulers), since
the schedule is created online, based on the currently active tasks' constraints. Hence, priority-based
schedulers can cope with changes in workload and added functions, as long as the schedulability of the
task set is not violated. However, the exact behavior of priority-based schedulers is harder to predict.
Therefore, these schedulers are seldom used in the most safety-critical applications.
Two common priority-based scheduling policies are Fixed-Priority Scheduling (FPS) and Earliest
Deadline First (EDF). The difference between these scheduling policies is whether the priorities of the
real-time tasks are fixed or if they can change during execution (i.e., they are dynamic).
In FPS, priorities are assigned to the tasks before execution (offline). The task with the highest priority
among all tasks that are available for execution is scheduled for execution. It can be proven that some
priority assignments are better than others. For instance, for a simple task model with strictly periodic,
noninterfering tasks with deadlines equal to the period of the task, a Rate Monotonic (RM) priority
assignment has been shown by Liu and Layland [5] to be optimal. In RM, the priority is assigned based
on the period of the task: the shorter the period, the higher the assigned priority.
Using EDF, the task with the nearest (earliest) deadline among all available tasks is selected for execution.
Therefore, the priority is not fixed; it changes with time. It has been shown that for simple task models
EDF is an optimal dynamic priority scheme [5].
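For this simple task model (independent periodic tasks with deadlines equal to periods), both policies come with well-known utilization-based schedulability tests: Liu and Layland's sufficient bound U <= n(2^(1/n) - 1) for RM, and the exact bound U <= 1 for EDF on one processor. A sketch, with hypothetical task sets given as (WCET, period) pairs:

```python
# Utilization-based schedulability tests for the simple periodic task
# model (independent tasks, deadline = period). Task sets are
# hypothetical; each task is a (wcet, period) pair.

def utilization(tasks):
    return sum(c / t for c, t in tasks)

def rm_schedulable(tasks):
    """Sufficient (not necessary) Liu-Layland bound for rate monotonic."""
    n = len(tasks)
    return utilization(tasks) <= n * (2 ** (1 / n) - 1)

def edf_schedulable(tasks):
    """Exact uniprocessor test for EDF: total utilization at most 1."""
    return utilization(tasks) <= 1.0

tasks = [(1, 4), (1, 5), (2, 10)]   # U = 0.65, below the n = 3 bound (~0.78)
```

Note that a task set may fail the RM bound yet still be schedulable under RM; the bound is only sufficient, whereas the EDF test is exact for this model.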
2.4.3.2 Scheduling with Aperiodics
In order for the priority-based schedulers to cope with aperiodic tasks, different service methods have been
presented. The objective of these service methods is to give a good average response time for aperiodic
requests, while preserving the timing properties of periodic and sporadic tasks. These services are implemented
using special server tasks. In the scheduling literature many types of servers are described. Using
FPS, for instance, the Sporadic Server (SS) was presented by Sprunt et al. [51]. The SS has a fixed priority
chosen according to the RM policy. Using EDF, the Dynamic Sporadic Server (DSS) [52, 53] extends SS. Other
EDF-based schedulers are the Constant Bandwidth Server (CBS), presented by Abeni and Buttazzo [54],
and the Total Bandwidth Server (TBS) by Spuri and Buttazzo [52, 55]. Each server is characterized partly
by its unique mechanism for assigning deadlines, and partly by a set of variables used to configure the
server. Examples of such variables are bandwidth, period, and capacity.
In Section 2.6 we give examples of how timing properties of FPS can be calculated.
2.5 Real-Time Communications
Real-time communication aims at providing timely and deterministic communication of data between
distributed devices. In many cases, there are requirements to provide guarantees of the real-time properties
of these transmissions. There are real-time communication networks of different types, ranging from small
fieldbus-based control systems to large Ethernet/Internet distributed applications. There is also a growing
interest in wireless solutions.
In this section we give a brief introduction to communications in general and real-time communications
in particular. We then provide an overview of the currently most popular real-time communication systems
and protocols, both in industry and in academia.
2.5.1 Communication Techniques
Common access mechanisms used in communication networks are CSMA/CD (Carrier Sense Multiple
Access/Collision Detection), CSMA/CA (Carrier Sense Multiple Access/Collision Avoidance), TDMA
(Time Division Multiple Access), Tokens, Central Master, and Mini Slotting. These techniques are used
in both real-time and non-real-time communication, and each technique has different timing
characteristics.
In CSMA/CD, collisions between messages are detected, causing the messages involved in the collision
to be retransmitted. CSMA/CD is used, for example, in Ethernet. CSMA/CA, on the other hand,
avoids collisions and is therefore more deterministic in its behavior compared with CSMA/CD. Hence,
CSMA/CA is more suitable for hard real-time guarantees, whereas CSMA/CD can provide soft real-time
guarantees. Examples of networks that implement CSMA/CA are the Controller Area Network (CAN) and
ARINC 629.
TDMA uses time to achieve exclusive usage of the network. Messages are sent at predetermined
instants in time. Hence, the behavior of TDMA-based networks is very deterministic, making them very suitable
for providing real-time guarantees. One example of a TDMA-based real-time network is TTP.
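The core of the TDMA idea can be sketched in a few lines: each node owns a fixed slot in a cyclically repeating round, so the sender at any instant follows directly from the clock. Slot length and node names below are made up:

```python
# TDMA bus-access sketch: time is divided into fixed-length slots, and
# slots are assigned to nodes in a repeating round. The slot length and
# the node list are hypothetical.

SLOT_TICKS = 10
NODES = ["brake", "engine", "gearbox", "dashboard"]  # slot owners, in order

def sender_at(tick):
    """Return the only node allowed to transmit at the given clock tick."""
    slot = (tick // SLOT_TICKS) % len(NODES)
    return NODES[slot]
```

Since bus access is a pure function of time, no arbitration (and hence no collision) can occur, which is what makes the timing analysis of TDMA networks straightforward.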
An alternative way of eliminating collisions on the network is to use tokens. In token-based networks
only the owner of the (unique within the network) token is allowed to send messages on the network.
Once the token holder is done, or has used its allotted time, the token is passed to another node. Tokens
are used in, for example, Profibus.
It is also possible to eliminate collisions by letting one node in the network be the master node. The
master node controls the traffic on the network, and it decides which messages are allowed
to be sent and when. This approach is used in, for example, LIN and TTP/A.
Finally, mini-slotting can also be used to eliminate collisions. When using mini-slotting, as soon as
the network is idle and some node would like to transmit a message, the node has to wait for a unique
(for each node) time before sending any messages. If there are several competing nodes wanting to send
messages, a node with a longer waiting time will see that another node has already started
its transmission of a message. In such a situation the node has to wait until the network becomes idle
again. Hence, collisions are avoided. Mini-slotting can be found in, for example, FlexRay and ARINC 629.
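In effect, mini-slotting arbitration picks, among the nodes that contend when the bus goes idle, the one with the shortest unique waiting time. A deliberately simplified sketch (node names and waiting times are hypothetical):

```python
# Mini-slotting arbitration sketch. Each node has a unique waiting time
# (its minislot); when the bus becomes idle, the contender with the
# shortest wait starts transmitting first and the others back off.
# Node names and waiting times are hypothetical.

WAIT = {"node_a": 2, "node_b": 5, "node_c": 3}  # unique per-node minislots

def arbitrate(contenders):
    """Return the contender that wins the bus, or None if nobody competes."""
    if not contenders:
        return None
    return min(contenders, key=lambda n: WAIT[n])
```

Because the waiting times are unique, the winner is always well defined, so the worst-case delay for any node can be bounded analytically.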
2.5.2 Fieldbuses
Fieldbuses are a family of factory communication networks that have evolved as a response to the demand
to reduce cabling costs in factory automation systems. By moving from a situation in which every controller
has its own cables connecting the sensors to the controller (parallel interface), to a system with a set of
controllers sharing a bus (serial interface), costs could be cut and flexibility could be increased. This evolution
of technology was pushed both by the fact that the number of cables in the system increased as the
number of sensors and actuators grew, and by controllers moving from being specialized, with their
own microchip, to sharing a microprocessor with other controllers. Fieldbuses were soon ready to handle
the most demanding applications on the factory floor.
Several fieldbus technologies, usually very specialized, were developed by different companies to meet
the demands of their applications. Fieldbuses used in the automotive industry are, for example, CAN,
TT-CAN, TTP, LIN, and FlexRay. In avionics, ARINC 629 is one of the frequently used communication
standards. Profibus is widely used in automation and robotics, while in trains TCN and WorldFIP are very
popular communication technologies. We will now present each of these fieldbuses in some more detail,
outlining key features and specific properties.
2.5.2.1 Controller Area Network
The Controller Area Network (CAN) [56] was standardized by the International Organization for
Standardization (ISO) [57] in 1993. Today CAN is a widely used fieldbus, mainly in automotive systems
but also in other real-time applications, for example, medical equipment. CAN is an event-triggered
broadcast bus designed to operate at speeds of up to 1 Mbps. CAN uses a fixed-priority-based arbitration
mechanism that can provide timing guarantees using FPS-type analysis [58, 59]. An example of
this analysis will be provided in Section 2.6.3.
CAN is a collision-avoidance broadcast bus, using deterministic collision resolution to control access
to the bus (so-called CSMA/CA). The basis for the access mechanism is the electrical characteristics of a
CAN bus, allowing sending nodes to detect collisions in a nondestructive way. By monitoring the resulting
bus value during message arbitration, a node detects if there are higher-priority messages competing for
access to the bus. If this is the case, the node will stop the message transmission, and try to retransmit
the message as soon as the bus becomes idle again. Hence, the bus behaves like a priority-based queue.
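Because a dominant bit (0) electrically overwrites a recessive bit (1), bitwise arbitration always selects the numerically lowest identifier among the contenders. The following simulation sketch assumes standard 11-bit identifiers:

```python
# CAN bitwise arbitration sketch: identifiers are transmitted most-
# significant bit first; a node sending recessive (1) while reading
# dominant (0) on the bus loses and backs off. The net effect is that
# the lowest identifier wins arbitration.

ID_BITS = 11  # standard (base) CAN frame identifier length

def arbitrate(ids):
    """Simulate bitwise arbitration among competing 11-bit identifiers."""
    contenders = set(ids)
    for bit in reversed(range(ID_BITS)):               # MSB first
        bus = min((i >> bit) & 1 for i in contenders)  # dominant 0 wins the wire
        contenders = {i for i in contenders if (i >> bit) & 1 == bus}
    assert len(contenders) == 1                        # identifiers are unique
    return contenders.pop()
```

This is why assigning CAN identifiers is effectively assigning fixed priorities, and why FPS-style response-time analysis applies to the bus.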
2.5.2.2 Time-Triggered CAN
Time-triggered communication on CAN (TT-CAN) [60] is a standardized session-layer extension to the
original CAN. In TT-CAN, the exchange of messages is controlled by the temporal progression of time, and
all nodes follow a predefined static schedule. It is also possible to support original event-triggered
CAN traffic together with the time-triggered traffic. This traffic is sent in dedicated arbitration windows,
using the same arbitration mechanism as native CAN.
The static schedule is based on a time division (TDMA) scheme, where message exchanges may only
occur during specic time slots or in time windows. Synchronization of the nodes is done using either a
clock synchronization algorithm, or by periodic messages from a master node. In the latter case, all nodes
in the system synchronize with this message, which gives a reference point in the temporal domain
for the static schedule of the message transactions; that is, the master's view of time is referred to as the
network's global time.
TT-CAN adds a set of new features to the original CAN, and since it is standardized, several
semiconductor vendors manufacture TT-CAN-compliant devices.
2.5.2.3 Flexible Time-Triggered CAN
Flexible time-triggered communication on CAN (FTT-CAN) [61, 62] provides a way to schedule CAN in
a time-triggered fashion with support for event-triggered traffic as well. In FTT-CAN, time is partitioned
into Elementary Cycles (ECs), which are initiated by a special message, the Trigger Message (TM). This
message triggers the start of the EC and contains the schedule for the time-triggered traffic that shall
be sent within this EC. The schedule is calculated and sent by a master node. FTT-CAN supports both
periodic and aperiodic traffic by dividing the EC into two parts. In the first part, the asynchronous
window, the aperiodic messages are sent, and in the second part, the synchronous window, traffic is sent
in a time-triggered fashion according to the schedule delivered by the TM. FTT-CAN is still mainly an
academic communication protocol.
2.5.2.4 Time-Triggered Protocol
The Time-Triggered Protocol class C, TTP/C [10, 63], is a TDMA-based communication network
intended for truly hard real-time communication. TTP/C is available for network speeds of up to 25 Mbps.
TTP/C is part of the Time-Triggered Architecture (TTA) by Kopetz [10, 64], which is designed for safety-critical
applications. TTP/C has support for fault tolerance, clock synchronization, membership services,
fast error detection, and consistency checks. Several major automotive companies are supporting this
protocol.
For the less hard RTSs (e.g., soft RTSs), there exists a scaled-down version of TTP/C called TTP/A [10].
2.5.2.5 Local Interconnect Network
The Local Interconnect Network (LIN) [65] was developed by the LIN Consortium (including Audi, BMW,
DaimlerChrysler, Motorola, Volvo, and VW) as a low-cost alternative for small networks. LIN is cheaper
than, for example, CAN. LIN uses the UART/SCI interface hardware, and transmission speeds of
up to 20 Kbps are possible. Among the nodes in the network, one node is the master node, responsible for
synchronization of the bus. The traffic is sent in a time-triggered fashion.
2.5.2.6 FlexRay
FlexRay [66] was proposed in 1999 by several major automotive manufacturers, for example, DaimlerChrysler
and BMW, as a competitive next-generation fieldbus replacing CAN. FlexRay is a real-time
communication network that provides both synchronous and asynchronous transmissions with network
speeds of up to 10 Mbps. For the synchronous traffic FlexRay uses TDMA, providing deterministic data
transmissions with a bounded delay. For the asynchronous traffic, mini-slotting is used. Compared with
CAN, FlexRay is more suitable for the dependable application domain, as it includes support for redundant
transmission channels, bus guardians, and fast error detection and signaling.
2.5.2.7 ARINC 629
For avionic and aerospace communication systems, the ARINC 429 [67] standard and its newer ARINC 629
[67] successor are the most commonly used communication systems today. ARINC 629 supports both
periodic and sporadic communication. The bus is scheduled in bus cycles, which in turn are divided into
two parts. In the first part the periodic traffic is sent, and in the second part the sporadic traffic is sent. The
arbitration of messages is based on collision avoidance (i.e., CSMA/CA) using mini-slotting. Network
speeds are as high as 2 Mbps.
2.5.2.8 Profibus
Profibus [68] is used in process automation and robotics. There are three different versions of
Profibus: (1) Profibus-DP is optimized for speed and low cost, (2) Profibus-PA is designed for process
automation, and (3) Profibus-FMS is a general-purpose version of Profibus. Profibus provides
master/slave communication together with token mechanisms. Profibus is available with data rates up
to 12 Mbps.
2.5.2.9 Train Communication Network
The Train Communication Network (TCN) [69] is widely used in trains, and implements the IEC 61375
standard as well as the IEEE 1473 standard. TCN is composed of two networks: the Wire Train Bus (WTB)
and the Multifunction Vehicle Bus (MVB). The WTB is the network used to connect the whole train, that
is, all vehicles of the train. The network data rate is up to 1 Mbps. The MVB is the network used within one
vehicle. Here the maximum data rate is 1.5 Mbps.
Both the WTB and the MVB are scheduled in cycles called basic periods. Each basic period consists of
a periodic phase and a sporadic phase. Hence, there is support for both periodic and sporadic types of
traffic. The difference between the WTB and the MVB (apart from the data rate) is the length of the basic
periods (1 or 2 msec for the MVB and 25 msec for the WTB).
2.5.2.10 WorldFIP
WorldFIP [70] is a very popular communication network in train control systems. WorldFIP is
based on the Producer–Distributor–Consumers (PDC) communication model. Currently, network speeds
are as high as 5 Mbps. The WorldFIP protocol defines an application layer that includes PDC and
messaging services.
2.5.3 Ethernet for Real-Time Communication
In parallel with the search for the holy grail of real-time communication, Ethernet has established itself as
the de facto standard for non-real-time communication. Comparing networking solutions for automation
networks and office networks, fieldbuses were the choice for the former. At the same time, Ethernet
developed as the standard for office automation, and owing to its popularity, prices on networking
solutions dropped. Ethernet was not originally developed for real-time communication, since the original
intention with Ethernet was to maximize throughput (bandwidth). Nowadays, however, a big effort is being
made to provide real-time communication using Ethernet. The biggest challenge is to provide
real-time guarantees using standard Ethernet components.
The reason why Ethernet is not very suitable for real-time communication is its handling of collisions
on the network. Several approaches to minimize or eliminate the occurrence of collisions on Ethernet
have been proposed. The following sections present some of these proposals.
2.5.3.1 TDMA
A simple solution would be to eliminate the occurrence of collisions on the network. This has been
explored by, for example, Kopetz et al. [71], using a TDMA protocol on top of Ethernet.
2.5.3.2 Usage of Tokens
Another solution to eliminate the occurrence of collisions is the usage of tokens. Token-based solutions
[72, 73] on Ethernet also eliminate collisions, but are not compatible with standard hardware.
A token-based communication protocol is a way to provide real-time guarantees on most types
of networks. This is because such protocols are deterministic in their behavior, although a dedicated
network is required; that is, all nodes sharing the network must obey the token protocol. Examples
of token-based protocols are the Timed Token Protocol (TTP) [74] and the IEEE 802.5 Token Ring
Protocol.
2.5.3.3 Modified Collision Resolution Algorithm
A different approach is to modify the collision resolution algorithm [75, 76]. Using standard Ethernet
controllers, the modified collision resolution algorithm is nondeterministic. In order to make a deterministic
modified collision resolution algorithm, a major modification of the Ethernet controllers is
required [77].
2.5.3.4 Virtual Time and Window Protocols
Another solution for real-time communication using Ethernet is the usage of the Virtual Time CSMA
(VTCSMA) [78–80] protocol, where packets are delayed in a deterministic way in order to eliminate the
occurrence of collisions. Moreover, Window Protocols [81] use a global window (a synchronized time
interval) that also eliminates collisions. The window protocol is more dynamic and somewhat more efficient
in its behavior compared with the VTCSMA approach.
2.5.3.5 Master/Slave
A fairly straightforward way of providing real-time traffic on Ethernet is by using a master/slave
approach. As a part of the FTT framework [82], FTT Ethernet [83] is proposed as a master/multislave
protocol. At the cost of some computational overhead at each node in the system, timely delivery of
messages on Ethernet is provided.
2.5.3.6 Traffic Smoothing
The most recent work, without modifications to the hardware or networking topology (infrastructure),
is the usage of traffic smoothing. Traffic smoothing can be used to eliminate bursts of traffic [84, 85] that
have severe impact on the timely delivery of message packets on Ethernet. By keeping the network
load below a given threshold, a probabilistic guarantee of message delivery can be provided. Hence, traffic
smoothing could be a solution for soft RTSs.
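One common way to realize a traffic smoother is a token (credit) bucket in front of each station's output queue; the rate and bucket depth below are hypothetical parameters, not values from the cited work:

```python
# Traffic-smoothing sketch using a token bucket. Credits refill at a
# fixed rate and each transmitted packet consumes one credit, so a
# station's burst length is capped at the bucket depth. The rate and
# depth values are hypothetical.

class Smoother:
    def __init__(self, rate, depth):
        self.rate = rate       # credits added per tick (long-run send rate)
        self.depth = depth     # maximum stored credits (burst cap)
        self.credits = depth

    def tick(self):
        """Advance time by one tick, refilling credits up to the cap."""
        self.credits = min(self.depth, self.credits + self.rate)

    def try_send(self):
        """Return True if a packet may leave now, consuming one credit."""
        if self.credits >= 1:
            self.credits -= 1
            return True
        return False
```

Keeping every station's rate parameter below its share of the threshold load bounds the aggregate network load, which is the basis for the probabilistic delivery guarantee mentioned above.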
2.5.3.7 Black Bursts
Black burst [86] implements a collision-avoidance protocol on Ethernet. When a station wants to
submit a message, the station waits until the network is idle, i.e., no traffic is being transmitted. Then,
to avoid collisions, the transmitting station starts jamming the network. Several transmitting stations
might start jamming the network at the same time. However, each station uses a jamming signal of unique
length, always allowing a unique station to win. Winning means that once the jamming signal is over, the
network should be idle, i.e., no other stations are jamming the network. If this is the case, the message is
transmitted. Otherwise, a losing station will wait until the network is idle again, and the mechanism
starts over. Hence, no message collisions will occur on the network.
2.5.3.8 Switches
Finally, a completely different approach to achieving real-time communication using Ethernet is to change
the infrastructure. One way of doing this is to construct the Ethernet using switches to separate collision
domains. By using these switches, a collision-free network is provided. However, this requires new hardware
supporting the IEEE 802.1p standard. Therefore, it is not as attractive a solution for existing networks as,
for example, traffic smoothing.
2.5.4 Wireless Communication
There are no commercially available wireless communication protocols providing real-time guarantees.³
Two of the more commonly used wireless protocols today are IEEE 802.11 (WLAN) and Bluetooth.
However, these protocols do not provide the temporal guarantees needed for hard real-time communication.
Today, a big effort is being made (as with Ethernet) to provide real-time guarantees for wireless
communication, possibly by using either WLAN or Bluetooth.
³Bluetooth provides real-time guarantees limited to streaming voice traffic.
2.6 Analysis of RTSs
The most important property to analyze in an RTS is its temporal behavior, that is, the timeliness of the
system. The analysis should provide strong evidence that the system performs as intended at the correct
time. This section gives an overview of the basic properties that are analyzed in an RTS, and
concludes with a presentation of trends and tools in the area of RTS analysis.
2.6.1 Timing Properties
Timing analysis is a complex problem. Not only are the techniques used sometimes complicated, but
the problem itself is also elusive; for instance, what is the meaning of the term "program execution time"?
Is it the average time to execute the program, or the worst possible time, or does it mean some form of
"normal" execution time? Under what conditions does a statement regarding program execution times
apply? Is the program delayed by interrupts or higher-priority tasks? Does the time include waiting for
shared resources? And so on.
To straighten out some of these questions, and to be able to study some existing techniques for timing
analysis, we categorize timing analysis into three major types. Each type has its own purpose, benets, and
limitations. The types are listed below.
2.6.1.1 Execution Time
This refers to the execution time of a single task (or program, or function, or any other unit of single-threaded
sequential code). The result of an execution-time analysis is the time (i.e., the number of clock
cycles) the task takes to execute when executing undisturbed on a single CPU; that is, the result should
not account for interrupts, preemption, background DMA transfers, DRAM refresh delays, or any other
type of interfering background activity.
At first glance, leaving out all types of interference from the execution-time analysis would seem to give
us unrealistic results. However, the purpose of the execution-time analysis is not to deliver estimates of
real-world timing when executing the task. Instead, its role is to find out how much computing resource
is needed to execute the task. (Hence, background activities that are not related to the task should not be
accounted for.)
There are some different types of execution times that can be of interest:
Worst-case execution time (WCET). This is the worst possible execution time a task could exhibit,
or equivalently, the maximum amount of computing resources required to execute the task. The
WCET should include any possible atypical task execution such as exception handling or clean up
after abnormal task termination.
Best-case execution time (BCET). During some types of real-time analysis, not only the WCET is
used but, as we will describe later, knowledge about the BCET of tasks is also useful.
Average execution time (AET). The AET can be useful in calculating throughput figures for a
system. However, for most RTS analyses the AET is of less importance, simply because a reasonable
approximation of the average case is easy to obtain during testing (where, typically, the average
system behavior is studied). Also, knowing only the average, without any other statistical
parameters such as the standard deviation or distribution function, makes statistical analysis difficult.
For analysis purposes a more pessimistic metric, such as the 95% quantile, would be more useful.
However, analytical techniques using statistical metrics of execution time are scarce and not very
well developed.
2.6.1.2 Response Time
The response time of a task is the time it takes from the invocation to the completion of the task. In other
words, it is the time from when the task is first placed in the OS's ready queue to the time when it is removed
from the running state and placed in the idle or sleeping state.
Typically, for analysis purposes it is assumed that a task does not voluntarily suspend itself during
its execution. That is, the task may not call primitives such as sleep() or delay(). However,
involuntary suspension, such as blocking on shared resources, is allowed. That is, primitives such as
get_semaphore() and lock_database_tuple() are allowed. When a program voluntarily
suspends itself, that program should be broken down into two (or more) analysis tasks.
The response time is typically a system-level property, in that it includes interference from other, unrelated
tasks and parts of the system. The response time also includes delays caused by contention on
shared resources. Hence, the response time is only meaningful when considering a complete system, or, in
distributed systems, a complete node.
2.6.1.3 End-to-End Delay
The execution time and response time described above are useful concepts, since they are relatively easy to
understand and have well-defined scopes. However, when trying to establish the temporal correctness
of a system, knowing the WCET and/or the response times of tasks is often not enough. Typically, the
correctness criterion is stated using end-to-end latency timing requirements, for instance, an upper bound
on the delay between the input of a signal and the output of a response.
In a given implementation there may be a chain of events taking place between the input of a signal and
the output of a response. For instance, one task may be in charge of reading the input and another task
of generating the response, and the two tasks may have to exchange messages on a communications link
before the response can be generated. The end-to-end timing denotes the timing of externally visible events.
2.6.1.4 Jitter
The term jitter is used as a metric for variability in time. For instance, the jitter in execution time of a task
is the difference between the task's BCET and WCET. Similarly, the response-time jitter of a task is the
difference between its best-case response time and its worst-case response time. Often, control algorithms
have requirements that the jitter of the output should be limited. Hence, jitter is sometimes a metric
equally as important as the end-to-end delay.
Input to the system can also have jitter. For instance, an interrupt which is expected to be periodic may
have jitter (owing to some imperfection in the process generating the interrupt). In this case the jitter
value is used as a bound on the maximum deviation from the ideal period of the interrupt. Figure 2.3
illustrates the relation between the period and the jitter for this example.
Note that jitter should not accumulate over time. For our example, even though two successive interrupts
could arrive closer together than one period, in the long run the average interrupt interarrival time will be
that of the period.
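These two uses of jitter can be written down directly; all numeric parameters in this sketch are hypothetical:

```python
# Jitter sketches (hypothetical numbers, e.g., in microseconds).
# Execution-time jitter is WCET - BCET; response-time jitter is the
# worst-case minus the best-case response time.

def jitter(best, worst):
    assert worst >= best
    return worst - best

# For a periodic input with release jitter J and period T, the nth
# arrival (counting from 0) lies in [n*T, n*T + J] relative to the
# first ideal release; the deviation is bounded and does not accumulate.
def arrival_window(n, period, release_jitter):
    return n * period, n * period + release_jitter

execution_jitter = jitter(120, 200)   # BCET = 120, WCET = 200 -> 80
```

Note that in arrival_window the lower bound grows exactly with the period, which captures the non-accumulation property: every arrival is within J of its ideal instant, no matter how many periods have elapsed.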
In the above list of types of time, we only mentioned the time to execute programs. However, in many
RTSs other timing properties may also exist. For example, delays on communication networks and on other
resources, such as hard disk drives, may need to be analyzed. The times introduced above
can all be mapped onto different types of resources; for instance, the WCET of a task corresponds to
the maximum size of a message to be transmitted, and the response time of a message is defined analogously
to the response time of a task.
FIGURE 2.3 Jitter used as a bound on variability in periodicity.
2.6.2 Methods for Timing Analysis
When analyzing hard RTSs it is essential that the estimates obtained during timing analysis are safe.
An estimate is considered safe if it is guaranteed not to be an underestimation of the actual worst-case
time. It is also important that the estimate is tight, meaning that the estimated time is close to the actual
worst-case time.
For the previously defined types of timings (Section 2.6.1), the available analysis methods are presented
in the following sections.
2.6.2.1 Execution-Time Estimation
For real-time tasks the WCET is the most important execution-time measure to obtain. Sadly, however, it
is also often the most difficult measure to obtain.
Methods to obtain the WCET of a task can be divided into two categories: (1) static analysis and
(2) dynamic analysis. Dynamic analysis is essentially equivalent to testing (i.e., executing the task on the
target hardware) and has all the drawbacks/problems that testing exhibits (such as being tedious and error
prone). One major problem with dynamic analysis is that it does not produce safe results. In fact, the
result can never exceed the true WCET, and it is very difficult to make sure that the estimated WCET
really is the true WCET.
Static analysis, on the other hand, can give guaranteed safe results. Static analysis is performed by
analyzing the code (source and/or object code is used) and basically counting the number of clock
cycles that the task may use to execute (in the worst possible case). Static analysis uses models of the
hardware to predict the execution time of each instruction. Hence, for modern hardware it may be very
difficult to produce static analyzers that give good results. One source of pessimism in the analysis
(i.e., overestimation) is hardware caches: whenever an instruction or data item cannot be guaranteed to
reside in the cache, a static analyzer must assume a cache miss. And since modeling the exact state of
caches (sometimes of multiple levels), branch predictors, etc. is very difficult and time consuming, few
tools exist that give adequate results for advanced architectures. Also, performing a program-flow and data
analysis that exactly calculates, for example, the number of times a loop iterates or the input parameters
of procedures is difficult.
Methods for good hardware and software modeling do exist in the research community; however,
combining these methods into good-quality tools has proven tedious.
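As an illustration of the counting a static analyzer performs, here is a toy WCET estimator over a block-structured program; the timing model (cycle counts, miss penalty) and program representation are entirely hypothetical:

```python
# Toy static WCET estimator. The instruction timing model is a made-up
# example: any memory access that cannot be proven to hit the cache is
# pessimistically charged the miss cost.
CYCLES = {"alu": 1, "load_hit": 1, "load_unknown": 10}  # assumed model

def wcet(block):
    """block: ('instr', kind) | ('seq', [blocks]) | ('loop', bound, body)"""
    kind = block[0]
    if kind == "instr":
        return CYCLES[block[1]]
    if kind == "seq":
        return sum(wcet(b) for b in block[1])
    if kind == "loop":          # the bound must come from flow analysis
        return block[1] * wcet(block[2])
    raise ValueError(kind)

# A loop of 100 iterations: one ALU op plus one load whose cache state is
# unknown, so the analyzer must assume a miss on every iteration.
prog = ("seq", [("instr", "alu"),
                ("loop", 100, ("seq", [("instr", "alu"),
                                       ("instr", "load_unknown")]))])
print(wcet(prog))  # 1 + 100 * (1 + 10) = 1101
```

The pessimism discussed above shows up directly: if the load could be proven to hit the cache, the bound would drop from 1101 to 201 cycles.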
2.6.2.2 Schedulability Analysis
The goal of schedulability analysis is to determine whether or not a system is schedulable. A system is deemed
schedulable if it is guaranteed that all task deadlines will always be met. For statically scheduled (table-driven)
systems, response times are trivially given by the static schedule. However, for
dynamically scheduled systems (such as fixed-priority or deadline scheduling) more advanced techniques
have to be used.
There are two main classes of schedulability analysis techniques: (1) response-time analysis and
(2) utilization analysis. As the name suggests, a response-time analysis calculates a (safe) estimate of
the worst-case response time of a task. That estimate can then be compared with the deadline of the task,
and if it does not exceed the deadline the task is schedulable. Utilization analysis, in contrast, does not
directly derive the response times of tasks; rather, it gives a boolean result for each task telling whether
or not the task is schedulable. This result is based on the fraction of CPU utilization of a relevant
subset of the tasks, hence the term utilization analysis.
Both analyses are based on similar types of task models. However, the task models used
for analysis are typically not the task models provided by commercial RTOSs. This problem can be resolved by
mapping one or more OS tasks onto one or more analysis tasks. However, this mapping has to be performed
manually and requires an understanding of the limitations of the analysis task model and the analysis
technique used.
2.6.2.3 End-to-End Delay Estimation
The typical way to obtain end-to-end delay estimations is to calculate the response time for each
task/message in the end-to-end chain and to sum these response times to obtain an end-to-end
estimate. When using a utilization-based analysis technique (in which no response times are calculated),
one has to resort to using the task/message deadlines as safe upper bounds on the response times.
However, when analyzing distributed RTSs, it may not be possible to calculate all response times in one
pass. The reason for this is that delays on one node will lead to jitter on another node, and this jitter
may in turn affect the response times on that node. Since jitter can propagate in several steps between
nodes, in both directions, there may not exist a right order in which to analyze the nodes. (If A sends a message
to B, and B sends a message to A, which node should one analyze first?) Solutions to this type of problem
are called holistic schedulability analysis methods (since they consider the whole system). The standard
method for holistic response-time analysis is to repeatedly calculate response times for each node (and
update the jitter values in the nodes affected by the node just analyzed) until the response times do not change
(i.e., a fixed point is reached).
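The fixed-point structure of such a holistic analysis can be sketched as follows; the two-node model is deliberately toy-sized, and the per-node "analysis" is a hypothetical stand-in for a real jitter-aware response-time analysis:

```python
import math

# Toy holistic analysis: two nodes exchange messages, so the response
# time on each node depends on jitter induced by the other node. The
# per-node "analysis" is a hypothetical stand-in: a local execution cost,
# 3 units of blocking, and one extra 2-unit preemption per 10 units of
# inherited release jitter.
C = {"A": 5, "B": 7}

def response_time(node, release_jitter):
    return C[node] + 3 + 2 * math.ceil(release_jitter / 10)

jitter = {"A": 0, "B": 0}
while True:
    resp = {n: response_time(n, jitter[n]) for n in ("A", "B")}
    # Delay variability on one node becomes release jitter on the other.
    new_jitter = {"A": resp["B"] - C["B"], "B": resp["A"] - C["A"]}
    if new_jitter == jitter:   # fixed point: response times are stable
        break
    jitter = new_jitter

print(resp)
```

The loop repeats the per-node analysis and propagates the resulting jitter until nothing changes, exactly the iterate-until-fixed-point scheme described above.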
2.6.2.4 Jitter Estimation
To calculate the jitter one must not only perform a worst-case analysis (of, for instance, response time or
end-to-end delay) but also a best-case analysis.
However, even though best-case analysis techniques are often conceptually similar to worst-case analysis
techniques, little attention has been paid to best-case analysis. One reason for not spending too much
time on best-case analysis is that it is quite easy to make a conservative estimate of the best case: the best-case
time is never less than zero (0). Hence, many tools simply assume that the BCET (for instance)
is zero, whereas great efforts are spent analyzing the WCET.
However, it is important to have tight estimates of the jitter, and to keep the jitter as low as possible.
It has been shown that the number of execution paths a multitasking RTS can take increases dramatically
if jitter increases [87]. Unless the number of possible execution paths is kept as low as possible, it becomes
very difficult to achieve good coverage during testing.
2.6.3 Example of Analysis
In this section we give simple examples of schedulability analysis. We show a very simple example of how
a set of tasks running on a single CPU can be analyzed, and we also give an example of how the response
times for a set of messages sent on a CAN bus can be calculated.
2.6.3.1 Analysis of Tasks
This example is based on some 30-year-old task models and is intended to give the reader a feeling for how
these types of analysis work. Today's methods allow for far richer and more realistic task models, with a
resulting increase in the complexity of the equations used (hence they are not suitable for our example).
In the first example we will analyze the small task set described in Table 2.2, where T, C, and D denote
the task's period, WCET, and deadline, respectively. In this example T = D for all tasks, and priorities have
been assigned in RM order, that is, the highest rate gives the highest priority.
TABLE 2.2 Example Task Set for Analysis

Task    T     C     D     Prio
X       30    10    30    High
Y       40    10    40    Medium
Z       52    10    52    Low
For the task set in Table 2.2 the original analysis techniques of Liu and Layland [5] and of Joseph and Pandya
[88] are applicable, and we can perform both utilization-based and response-time based schedulability
analysis.
We start with the utilization-based analysis; for this task model, Liu and Layland's result is that a task set
of n tasks is schedulable if its total utilization, U_tot, is bounded by the following equation:

    U_tot ≤ n(2^(1/n) − 1)
Table 2.3 shows the utilization calculations performed for the schedulability analysis. For our example
task set, n = 3 and the bound is approximately 0.78. However, the utilization (U_tot = Σ_{i=1}^{n} C_i/T_i) for
our task set is 0.81, which exceeds the bound. Hence, the task set fails the RM test and cannot be deemed
schedulable.
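Liu and Layland's bound test is easy to mechanize; the sketch below applies it to a hypothetical task set (not the one from Table 2.2):

```python
# Rate-monotonic utilization test (Liu and Layland). tasks is a list of
# (T, C) pairs; returns the total utilization, the bound, and the verdict.
def rm_utilization_test(tasks):
    n = len(tasks)
    u_tot = sum(c / t for t, c in tasks)
    bound = n * (2 ** (1 / n) - 1)   # n(2^(1/n) - 1)
    return u_tot, bound, u_tot <= bound

# Hypothetical task set, chosen to pass the test.
u, bound, ok = rm_utilization_test([(20, 5), (40, 10), (80, 10)])
print(round(u, 3), round(bound, 3), ok)
```

Note that the test is sufficient but not necessary: a task set whose utilization exceeds the bound may still be schedulable, which is exactly the situation the response-time analysis of the next paragraphs resolves.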
Joseph and Pandya's response-time analysis allows us to calculate the worst-case response time, R_i, for each
task i in our example (Table 2.2). This is done using the following formula:

    R_i = C_i + Σ_{j ∈ hp(i)} ⌈R_i / T_j⌉ C_j        (2.1)

where hp(i) denotes the set of tasks with priority higher than that of task i.
The observant reader may have noticed that equation 2.1 is not in closed form, in that R_i is not
isolated on the left-hand side of the equality. As a matter of fact, R_i cannot be isolated on the left-hand
side of the equality; instead equation 2.1 has to be solved using fixed-point iteration. This is done with the
recursive formula in equation 2.2, starting with R_i^0 = 0 and terminating when a fixed point has been reached
(i.e., when R_i^(m+1) = R_i^(m)):

    R_i^(m+1) = C_i + Σ_{j ∈ hp(i)} ⌈R_i^(m) / T_j⌉ C_j        (2.2)
For our example task set, Table 2.4 shows the results of calculating equation 2.1. From the table we can
conclude that no deadlines will be missed and that the system is schedulable.
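The fixed-point iteration of equation 2.2 can be written as a short loop; the sketch below uses a hypothetical task set (not Table 2.2), listed highest priority first:

```python
import math

# Fixed-point response-time iteration (equation 2.2). tasks is a list of
# (T, C) pairs ordered highest priority first; D = T is assumed.
def response_times(tasks):
    results = []
    for i, (t_i, c_i) in enumerate(tasks):
        r, prev = c_i, 0
        while r != prev:                 # iterate until the fixed point
            prev = r
            r = c_i + sum(math.ceil(prev / t_j) * c_j
                          for t_j, c_j in tasks[:i])  # hp(i) interference
        results.append(r)
    return results

# Hypothetical task set with rate-monotonic priorities.
print(response_times([(10, 2), (15, 3), (35, 10)]))  # [2, 5, 20]
```

A production implementation would also abort the iteration once r exceeds the task's deadline, since the recurrence need not converge for overloaded systems.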
Remarks: As we could see for our example task set in Table 2.2, the utilization-based test could not deem
the task set schedulable, whereas the response-time based test could. This situation is symptomatic of the
relation between utilization-based and response-time based schedulability tests. That is, the response-time
based tests find more task sets schedulable than the utilization-based tests.
TABLE 2.3 Result of RM Test

Task     T     C     D     Prio      U
X        30    10    30    High      0.33
Y        40    10    40    Medium    0.25
Z        52    10    52    Low       0.23
Total                                0.81
Bound                                0.78
TABLE 2.4 Result of Response-Time Analysis for Tasks

Task     T     C     D     Prio      R     R ≤ D
X        30    10    30    High      10    Yes
Y        40    10    40    Medium    20    Yes
Z        52    10    52    Low       52    Yes
TABLE 2.5 Example CAN-Message Set

Message    T       S    D      Id
X          350     8    300    00010
Y          500     6    400    00100
Z          1000    5    800    00110
TABLE 2.6 Result of Response-Time Analysis for CAN

Message    T       S    D      Id       Prio      C      w      R      R ≤ D
X          350     8    300    00010    High      130    130    260    Yes
Y          500     6    400    00100    Medium    111    260    371    Yes
Z          1000    5    800    00110    Low       102    612    714    Yes
However, as also shown by the example, the response-time based test needs to perform more calculations
than the utilization-based test. For this simple example the extra computational complexity of the
response-time test is insignificant. However, when using modern task models (that are capable of modeling
realistic systems), the computational complexity of response-time based tests is significant. Unfortunately,
for these advanced models, utilization-based tests are not always available.
2.6.3.2 Analysis of Messages
In our second example we show how to calculate the worst-case response times for a set of periodic
messages sent over the CAN bus (CAN is described in Section 2.5.2). We use a response-time analysis
technique similar to the one we used when we analyzed the task set in Table 2.2. In this example our
message set is given in Table 2.5, where T, S, D, and Id denote the message's period, data size (in bytes),
deadline, and CAN identifier, respectively. (The time unit used in this example is the bit-time, that is, the
time it takes to send one bit. For a 1 Mbit/sec CAN bus this means that 1 time unit is 10^-6 sec.)
Before we attack the problem of calculating response times, we extend Table 2.5 with two columns.
First, we need the priority of each message; in CAN this is given by the identifier: the lower the numerical
value, the higher the priority. Second, we need to know the worst-case transmission time of each message.
The transmission time is given partly by the message data size, but we also need to add time for the frame
header and for any stuff bits.⁴ The formula to calculate the transmission time, C_i, for a message i containing
S_i bytes of payload data is given below:

    C_i = 8S_i + 47 + ⌊(34 + 8S_i - 1) / 4⌋
In Table 2.6 the two columns Prio and C show the priority assignment and the transmission times for
our example message set.
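As a quick sanity check, the transmission-time formula can be transcribed directly into code (a sketch; integer division implements the floor):

```python
# Worst-case CAN frame transmission time in bit-times: 8 bits per payload
# byte, 47 bits of frame overhead, plus the worst-case number of stuff
# bits, floor((34 + 8*S - 1) / 4), per the formula above.
def can_c(s_bytes):
    return 8 * s_bytes + 47 + (34 + 8 * s_bytes - 1) // 4

print(can_c(0), can_c(8))  # minimal (0-byte) and maximal (8-byte) frames
```

The value for a full 8-byte payload, 135 bit-times, is the largest possible frame and reappears below as the bound on the blocking time B_i.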
Now we have all the data needed to perform the response-time analysis. However, since CAN is a
nonpreemptive resource, the structure of the equation is slightly different from that of equation 2.1, which we
used for the analysis of tasks. The response-time equation for CAN is given in equation 2.3.
    R_i = w_i + C_i

    w_i = B_i + Σ_{j ∈ hp(i)} ⌈(w_i + 1) / T_j⌉ C_j        (2.3)
⁴ CAN adds stuff bits, if necessary, to avoid the two reserved bit patterns 000000 and 111111. These stuff bits are
never seen by the CAN user but have to be accounted for in the timing analysis.
In equation 2.3, B_i denotes the blocking time originating from a lower-priority message already in transmission
when message i enters arbitration (B_i ≤ 135, which is the transmission time of the largest possible message), and hp(i) denotes the
set of messages with higher priority than message i. Note that (similar to equation 2.1) w_i is not isolated
on the left-hand side of the equation, and its value has to be calculated using fixed-point iteration (compare
with equation 2.2).
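Equation 2.3 can be turned into a small solver; the sketch below is a minimal illustration applied to a hypothetical message set. Here the blocking term is taken as the longest lower-priority frame within the set, capped at the 135-bit maximum; a more conservative analysis could instead charge every message the full 135 bits:

```python
import math

# Nonpreemptive response-time iteration for CAN (equation 2.3).
# messages: list of (T, C) in bit-times, highest priority first.
def can_response_times(messages, max_frame=135):
    results = []
    for i, (t_i, c_i) in enumerate(messages):
        # Blocking: longest lower-priority frame, bounded by max_frame.
        lower = [c for _, c in messages[i + 1:]]
        b = min(max(lower, default=0), max_frame)
        w, prev = b, None
        while w != prev:                   # fixed-point iteration for w_i
            prev = w
            w = b + sum(math.ceil((prev + 1) / t_j) * c_j
                        for t_j, c_j in messages[:i])
        results.append(w + c_i)            # R_i = w_i + C_i
    return results

# Hypothetical message set (periods and transmission times in bit-times).
print(can_response_times([(400, 135), (600, 115), (1200, 105)]))
```

Note the (w_i + 1) in the interference term: unlike a preemptable task, a frame that has started transmission cannot be preempted, so only higher-priority frames queued strictly before it starts can delay it.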
Applying equation 2.3 we can now calculate the worst-case response time for our example messages.
In Table 2.6 the two columns w and R show the results of the calculations, and the final column shows
the schedulability verdict for each message.
As we can see from Table 2.6, our example message set is schedulable, meaning that the messages will
always be transmitted before their deadlines. Note that this analysis was made assuming that there will
not be any retransmissions of broken messages. Normally, CAN automatically retransmits any message
that has been broken owing to interference on the bus. To account for such automatic retransmissions, an
error model needs to be adopted and the response-time equation adjusted accordingly; see, for example,
Reference 59.
2.6.4 Trends and Tools
As discussed earlier, and as also illustrated by our example in Table 2.2, there is a mismatch between the
analytical task models and the task models provided by commonly used RTOSs. One of the basic problems
is that there is no one-to-one mapping between analysis tasks and RTOS tasks. In fact, for many systems
there is an N-to-N mapping between the task types. For instance, an interrupt handler may have to be
modeled as several different analysis tasks (one analysis task for each type of interrupt it handles), and
one OS task may have to be modeled as several analysis tasks (for instance, one analysis task per call to
the sleep() primitive).
Also, current schedulability analysis techniques cannot adequately model types of task synchronization
other than locking/blocking on shared resources. Abstractions such as message queues are difficult to
include in the schedulability analysis.⁵ Furthermore, tools to estimate the WCET are also scarce. Currently
only two tools that give safe WCET estimates are commercially available [90, 91].
These problems have led to a low penetration of schedulability analysis in industrial software-
development processes. However, in isolated domains, such as real-time networks, some commercial tools
based on real-time analysis do exist. For instance, Volcano [92, 93] provides tools for the CAN
bus that allow system designers to specify signals on an abstract level (giving signal attributes such as size,
period, and deadline) and automatically derive a mapping of signals to CAN messages in which all deadlines
are guaranteed to be met.
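To illustrate the idea of deriving frames from abstract signal specifications, here is a first-fit packing sketch; this is NOT Volcano's actual algorithm, only a simplified illustration with invented signal names:

```python
# First-fit sketch: signals with the same period are packed together into
# shared CAN frames of at most 8 data bytes. Real tools also consider
# deadlines, bus load, and priority assignment; this ignores all of that.
def pack_signals(signals):
    """signals: list of (name, size_bytes, period). Returns frame dicts."""
    frames = []
    for name, size, period in sorted(signals, key=lambda s: s[2]):
        for f in frames:
            if f["period"] == period and f["used"] + size <= 8:
                f["signals"].append(name)
                f["used"] += size
                break
        else:   # no existing frame fits: open a new one
            frames.append({"period": period, "signals": [name], "used": size})
    return frames

signals = [("speed", 2, 10), ("rpm", 2, 10), ("temp", 1, 100),
           ("torque", 4, 10), ("oil", 1, 100), ("fuel", 4, 10)]
for f in pack_signals(signals):
    print(f["period"], f["signals"], f["used"])
```

Packing related signals into shared frames reduces the number of messages, and hence the frame-overhead bits, competing for the bus.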
On the software side, tools provided by, for instance, TimeSys [94], Arcticus Systems [13], and TTTech
[10] can provide system development environments with timing analysis as an integrated part of the tool
suite. However, all these tools require that the software-development process be under the complete control
of the respective tool. This requirement has limited the use of these tools.
The widespread use of UML [22] in software design has led to some specialized UML products for
real-time engineering [23, 24]. However, these products, as of today, do not support timing analysis of the
designed systems. There is, however, recent work within the OMG that specifies a profile, Schedulability,
Performance, and Time (SPT) [95], which allows the specification of both timing properties and requirements
in a standardized way. This will in turn lead to products that can analyze UML models conforming to the
SPT profile.
The SPT profile has, however, not been received without criticism. Critique has mainly come from
researchers active in the timing-analysis field, who claim both that the profile is not precise enough and that
some important concepts are missing. For instance, the Universidad de Cantabria has instead developed
⁵ Techniques to handle more advanced models include timed logic and model checking. However, the computational
and conceptual complexity of these techniques has limited their industrial impact, although there are examples of
commercial tools for this type of verification, for example, Reference 89.
the MAST UML profile and an associated MAST tool for analyzing MAST UML models [96, 97].
MAST allows the modeling of advanced timing properties and requirements, and the tool also provides
state-of-the-art timing analysis techniques.
2.7 Component-Based Design of RTS
Component-Based Design (CBD) is a current trend in software engineering. In the desktop area, component
technologies like COM [15], .NET [16], and Java Beans [17] have gained widespread use. These technologies
give substantial benefits, in terms of reduced development time and software complexity, when designing
complex and/or distributed systems. However, for RTSs these, and other, desktop-oriented component
technologies do not suffice.
As stated before, the main challenge of designing RTSs is the need to consider issues that do not typically
apply to general-purpose computing systems. These issues include:
Constraints on extra-functional properties, such as timing, QoS, and dependability.
The need to statically predict (and verify) these extra-functional properties.
Scarce resources, including processing power, memory, and communication bandwidth.
In the commercially available component technologies of today, there is little or no support for these
issues. On the academic scene, too, there are no readily available solutions that satisfactorily handle all of these
issues.
In the remainder of this chapter we will discuss how these issues can be addressed in the context of
CBD. In doing so, we also highlight the challenges in designing a CBD process and component technology
for the development of RTSs.
2.7.1 Timing Properties and CBD
In general, for systems where timing is crucial there will necessarily be at least some global timing
requirements that have to be met. If the system is built from components, this will imply the need for
timing parameters/properties of the components and some proof that the global timing requirements
are met.
In Section 2.6 we introduced the following four types of timing properties:
execution time
response time
end-to-end delay
jitter.
So, how are these related to the use of a CBD methodology?
2.7.1.1 Execution Time
For a component used in a real-time context, an execution-time measure will have to be derived. This
is, as discussed in Section 2.6, not an easy or satisfactorily solved problem. Furthermore, since execution
time is inherently dependent on the target hardware, and since reuse is the primary motivation for CBD,
it is highly desirable that execution times for several targets be available. (Alternatively, that the
execution time for new hardware platforms be automatically derivable.)
The nature of the applied component model may also make execution-time estimation more or less
complex. Consider, for instance, a client-server oriented component model with a server component that
provides services of different types, as illustrated in Figure 2.4(a). What does execution time mean for
such a component? Clearly, a single execution time is not appropriate; rather, the analysis will require a set
of execution times related to servicing different requests. On the other hand, for a simple port-based-object
component model [21] in which components are connected in sequence to form periodically executing
transactions (illustrated in Figure 2.4[b]), it could be possible to use a single execution-time measure,
FIGURE 2.4 (a) A complex server component, providing multiple services to multiple users, and (b) a simple chain
of components implementing a single thread of control.
FIGURE 2.5 Tasks and components: (a) one-to-one correspondence, (b) one-to-many correspondence, (c) many-
to-one correspondence, (b + c) many-to-many correspondence, and (d) irregular correspondence.
corresponding to the execution time required for reading the values at the input ports, performing the
computation, and writing values to the output ports.
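The contrast can be sketched in code; the pipeline below is a hypothetical port-based-object style chain (all names invented for illustration), where each component carries one WCET and the transaction's WCET is simply the sum:

```python
# Sketch of a port-based-object style transaction: each component reads
# its input port, computes, and writes its output port, so a single WCET
# per component suffices. All names and numbers are hypothetical.
class Component:
    def __init__(self, name, wcet, step):
        self.name, self.wcet, self.step = name, wcet, step

    def run(self, value):
        return self.step(value)

def chain_wcet(chain):
    # WCET of the whole transaction: components execute in sequence.
    return sum(c.wcet for c in chain)

def run_transaction(chain, sample):
    for c in chain:
        sample = c.run(sample)
    return sample

sensor_filter = Component("filter", 20, lambda v: v * 0.9)
controller    = Component("ctrl",   50, lambda v: 100 - v)
actuator_out  = Component("out",    10, lambda v: max(0, min(100, v)))

pipeline = [sensor_filter, controller, actuator_out]
print(chain_wcet(pipeline))            # usable as C in response-time analysis
print(run_transaction(pipeline, 50.0))
```

The single summed WCET is exactly the C value a schedulability analysis such as equation 2.1 expects, which is why this style of model composes so well with timing analysis.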
2.7.1.2 Response Time
Response times denote the time from invocation to completion of tasks, and response-time analysis is the
activity of statically deriving response-time estimates.
The first question to ask from a CBD perspective is: what is the relation between a task and a
component?
This is obviously highly dependent on the component model used. As illustrated in Figure 2.5(a), there
could be a one-to-one mapping between components and tasks, but in general several components could
be implemented in one task (Figure 2.5[b]) or one component could be implemented by several tasks
(Figure 2.5[c]); hence there is a many-to-many relation between components and tasks. In principle,
there could even be a more irregular correspondence between components and tasks, as illustrated in
Figure 2.5(d). Furthermore, in a distributed system there could be a many-to-many relation between
components and processing nodes, making the situation even more complicated.
Once we have sorted out the relation between tasks and components, we can calculate the response times
of tasks, given that we have an appropriate analysis method for the execution paradigm used and that
relevant execution-time measures are available. However, relating these response times to components
and to application-level timing requirements may not be straightforward; this is an issue for the
subsequent end-to-end analysis.
Another issue with respect to response times is how to handle communication delays in distributed
systems. In essence there are two ways to model the communication, as depicted in Figure 2.6.
In Figure 2.6(a) the network is abstracted away and the intercomponent communication is handled by
the framework. In this case, response-time analysis is made more complicated, since it must account for
different delays in intercomponent communication depending on the physical location of components.
FIGURE 2.6 Components and communication delays: (a) communication delays can be part of the intercomponent
communication properties, and (b) communication delays can be timing properties of components.
In Figure 2.6(b), on the other hand, the network is modeled as a component itself, and network delays can
be modeled as delays in any other component (and intercomponent communication can be considered
instantaneous).
However, the choice of how to model network delays also has an impact on the software-engineering
aspects of the component model. In Figure 2.6(a), the communication is completely hidden from
the components (and the software engineers), giving optimizing tools many degrees of freedom
with respect to component allocation, signal mapping, and scheduling-parameter selection. In
Figure 2.6(b), on the other hand, the communication is explicitly visible to the components (and the software engineers),
putting a larger burden on the software engineers to manually optimize the system.
2.7.1.3 End-to-End Delay
End-to-end delays are application-level timing requirements relating the occurrence in time of one event
to the occurrence of another event. As pointed out earlier, how to relate such requirements to the lower-level
timing properties of components is highly dependent on both the component model and the timing-analysis
model.
When designing RTSs using CBD, the component structure gives excellent information about the points
of interaction between the RTS and its environment. Since end-to-end delays concern timing estimates
and timing requirements on such interactions, CBD gives a natural way of stating timing requirements in
terms of signals received or generated. (In traditional RTS development, the reception and generation of
signals are embedded into the code of tasks and are not externally visible, making it difficult to relate
response times of tasks to end-to-end requirements.)
2.7.1.4 Jitter
Jitter is an important timing parameter that is related to execution time and that will affect response times
and end-to-end delays. There may also be specific jitter requirements. Jitter has the same relation to CBD
as does end-to-end delay.
2.7.1.5 Summary of Timing and CBD
As described earlier, there is no single solution for how to apply CBD to RTSs. In some cases, timing
analysis is made more complicated when using CBD, for example, when using client-server oriented
component models, whereas in other cases CBD actually helps timing analysis, for example, by facilitating
the identification of interfaces/events associated with end-to-end requirements.
Further, the characteristics of the component model have great impact on the analyzability of component-based
RTSs. For instance, interaction patterns such as client-server do not map well to established analysis
methods and make analysis difficult, whereas pipes-and-filter based patterns (such as the port-based-objects
component model [21]) map very well to existing analysis methods and allow for tight analysis of
timing behavior. Also, the execution semantics of the component model has an impact on the analyzability.
The execution semantics restrict how components can be mapped to tasks; for example, in the Corba
Component Model [14] each component is assumed to have its own thread of execution, making it difficult
to map multiple components to a single thread. On the other hand, the simple execution semantics of
pipes-and-lter based models allow for automatic mapping of multiple components to a single task,
simplifying timing analysis and making better use of system resources.
2.7.2 Real-Time Operating Systems
There are two important aspects regarding CBD and RTOSs: (1) the RTOS may itself be component based,
and (2) the RTOS may support or provide a framework for CBD.
2.7.2.1 Component-Based RTOSs
Most RTOSs allow for offline configuration, where the engineer can choose to include or exclude large
parts of the functionality. For instance, which communication protocols to include is typically configurable.
However, this type of configurability is not the same as the RTOS being component based (even though
the unit of configuration is often referred to as a component in marketing material). For an RTOS to be
component based, the components are required to conform to a component model, which is typically
not the case in most configurable RTOSs.
There has been some research on component-based RTOSs, for instance the research RTOS VEST
[18]. In VEST, schedulers, queue managers, and memory management are built up out of components.
Furthermore, special emphasis has been put on predictability and analyzability. However, VEST is currently
still at the research stage and has not been released to the public. Publicly available, however, is the eCos
RTOS [98, 99], which provides a component-based configuration tool. Using eCos components the RTOS
can be configured by the user, and third-party extensions can be provided.
2.7.2.2 RTOSs that Support CBD
Looking at component models in general, and those intended for embedded systems in particular,
we observe that they are all supported by some runtime executive or simple RTOS. Many component
technologies provide frameworks that are independent of the underlying RTOS; hence, any RTOS can be
used to support CBD via such an RTOS-independent framework. Examples include Corba's ORB [100]
and the framework for PECOS [20, 101].
Other component technologies have a tighter coupling between the RTOS and the component framework,
in that the RTOS explicitly supports the component model by providing the framework (or part of it).
Such technologies include:
Koala [19] is a component model and architectural description language from Philips. Koala
provides high-level APIs to the computing and audio/video hardware. The computing layer
provides a simple proprietary real-time kernel with priority-driven preemptive scheduling. Special
techniques for thread sharing are used to limit the number of concurrent threads.
The Chimera RTOS provides an execution framework for the port-based-object component model
[21], intended for the development of sensor-based control systems, specifically reconfigurable robotics
applications. Chimera has multiprocessor support, and handles both static and dynamic scheduling,
the latter EDF based.
Rubus is an RTOS that supports a component model in which behaviors are defined by
sequences of port-based objects [13]. The Rubus kernel supports predictable execution of statically
scheduled periodic tasks (termed red tasks in Rubus) and dynamically scheduled, fixed-priority preemptive
tasks (termed blue tasks). In addition, support for handling interrupts is provided. In Rubus,
support is provided for transforming sets of components into sequential chains of executable code.
Each such chain is implemented as a single task. Support is also provided for analysis of response
times and end-to-end deadlines, based on execution-time measures that have to be provided; that
is, execution-time analysis is not provided by the framework.
The Time-Triggered Operating System (TTOS) is an adapted and extended version of the MARS
OS [71]. Task scheduling in TTOS is based on an offline generated scheduling table and relies on
the global time base provided by the TTP/C communication system. All synchronization is handled
by the offline scheduling. TTOS, and in general the entire TTA, is (just as IEC 61131-3) well suited
for the synchronous execution paradigm.
In a synchronous execution the system is considered sequential, computing in each step (or cycle) a
global output based on a global input. The effect of each step is defined by a set of transformation rules.
Scheduling is done statically by compiling the set of rules into a sequential program that implements these
rules and executes them in some statically defined order. A uniform timing bound for the execution of
global steps is assumed. In this context, a component is a design-level entity.
TTA defines a protocol for extending the synchronous-language paradigm to distributed platforms,
allowing distributed components to interoperate as long as they conform to the imposed timing requirements.
2.7.3 Real-Time Scheduling
Ideally, from a CBD perspective, the response time of a component should be independent of the
environment in which it is executing (since this would facilitate reuse of the component). However, this is in
most cases highly unrealistic, since:
1. The execution time of the task will be different in different target environments.
2. The response time additionally depends on the other tasks competing for the same resources
(CPU, etc.) and on the scheduling method used to resolve the resource contention.
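The interference described in point 2 is what classical fixed-priority response-time analysis captures (see Joseph and Pandya [88]). A minimal sketch of the standard fixed-point iteration, with a hypothetical task set:

```python
import math

# Classical fixed-priority response-time analysis: the worst-case response
# time R_i of task i satisfies R_i = C_i + sum over higher-priority tasks j
# of ceil(R_i / T_j) * C_j, solved by fixed-point iteration.
# tasks: list of (wcet, period), highest priority first.
def response_time(tasks, i):
    c_i = tasks[i][0]
    r = c_i
    while True:
        r_next = c_i + sum(math.ceil(r / t_j) * c_j for c_j, t_j in tasks[:i])
        if r_next == r:
            return r            # converged: worst-case response time
        if r_next > tasks[i][1]:
            return None         # exceeds the period: deadline may be missed
        r = r_next

tasks = [(1, 4), (2, 6), (3, 12)]       # hypothetical task set
print([response_time(tasks, i) for i in range(3)])  # -> [1, 3, 10]
```

Note how the lowest-priority task's response time (10) includes preemptions by both higher-priority tasks, illustrating why a component's response time cannot be stated independently of its environment.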
Rather than aiming for the nonachievable ideal, a realistic ambition could be to have a component model
and framework which allow for analysis of response times based on abstract models of components and
their compositions. Time-triggered systems go one step toward the ideal solution, in that components
can be temporally isolated from each other. While not having a major impact on the component model,
time-triggered systems simplify implementation of the component framework, since all synchronization
between components is resolved offline. Also, from a safety perspective, the time-triggered paradigm
gives benefits, in that it reduces the number of possible execution scenarios (owing to the static order of
execution of components and to the lack of preemption).
Also, in time-triggered component models it is possible to use the structure given by the component
composition to synthesize scheduling parameters. For instance, in Rubus [13] and TTA [8] this is already
done by generating the static schedule using the components as schedulable entities.
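The idea of using components as the schedulable entities of a static schedule can be sketched as follows. This is an illustrative sketch, not the Rubus or TTA tool chain: components are hypothetical (name, wcet, period) triples, time is divided into unit slots, and each released instance is placed earliest-deadline-first into the table.

```python
import math

# Illustrative sketch: generate a static cyclic schedule over the
# hyperperiod, using the components themselves as schedulable entities.
def static_schedule(components):
    """components: list of (name, wcet, period) with unit-length slots."""
    hyperperiod = math.lcm(*(p for _, _, p in components))
    table = [None] * hyperperiod
    # One unit job per WCET slot of every instance, sorted earliest deadline first.
    jobs = sorted((r + p, r, name)                  # (deadline, release, name)
                  for name, c, p in components
                  for r in range(0, hyperperiod, p)
                  for _ in range(c))
    for deadline, release, name in jobs:
        # Place the job in the first free slot of its release-deadline window.
        slot = next(t for t in range(release, deadline) if table[t] is None)
        table[slot] = name
    return table

print(static_schedule([("ctrl", 1, 2), ("log", 1, 4)]))
# -> ['ctrl', 'log', 'ctrl', None]
```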
In theory, a similar approach could also be used for dynamically scheduled systems, using a scheduler/
task configuration tool to automatically derive mappings of components to tasks and scheduling param-
eters (such as priorities or deadlines) for the tasks. However, this approach is still at the research stage.
2.8 Testing and Debugging of RTSs
According to a recent study by NIST [102], up to 80% of the life-cycle cost of software is spent on testing
and debugging. Despite this importance, there are few results on testing and debugging of RTSs.
The main reason for this is that it is actually quite difficult to test and debug RTSs. Remember that
RTSs are timing critical and that they interact with the real world. Since testing and debugging typically
involve some instrumentation of the code, the timing behavior of the system will be different during
testing/debugging compared with the deployed system. Hence, test cases that were
passed during testing may lead to failures in the deployed system, and tests that failed may not cause
any problem at all in the deployed system. For debugging the situation is possibly even worse, since
in addition to a similar effect when running the system in a debugger, entering a breakpoint will stop
the execution for an unspecified time. The problem with this is that the controlled external process will
continue to evolve (e.g., a car will not momentarily stop just because the execution of the controlling
software is stopped). The result is a behavior of the debugged system which will not be possible in
the real system. Also, it is often the case that the external process cannot be completely controlled, which
means that we cannot reproduce the observed behavior, making it difficult to use (cyclic)
debugging to track down the error that caused a failure.
The following are two possible solutions to the presented problems:

• To build a simulator that faithfully captures the functional as well as the timing behavior of both the
RTS and the environment which it is controlling. Since this is both time consuming and costly, this
approach is only feasible in very special situations. Since such situations are rare, we will not further
consider this alternative here.

• To record the RTS's behavior during testing or execution, and then, if a failure is detected, replay
the execution in a controlled way. For this to work it is essential that the timing behavior is the
same during testing as in the deployed system. This can be achieved either by using nonintrusive
hardware recorders, or by leaving the software used for instrumentation in the deployed system.
The latter comes at a cost in memory space and execution time, but gives the additional benefit
that it becomes possible to debug the deployed system as well in case of a failure [103].
An additional problem for most RTSs is that the system consists of several concurrently executing
threads. This is also the case for the majority of non-RTSs. This concurrency will in itself lead to problematic
nondeterminism: owing to race conditions caused by slight variations in execution time, the exact
preemption points will vary, causing unpredictability both in the number of possible scenarios and in
the ability to predict which scenario will actually be executed in a specific situation.
In conclusion, we note that testing and debugging of RTSs are difficult and challenging tasks.
The following is a brief account of some of the few results on testing of RTSs reported in the literature:

• Thane and Hansson [87] proposed a method for deterministic testing of distributed RTSs. The
key element here is to identify the different execution orderings (serializations of the concurrent
system) and treat each of these orderings as a sequential program. The main weakness of this
approach is the potentially exponential blow-up of the number of execution orderings.

• For testing of temporal correctness, Tsai et al. [104] provide a monitoring technique that records
runtime information. This information is then used to analyze whether the temporal constraints are
violated.

• Schütz [105] has proposed a strategy for testing distributed RTSs. The strategy is tailored to the
time-triggered MARS system [71].

• Zhu et al. [106] have proposed a framework for regression testing of real-time software in
distributed systems. The framework is based on Onoma's [107] regression testing process.
When it comes to RTS debugging, the most promising approach is record/replay [108–112], as men-
tioned earlier. Using record/replay, first a reference execution of the system is executed and observed;
second, a replay execution is performed based on the observations made during the reference execution.
The observations are made by instrumenting the system, in order to extract information about the
execution.
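A minimal sketch of the idea, with the recorded events reduced to input readings (a real recorder would also log preemption points and interrupt arrivals; all names are illustrative):

```python
import random

# Record/replay sketch: during the reference execution the instrumentation
# logs each nondeterministic observation; the replay execution feeds the
# log back in, so the same data (and hence behavior) is reproduced.
def reference_run(read_input, log):
    total = 0
    for _ in range(3):
        value = read_input()
        log.append(value)       # instrumentation: record the observation
        total += value
    return total

def replay_run(log):
    events = iter(log)
    # Re-execute, but read recorded values instead of the live environment.
    return reference_run(lambda: next(events), [])

log = []
first = reference_run(lambda: random.randint(0, 9), log)  # nondeterministic run
assert replay_run(log) == first   # the replay reproduces the reference
```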
In industrial practice, testing and debugging of multitasking RTSs is a time-consuming activity. At
best, hardware emulators, for example, Reference 113, are used to get some level of observability without
interfering with the observed system. More often, it is an ad hoc activity, using intrusive instrumentation
of the code to observe test results or to track down intricate timing errors. However, some tools using
the above record/replay method are now emerging on the market, for example, Reference 114.
2.9 Summary
This chapter has presented the most important issues, methods, and trends in the area of embedded RTSs.
A wide range of topics has been covered, from the initial design of embedded RTSs to analysis and testing.
Important issues discussed and presented are design tools, OSs, and major underlying mechanisms such
as architectures, models of interaction, real-time mechanisms, execution strategies, and scheduling.
Moreover, communication, analysis, and testing techniques are presented.
Over the years, academia has put considerable effort into advancing the various techniques used to compose
and design complex embedded RTSs. Standards bodies and industry are following at a slower pace, while also
adopting and developing area-specific techniques. Today, we can see diverse techniques used in different
application domains, such as automotive, aerospace, and trains. In the area of communications, an effort is
being made in academia, and also in some parts of industry, toward using Ethernet. This is a step toward a
common technique for several application domains.
Different real-time demands have led to domain-specific OSs, architectures, and models of interaction.
As many of these have several commonalities, there is a potential for standardization across several
domains. However, as this takes time, we will most certainly stay with application-specific techniques for
a while, and for specific domains with extreme demands on safety or low cost, specialized solutions will
most likely be used in the future as well. Therefore, knowledge of the techniques used in and suitable for the
various domains will remain important.
References
[1] Tom R. Halfhill. Embedded Market Breaks New Ground. Microprocessor Report, 17, 2000.
[2] H. Kopetz. Introduction. In Real-Time Systems: Introduction and Overview. Part XVIII of Lecture
Notes from ESSES 2003 – European Summer School on Embedded Systems. Ylva Boivie, Hans
Hansson, and Sang Lyul Min, Eds., Västerås, Sweden, September 2003.
[3] IEEE Computer Society. Technical Committee on Real-Time Systems Home Page. http://www.cs.
bu.edu/pub/ieee-rts/.
[4] Kluwer. Real-Time Systems (Journal). http://www.wkap.nl/kapis/CGI-BIN/WORLD/
journalhome.htm?0922-6443.
[5] C. Liu and J. Layland. Scheduling Algorithms for Multiprogramming in a Hard-Real-Time
Environment. Journal of the ACM, 20:46–61, 1973.
[6] M.H. Klein, T. Ralya, B. Pollak, R. Obenza, and M.G. Harbour. A Practitioner's Handbook for
Rate-Monotonic Analysis. Kluwer, Dordrecht, 1998.
[7] N.C. Audsley, A. Burns, R.I. Davis, K. Tindell, and A.J. Wellings. Fixed Priority Pre-Emptive
Scheduling: An Historical Perspective. Real-Time Systems, 8:129–154, 1995.
[8] Hermann Kopetz and Günther Bauer. The Time-Triggered Architecture. Proceedings of the IEEE,
Special Issue on Modeling and Design of Embedded Software, 91:112–126, 2003.
[9] J. Xu and D.L. Parnas. Scheduling Processes with Release Times, Deadlines, Precedence, and
Exclusion Relations. IEEE Transactions on Software Engineering, 16:360–369, 1990.
[10] Time Triggered Technologies. http://www.tttech.com.
[11] H. Kopetz and G. Grünsteidl. TTP – A Protocol for Fault-Tolerant Real-Time Systems. IEEE
Computer, 27(1):14–23, 1994.
[12] H. Hansson, H. Lawson, and M. Strömberg. BASEMENT – A Distributed Real-Time Architecture
for Vehicle Applications. Real-Time Systems, 3:223–244, 1996.
[13] Arcticus Systems. The Rubus Operating System. http://www.arcticus.se.
[14] OMG. CORBA Component Model 3.0, June 2002. http://www.omg.org/technology/documents/
formal/components.htm.
[15] Microsoft. Microsoft .COM Technologies. http://www.microsoft.com/com/.
[16] Microsoft. .NET Home Page. http://www.microsoft.com/net/.
[17] SUN Microsystems. Introducing Java Beans. http://developer.java.sun.com/developer/online/
Training/Beans/ Beans1/index.html.
[18] John A. Stankovic. VEST – A Toolset for Constructing and Analyzing Component-Based
Embedded Systems. Lecture Notes in Computer Science, 2211:390–402, 2001.
[19] Rob van Ommering. The Koala Component Model. In Building Reliable Component-Based
Software Systems. Artech House Publishers, July 2002, pp. 223–236.
[20] P.O. Müller, C.M. Stich, and C. Zeidler. Component-Based Embedded Systems. In Building
Reliable Component-Based Software Systems. Artech House Publishers, 2002, pp. 303–323.
[21] D.B. Stewart, R.A. Volpe, and P.K. Khosla. Design of Dynamically Reconfigurable Real-Time
Software Using Port-Based Objects. IEEE Transactions on Software Engineering, 23(12):759–776,
1997.
[22] OMG. Unified Modeling Language (UML), Version 1.5, 2003. http://www.omg.org/technology/
documents/formal/uml.htm.
[23] Rational. Rational Rose RealTime. http://www.rational.com/products/rosert.
[24] I-Logix. Rhapsody. http://www.ilogix.com/products/rhapsody.
[25] TeleLogic. Telelogic tau. http://www.telelogic.com/products/tau.
[26] Vector. DaVinci Tool Suite. http://www.vector-informatik.de/.
[27] OMG. Unified Modeling Language (UML), Version 2.0 (draft). OMG document ptc/03-09-15,
September 2003.
[28] ITEA. EAST/EEA Project Site. http://www.east-eea.net.
[29] ETAS. http://en.etasgroup.com.
[30] Vector. http://www.vector-informatik.com.
[31] Siemens. http://www.siemensvdo.com.
[32] Comp.realtime FAQ. Available at http://www.faqs.org/faqs/realtime-computing/faq/.
[33] Roadmap – Adaptive Real-Time Systems for Quality of Service Management. ARTIST Project
IST-2001-34820, May 2003. http://www.artist-embedded.org/Roadmaps/.
[34] G.C. Buttazzo. Hard Real-Time Computing Systems. Kluwer Academic Publishers, Dordrecht,
1997.
[35] A. Burns and A. Wellings. Real-Time Systems and Programming Languages, 2nd ed. Addison-
Wesley, Reading, MA, 1996.
[36] The Asterix Real-Time Kernel. http://www.mrtc.mdh.se/projects/asterix/.
[37] LiveDevices. Realogy Real-Time Architect, SSX5 Operating System, 1999. http://www.livedevices.
com/realtime.shtml.
[38] Wind River Systems Inc. VxWorks Programmer's Guide. http://www.windriver.com/.
[39] Lynuxworks. http://www.lynuxworks.com.
[40] Enea OSE Systems. Ose. http://www.ose.com.
[41] QNX Software Systems. QNX Realtime OS. http://www.qnx.com.
[42] List of Real-Time Linux Variants. http://www.realtimelinuxfoundation.org/variants/
variants.html.
[43] Express Logic. Threadx. http://www.expresslogic.com.
[44] Northern Real-Time Applications. Total Time Predictability. Whitepaper on SSX5, 1998.
[45] IEEE. Standard for Information Technology – Standardized Application Environment Profile –
POSIX Realtime Application Support (AEP). IEEE Standard P1003.13-1998, 1998.
[46] OSEK Group. OSEK/VDX Operating System Specification 2.2.1. http://www.osek-vdx.org/.
[47] Airlines Electronic Engineering Committee (AEEC). ARINC 653: Avionics Application Software
Standard Interface (Draft 15), June 1996.
[48] ISO. Ada95 Reference Manual. ISO/IEC 8652:1995(E), 1995.
[49] G. Fohler, T. Lennvall, and R. Dobrin. A Component Based Real-Time Scheduling Archi-
tecture. In Architecting Dependable Systems, Vol. LNCS-2677. R. de Lemos, C. Gacek, and
A. Romanovsky, Eds., Springer-Verlag, Heidelberg, 2003.
[50] J. Mäki-Turja and M. Sjödin. Combining Dynamic and Static Scheduling in Hard Real-Time
Systems. Technical report MRTC no. 71, Mälardalen Real-Time Research Centre (MRTC),
October 2002.
[51] B. Sprunt, L. Sha, and J.P. Lehoczky. Aperiodic Task Scheduling for Hard Real-Time Systems.
Real-Time Systems, 1:27–60, 1989.
[52] M. Spuri and G.C. Buttazzo. Efficient Aperiodic Service under Earliest Deadline Scheduling.
In Proceedings of the 15th IEEE Real-Time Systems Symposium (RTSS'94). IEEE Computer Society,
San Juan, Puerto Rico, December 1994, pp. 2–11.
[53] M. Spuri and G.C. Buttazzo. Scheduling Aperiodic Tasks in Dynamic Priority Systems. Real-Time
Systems, 10:179–210, 1996.
[54] L. Abeni and G. Buttazzo. Integrating Multimedia Applications in Hard Real-Time Systems.
In Proceedings of the 19th IEEE Real-Time Systems Symposium (RTSS'98). IEEE Computer Society,
Madrid, Spain, December 1998, pp. 4–13.
[55] M. Spuri, G.C. Buttazzo, and F. Sensini. Robust Aperiodic Scheduling under Dynamic Priority
Systems. In Proceedings of the 16th IEEE Real-Time Systems Symposium (RTSS'95). IEEE Computer
Society, Pisa, Italy, December 1995, pp. 210–219.
[56] CAN Specification 2.0, Part-A and Part-B. CAN in Automation (CiA), Am Weichselgarten 26,
D-91058 Erlangen, 2002. http://www.can-cia.de.
[57] Road Vehicles – Interchange of Digital Information – Controller Area Network (CAN) for
High Speed Communications, ISO/DIS 11898, February 1992.
[58] K.W. Tindell, A. Burns, and A.J. Wellings. Calculating Controller Area Network (CAN) Message
Response Times. Control Engineering Practice, 3:1163–1169, 1995.
[59] K. Tindell, H. Hansson, and A. Wellings. Analysing Real-Time Communications: Controller Area
Network (CAN). In Proceedings of the 15th IEEE Real-Time Systems Symposium (RTSS). IEEE
Computer Society Press, December 1994, pp. 259–263.
[60] Road Vehicles – Controller Area Network (CAN) – Part 4: Time-Triggered Communication.
ISO/CD 11898-4.
[61] L. Almeida, J.A. Fonseca, and P. Fonseca. Flexible Time-Triggered Communication on a Con-
troller Area Network. In Proceedings of the Work-In-Progress Session of the 19th IEEE Real-Time
Systems Symposium (RTSS'98). IEEE Computer Society, Madrid, Spain, December 1998.
[62] L. Almeida, J.A. Fonseca, and P. Fonseca. A Flexible Time-Triggered Communication
System Based on the Controller Area Network: Experimental Results. In Proceedings of the IFAC
International Conference on Fieldbus Technology (FeT). Springer, 1999, pp. 342–350.
[63] TTTech Computertechnik AG. Specification of the TTP/C Protocol v0.5, July 1999.
[64] H. Kopetz. The Time-Triggered Model of Computation. In Proceedings of the 19th IEEE Real-
Time Systems Symposium (RTSS'98). IEEE Computer Society, Madrid, Spain, December 1998,
pp. 168–177.
[65] LIN. Local Interconnect Network. http://www.lin-subbus.de.
[66] R. Belschner, J. Berwanger, C. Ebner, H. Eisele, S. Fluhrer, T. Forest, T. Führer, F. Hartwich,
B. Hedenetz, R. Hugel, A. Knapp, J. Krammer, A. Millsap, B. Müller, M. Peller, and A. Schedl.
FlexRay Requirements Specification, April 2002. http://www.flexray-group.com.
[67] ARINC/RTCA-SC-182/EUROCAE-WG-48. Minimal Operational Performance Standard for
Avionics Computer Resources, 1999.
[68] PROFIBUS. PROFIBUS International. http://www.profibus.com.
[69] H. Kirrmann and P.A. Zuber. The IEC/IEEE Train Communication Network. IEEE Micro,
21:81–92, 2001.
[70] WorldFIP. WorldFIP Fieldbus. http://www.worldfip.org.
[71] H. Kopetz, A. Damm, C. Koza, and M. Mulazzani. Distributed Fault-Tolerant Real-Time Systems:
The MARS Approach. IEEE Micro, 9(1):25–40, 1989.
[72] C. Venkatramani and T. Chiueh. Supporting Real-Time Traffic on Ethernet. In Proceedings of the
15th IEEE Real-Time Systems Symposium (RTSS'94). IEEE Computer Society, San Juan, Puerto
Rico, December 1994, pp. 282–286.
[73] D.W. Pritty, J.R. Malone, S.K. Banerjee, and N.L. Lawrie. A Real-Time Upgrade for Ethernet
Based Factory Networking. In Proceedings of IECON'95. IEEE Industrial Electronics Society,
1995, pp. 1631–1637.
[74] N. Malcolm and W. Zhao. The Timed Token Protocol for Real-Time Communication. IEEE
Computer, 27:35–41, 1994.
[75] K.K. Ramakrishnan and H. Yang. The Ethernet Capture Effect: Analysis and Solution. In
Proceedings of the 19th IEEE Local Computer Networks Conference (LCNC'94), October 1994,
pp. 228–240.
[76] M. Molle. A New Binary Logarithmic Arbitration Method for Ethernet. Technical report, TR
CSRI-298, CRI, University of Toronto, Canada, 1994.
[77] G. Le Lann and N. Riviere. Real-Time Communications over Broadcast Networks: The CSMA/DCR
and the DOD-CSMA/CD Protocols. Technical report, TR 1863, INRIA, 1993.
[78] M. Molle and L. Kleinrock. Virtual Time CSMA: Why Two Clocks are Better than One. IEEE
Transactions on Communications, 33:919–933, 1985.
[79] W. Zhao and K. Ramamritham. A Virtual Time CSMA/CD Protocol for Hard Real-Time Commu-
nication. In Proceedings of the 7th IEEE Real-Time Systems Symposium (RTSS'86). IEEE Computer
Society, New Orleans, LA, December 1986, pp. 120–127.
[80] M. El-Derini and M. El-Sakka. A Novel Protocol Under a Priority Time Constraint for Real-Time
Communication Systems. In Proceedings of the 2nd IEEE Workshop on Future Trends of Distrib-
uted Computing Systems (FTDCS'90). IEEE Computer Society, Cairo, Egypt, September 1990,
pp. 128–134.
[81] W. Zhao, J.A. Stankovic, and K. Ramamritham. A Window Protocol for Transmission of Time-
Constrained Messages. IEEE Transactions on Computers, 39:1186–1203, 1990.
[82] L. Almeida, P. Pedreiras, and J.A. Fonseca. The FTT-CAN Protocol: Why and How? IEEE
Transactions on Industrial Electronics, 49(6):1189–1201, 2002.
[83] P. Pedreiras, L. Almeida, and P. Gai. The FTT-Ethernet Protocol: Merging Flexibility, Timeliness
and Efficiency. In Proceedings of the 14th Euromicro Conference on Real-Time Systems (ECRTS'02).
IEEE Computer Society, Vienna, Austria, June 2002, pp. 152–160.
[84] S.K. Kweon, K.G. Shin, and G. Workman. Achieving Real-Time Communication over Ethernet
with Adaptive Traffic Smoothing. In Proceedings of the Sixth IEEE Real-Time Technology and
Applications Symposium (RTAS'00). IEEE Computer Society, Washington, DC, June 2000,
pp. 90–100.
[85] A. Carpenzano, R. Caponetto, L. Lo Bello, and O. Mirabella. Fuzzy Traffic Smoothing: An
Approach for Real-Time Communication over Ethernet Networks. In Proceedings of the Fourth
IEEE International Workshop on Factory Communication Systems (WFCS'02). IEEE Industrial
Electronics Society, Västerås, Sweden, August 2002, pp. 241–248.
[86] J.L. Sobrinho and A.S. Krishnakumar. EQuB – Ethernet Quality of Service Using Black Bursts.
In Proceedings of the 23rd IEEE Annual Conference on Local Computer Networks (LCN'98). IEEE
Computer Society, Lowell, MA, October 1998, pp. 286–296.
[87] H. Thane and H. Hansson. Towards Systematic Testing of Distributed Real-Time Systems.
In Proceedings of the 20th IEEE Real-Time Systems Symposium (RTSS). December 1999,
pp. 360–369.
[88] M. Joseph and P. Pandya. Finding Response Times in a Real-Time System. Computer Journal,
29:390–395, 1986.
[89] The Times Tool. http://www.docs.uu.se/docs/rtmv/times.
[90] AbsInt. http://www.absint.com.
[91] Bound-T Execution Time Analyzer. http://www.bound-t.com.
[92] L. Casparsson, A. Rajnak, K. Tindell, and P. Malmberg. Volcano – A Revolution in On-Board
Communications. Volvo Technology Report, 1:9–19, 1998.
[93] Volcano Automotive Group. http://www.volcanoautomotive.com.
[94] TimeSys. TimeWiz – A Modeling and Simulation Tool. http://www.timesys.com/.
[95] OMG. UML Profile for Schedulability, Performance, and Time Specification. OMG document
formal/2003-09-01, September 2003.
[96] J.L. Medina, M. González Harbour, and J.M. Drake. MAST Real-Time View: A Graphic UML
Tool for Modeling Object-Oriented Real-Time Systems. In Proceedings of the 22nd IEEE Real-Time
Systems Symposium (RTSS). IEEE Computer Society, December 2001, pp. 245–256.
[97] MAST home page. http://mast.unican.es/.
[98] A. Massa. Embedded Software Development with eCos. Prentice Hall, New York, November 2002,
ISBN 0130354732.
[99] eCos Home Page. http://sources.redhat.com/ecos.
[100] OMG. CORBA Home Page. http://www.omg.org/corba/.
[101] PECOS Project Web Site. http://www.pecos-project.org.
[102] U.S. Department of Commerce. The Economic Impacts of Inadequate Infrastructure for Software
Testing. NIST report, May 2002.
[103] M. Ronsse, K. De Bosschere, M. Christiaens, J. Chassin de Kergommeaux, and D. Kranzlmüller.
Record/Replay for Nondeterministic Program Executions. Communications of the ACM, 46:62–67,
2003.
[104] J.J.P. Tsai, K.Y. Fang, and Y.D. Bi. On Real-Time Software Testing and Debugging. In Proceedings
of the 14th Annual International Computer Software and Applications Conference. IEEE Computer
Society, November 1990, pp. 512–518.
[105] W. Schütz. Fundamental Issues in Testing Distributed Real-Time Systems. Real-Time Systems,
7:129–157, 1994.
[106] H. Zhu, P. Hall, and J. May. Software Unit Test Coverage and Adequacy. ACM Computing Surveys,
29(4):366–427, 1997.
[107] K. Onoma, W.-T. Tsai, M. Poonawala, and H. Suganuma. Regression Testing in an Industrial
Environment. Communications of the ACM, 41:81–86, 1998.
[108] J.D. Choi, B. Alpern, T. Ngo, M. Sridharan, and J. Vlissides. A Perturbation-Free Replay Platform
for Cross-Optimized Multithreaded Applications. In Proceedings of the 15th International Parallel
and Distributed Processing Symposium. IEEE Computer Society Press, Washington, April 2001.
[109] J. Mellor-Crummey and T. LeBlanc. A Software Instruction Counter. In Proceedings of the Third
International Conference on Architectural Support for Programming Languages and Operating
Systems. ACM, April 1989, pp. 78–86.
[110] K.C. Tai, R. Carver, and E. Obaid. Debugging Concurrent ADA Programs by Deterministic
Execution. IEEE Transactions on Software Engineering, 17:280–287, 1991.
[111] H. Thane and H. Hansson. Using Deterministic Replay for Debugging of Distributed Real-Time
Systems. In Proceedings of the 12th Euromicro Conference on Real-Time Systems. IEEE Computer
Society Press, Washington, June 2000, pp. 265–272.
[112] F. Zambonelli and R. Netzer. An Efficient Logging Algorithm for Incremental Replay of Message-
Passing Applications. In Proceedings of the 13th International and 10th Symposium on Parallel and
Distributed Processing. IEEE, April 1999, pp. 392–398.
[113] Lauterbach. http://www.lauterbach.com.
[114] ZealCore. ZealCore Embedded Solutions AB. http://www.zealcore.com.
Design and Validation
of Embedded Systems
3 Design of Embedded Systems
Luciano Lavagno and Claudio Passerone
4 Models of Embedded Computation
Axel Jantsch
5 Modeling Formalisms for Embedded System Design
Luís Gomes, João Paulo Barros, and Anikó Costa
6 System Validation
J.V. Kapitonova, A.A. Letichevsky, V.A. Volkov, and Thomas Weigert
3
Design of Embedded
Systems
Luciano Lavagno
Cadence Berkeley Laboratories
and Politecnico di Torino
Claudio Passerone
Politecnico di Torino
3.1 The Embedded System Revolution . . . . . . . . . . . . . . . . . . . . . 3-1
3.2 Design of Embedded Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-2
3.3 Functional Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-6
3.4 Function–Architecture and Hardware–Software
Codesign. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-7
3.5 Hardware–Software Coverification and Hardware
Simulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-11
3.6 Software Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-12
Compilation, Debugging, and Memory Model • Real-Time
Scheduling
3.7 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-16
Logic Synthesis and Equivalence Checking • Placement,
Routing, and Extraction • Simulation, Formal Verification,
and Test Pattern Generation
3.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3-22
3.1 The Embedded System Revolution
The world of electronics has witnessed a dramatic growth of its applications in the last few decades. From
telecommunications to entertainment, from automotive to banking, almost every aspect of our everyday
life employs some kind of electronic component. In most cases, these components are computer-based
systems, which are not, however, used or perceived as computers. For instance, they often do not have
a keyboard or a display to interact with the user, and they do not run standard operating systems and
applications. Sometimes, these systems constitute a self-contained product themselves (e.g., a mobile
phone), but they are frequently embedded inside another system, for which they provide better
functionality and performance (e.g., the engine control unit of a motor vehicle). We call these computer-based
systems embedded systems.
The huge success of embedded electronics has several causes. The main one, in our opinion, is that
embedded systems bring the advantages of Moore's law into everyday life, that is, an exponential increase
in performance and functionality at an ever decreasing cost. This is possible because of the capabilities
of integrated circuit technology and manufacturing, which allow one to build more and more complex
devices, and because of the development of new design methodologies, which allow one to efficiently and
cleverly use those devices. Traditional steel-based mechanical development, on the other hand, has reached
a plateau near the middle of the twentieth century, and thus it is no longer a significant source of innovation,
unless coupled to electronic manufacturing technologies (microelectromechanical systems,
MEMS) or embedded systems, as argued above.
There are many examples of embedded systems in the real world. For instance, a modern car contains
tens of electronic components (control units, sensors, and actuators) that perform very different tasks. The
first embedded systems that appeared in a car were related to the control of mechanical aspects, such as
the control of the engine, the antilock brake system, and the control of suspension and transmission.
However, nowadays cars also have a number of components that are not directly related to mechanical
aspects, but are mostly related to the use of the car as a vehicle for moving around, or the communication
needs of the passengers: navigation systems, digital audio and video players, and phones are just a few
examples. Moreover, many of these embedded systems are connected together using a network, because
they need to share information regarding the state of the car.
Other examples come from the communication industry: a cellular phone is an embedded system whose
environment is the mobile network. These are very sophisticated computers whose main task is to send
and receive voice, but are also currently used as personal digital assistants, for games, to send and receive
images and multimedia messages, and to wirelessly browse the Internet. They have been so successful
and pervasive that in just a decade they became essential in our life. Other kinds of embedded systems
significantly changed our life as well: for instance, ATM and Point-of-Sale (POS) machines modified the
way we do payments, and multimedia digital players changed how we listen to music and watch videos.
We are just at the beginning of a revolution that will have an impact on every other industrial sector.
Special purpose embedded systems will proliferate and will be found in almost any object that we use.
They will be optimized for the application and show a natural user interface. They will be flexible, in order
to adapt to a changing environment. Most of them will also be wireless, in order to follow us wherever
we go and keep us constantly connected with the information we need and the people we care about. Even the
role of computers will have to be reconsidered, as many of the applications for which they are used today
will be performed by specially designed embedded systems.
What are the consequences of this revolution in the industry? Modern car manufacturers today need
to acquire a significant amount of skills in hardware and software design, in addition to the mechanical
skills that they already had in-house, or they should outsource the requirements they have to an external
supplier. In either case, a broad variety of skills needs to be mastered, from the design of software
architectures for implementing the functionality, to being able to model the performance, because
real-time aspects are extremely important in embedded systems, especially those related to safety-critical
applications. Embedded system designers must also be able to architect and analyze the performance of
networks, as well as validate the functionality that has been implemented over a particular architecture
and the communication protocols that are used.
A similar revolution has happened or is about to happen to other industrial and socioeconomical areas
as well, such as entertainment, tourism, education, agriculture, government, and so on. It is therefore clear
that new, more efficient and easy-to-use embedded electronics design methodologies need to be developed,
in order to enable the industry to make use of the available technology.
3.2 Design of Embedded Systems
Embedded systems are informally defined as a collection of programmable parts surrounded by Application-Specific
Integrated Circuits (ASICs) and other standard components (Application-Specific Standard Parts,
ASSPs) that interact continuously with an environment through sensors and actuators. The collection
can be physically a set of chips on a board, or a set of modules on an integrated circuit. Software is
used for features and flexibility, while dedicated hardware is used for increased performance and reduced
power consumption. An example of an embedded system architecture is shown in Figure 3.1.
The main programmable components are microprocessors and Digital Signal Processors (DSPs), which
implement the software partition of the system. One can view reconfigurable components, especially
[Figure 3.1 shows programmable cores (a µP/µC with coprocessor, a DSP), an FPGA, an IP block, memories (including a dual-port memory), and a peripheral connected through buses and a bridge.]
FIGURE 3.1 A reactive real-time embedded system architecture.
if they can be reconfigured at runtime, as programmable components in this respect. They exhibit
area, cost, performance, and power characteristics that are intermediate between dedicated hardware and
processors. Custom and programmable hardware components, on the other hand, implement application-specific
blocks and peripherals. All components are connected through standard and dedicated buses and
networks, and data is stored in a set of memories. Often several smaller subsystems are networked together
to control, for example, an entire car, or to constitute a cellular or wireless network.
We can identify a set of typical characteristics that are commonly found in embedded systems. For
instance, they are usually not very flexible and are designed to always perform the same task: if you
buy an engine control embedded system, you cannot use it to control the brakes of your car, or to play
games. A PC, on the other hand, is much more flexible because it can perform several very different tasks.
An embedded system is often part of a larger controlled system. Moreover, cost, reliability, and safety
are often more important criteria than performance, because the customer may not even be aware of the
presence of the embedded system, and so looks at other characteristics, such as the cost, the ease of use,
or the lifetime of the product.
Another common characteristic of many embedded systems is that they need to be designed in an
extremely short time to meet their time to market. Only a few months should elapse from the conception
of a consumer product to the first working prototypes. If these deadlines are not met, the result is both
an increase in design costs and a decrease in profits, because fewer items will be sold. So delays
in the design cycle may make the difference between a successful product and an unsuccessful one.
In the current state of the art, embedded systems are designed with an ad hoc approach that is heavily
based on earlier experience with similar products and on manual design. Often the design process requires
several iterations to obtain convergence, because the system is not specified in a rigorous and unambiguous
fashion, and the level of abstraction, details, and design style in various parts are likely to differ. But
as the complexity of embedded systems scales up, this approach is showing its limits, especially regarding
design and verification time.
New methodologies are being developed to cope with the increased complexity and enhance designers'
productivity. In the past, a sequence of two steps has always been used to reach this goal: abstraction and
clustering. Abstraction means describing an object (e.g., a logic gate made of metal oxide semiconductor
[MOS] transistors) using a model where some of the low-level details are ignored (e.g., the Boolean
expression representing that logic gate). Clustering means connecting a set of models at the same level of
abstraction, to get a new object, which usually shows new properties that are not part of the isolated models
that constitute it. By successively applying these two steps, digital electronic design went from drawing
layouts, to transistor schematics, to logic gate netlists, to register transfer level (RTL) descriptions, as shown
in Figure 3.2.
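The two steps can be sketched in a few lines of code. This is only an illustration of the idea, not any real EDA tool: `nand` abstracts away the transistor-level detail of a CMOS gate, and each further function clusters lower-level models into a new object with properties the parts lack in isolation.

```python
# Abstraction: a 4-transistor CMOS NAND reduced to its Boolean behavior;
# layout, timing, and electrical detail are ignored.
def nand(a: bool, b: bool) -> bool:
    return not (a and b)

# Clustering: four NAND models wired together yield XOR, a property
# not present in any single NAND.
def xor(a: bool, b: bool) -> bool:
    t = nand(a, b)
    return nand(nand(a, t), nand(b, t))

# One more clustering step: gates become an arithmetic block.
def half_adder(a: bool, b: bool) -> tuple[bool, bool]:
    return xor(a, b), (a and b)   # (sum, carry)

assert half_adder(True, True) == (False, True)
```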
[Figure 3.2 depicts four levels (transistor model, 1970s; gate-level model, 1980s; register transfer level, 1990s; system level with hardware and software clusters, 2000+), each reached from the one below by alternating abstraction and clustering steps.]
FIGURE 3.2 Abstraction and clustering levels in hardware design.
The notion of platform is key to the efficient use of abstraction and clustering. A platform is a single
abstract model that hides the details of a set of different possible implementations as clusters of lower-level
components. The platform, for example, a family of microprocessors, peripherals, and bus protocols,
allows developers of designs at the higher level (generically called applications in the following) to
operate without detailed knowledge of the implementation (e.g., the pipelining of the processor or the
internal implementation of the Universal Asynchronous Receiver/Transmitter [UART]). At the same time,
it allows platform implementors to share design and fabrication costs among a broad range of potential
users, broader than if each design were one of a kind.
Today we are witnessing the appearance of a new, higher level of abstraction, as a response to the
growing complexity of integrated circuits. Objects can be functional descriptions of complex behaviors,
or architectural specifications of complete hardware platforms. They make use of formal high-level models
that can be used to perform an early and fast validation of the final system implementation, although with
reduced detail with respect to a lower-level description.
The relationship between an application and the elements of a platform is called a mapping. Such a relationship exists,
for example, between logic gates and the geometric patterns of a layout, as well as between RTL statements and
gates. At the system level, the mapping is between functional objects with their communication links,
and platform elements with their communication paths. Mapping at the system level means associating
a functional behavior (e.g., an FFT [fast Fourier transform] or a filter) with an architectural element that
can implement that behavior (e.g., a CPU, DSP, or piece of dedicated hardware). It can also associate
a communication link (e.g., an abstract FIFO [first in, first out]) with some communication service available
in the architecture (e.g., a driver, a bus, and some interfaces). The mapping step may also need to specify
parameters for these associations (e.g., the priority of a software task or the size of a FIFO), in order to
describe it completely. The object that we obtain after mapping shows properties that were not directly
exposed in the separate descriptions, such as the performance of the selected system implementation.
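A system-level mapping of this kind can be sketched as a simple data structure. Every name below (the blocks, the platform elements, the parameters) is invented for illustration; the point is that each binding of a behavior or link to a platform element carries its own parameters, and that the mapped object exposes properties visible in neither description alone.

```python
# Functional side: behaviors and the abstract links between them.
functional_model = {
    "blocks": ["fft", "filter", "control"],
    "links":  [("fft", "filter"), ("filter", "control")],
}

# Architectural side: platform elements offering computation and
# communication services (all names hypothetical).
platform = {"dsp": "DSP core", "cpu": "RISC core", "bus": "system bus"}

mapping = {
    # behavior -> (architectural element, mapping parameters)
    "fft":     ("dsp", {"priority": 1}),
    "filter":  ("dsp", {"priority": 2}),
    "control": ("cpu", {"priority": 5}),
    # abstract FIFO -> communication service and its sizing
    ("fft", "filter"):     ("dsp_local_memory", {"fifo_depth": 64}),
    ("filter", "control"): ("bus",              {"fifo_depth": 8}),
}

# A property of the mapped object, not of either description alone:
# only one link loads the shared bus.
bus_links = [l for l in functional_model["links"] if mapping[l][0] == "bus"]
assert bus_links == [("filter", "control")]
```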
Performance is not just timing, but any other quantity that can be defined to characterize an embedded
system, either physical (area, power consumption, etc.) or logical (quality of service [QoS], fault
tolerance, etc.).
Since the system-level mapping operates on heterogeneous objects, it also allows one to cleanly separate
different and orthogonal aspects, such as:
1. Computation and communication. This separation is important because refinement of computation
is generally done by hand, or by compilation and scheduling, while communication makes use
of patterns.
2. Application and platform implementation (also called functionality and architecture, e.g., in
Reference 1), because they are often defined and designed independently by different groups or
companies.
3. Behavior and performance, which should be kept separate because performance information can
either represent nonfunctional requirements (e.g., the maximum response time of an embedded controller),
or the result of an implementation choice (e.g., the worst-case execution time [WCET] of
a task). Nonfunctional constraint verification can be performed traditionally, by simulation and
prototyping, or with static formal checks, such as schedulability analysis.
All these separations result in better reuse, because they decouple independent aspects that would otherwise
tie, for example, a given functional specification to low-level implementation details by modeling it
as assembler or Verilog code. This in turn allows one to reduce design time, by increasing productivity
and decreasing the time needed to verify the system.
A schematic representation of a methodology that can be derived from these abstraction and clustering
steps is shown in Figure 3.3. At the functional level, a behavior for the system to be implemented
is specified, designed, and analyzed, either through simulation or by proving that certain properties are
satisfied (the algorithm always terminates, the computation performed satisfies a set of specifications, the
complexity of the algorithm is polynomial, etc.). In parallel, a set of architectures is composed from
a clustering of platform elements, and selected as candidates for the implementation of the behavior.
These components may come from an existing library or may be specifications of components that will
be designed later.
Next, functional operations are assigned to the various architecture components, and patterns provided
by the architecture are selected for the defined communications. At this level we are now able to verify
the performance of the selected implementation, with much richer detail than at the pure functional
[Figure 3.3 shows three levels: at the functional level, a function is specified from behavioral libraries and verified; at the mapping level, the function is mapped onto an architecture composed from architecture libraries, and both the architecture and the performance of the mapped design are verified; refinement then leads to the implementation level, where the refinements themselves are verified.]
FIGURE 3.3 Design methodology for embedded systems.
level. Different mappings to the same architecture, or mappings to different architectures, allow one to
explore the design space to find the best solutions to important design challenges. These kinds of analysis
let the designer identify and correct possible problems early in the design cycle, thus drastically reducing
the time to explore the design space and weed out potentially catastrophic mistakes and bugs. At this
stage it is also very important to define the organization of the data storage units for the system. The various
kinds of memories (e.g., ROM, SRAM, DRAM, Flash, etc.) have different performance and data persistency
characteristics, and must be used judiciously to balance cost and performance. Mapping data structures to
different memories, and even changing the organization and layout of arrays, can have a dramatic impact
on meeting a given latency in the execution of an algorithm, for example. In particular, a System-on-Chip
designer can afford to do very fine tuning of the number and sizes of embedded memories
(especially SRAM, but now also Flash) to be connected to processors and dedicated hardware [2].
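The impact of data placement can be made concrete with a toy cost model. The per-access latencies below are illustrative round numbers, not figures from this chapter, but they show why moving a hot array from off-chip DRAM into on-chip SRAM can dominate the latency of a loop.

```python
# Assumed read latencies in cycles (illustrative, not measured values).
LATENCY = {"sram": 1, "dram": 10, "flash": 50}

def loop_cycles(accesses: int, memory: str) -> int:
    """Cycles a loop spends on memory traffic for one array placement."""
    return accesses * LATENCY[memory]

# A 1024-tap filter kernel reading its coefficient array once per sample:
in_sram = loop_cycles(1024, "sram")
in_dram = loop_cycles(1024, "dram")
assert in_dram == 10 * in_sram   # same code, 10x the memory stall time
```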
Finally, at the implementation level, the reverse transformation of abstraction and clustering occurs,
that is, a lower-level specification of the embedded system is generated. This is obtained through a series
of manual or automatic refinements and modifications that successively add more detail, while checking
their compliance with the higher-level requirements. This step does not need to generate a
manufacturable final implementation directly, but rather produces a new description that in turn constitutes the
input for another (recursive) application of the same overall methodology at a lower level of abstraction
(e.g., synthesis, placement, and routing for hardware, and compilation and linking for software). Moreover,
the results obtained by these refinements can be back-annotated to the higher level, to perform better
and more accurate verification.
3.3 Functional Design
As discussed in the previous section, system-level design of embedded electronics requires two distinct
phases. In the first phase, functional and nonfunctional constraints are the key aspects. In the second
phase, the available architectural platforms are taken into account, and detailed implementation can
proceed after a mapping phase that defines the architectural component on which every functional model
is implemented. This second phase requires a careful analysis of the trade-offs between algorithmic
complexity, functional flexibility, and implementation costs.
In this section we describe some of the tools that are used for requirements capture, focusing especially
on those that permit executable specification. Such tools generally belong to two broad classes.
The first class is represented, for example, by Simulink [3], MATRIXx [4], Ascet-SD [5], SPW [6],
SCADE [7], and System Studio [8]. It includes block-level editors and libraries with which the designer
composes data-dominated digital signal processing and embedded control systems. The libraries include
simple blocks, such as multiplication, addition, and multiplexing, as well as more complex ones, such as
FIR filters, FFTs, and so on.
The second class is represented by tools such as Tau [9], StateMate [10], Esterel Studio [7], and StateFlow [3].
It is oriented to control-dominated embedded systems. In this case, the emphasis is placed on the decisions
that must be taken by the embedded system in response to environment and user inputs, rather than on
numerical computations. The notation is generally some form of Harel's Statecharts [11].
The Unified Modeling Language (UML), as standardized by the Object Management Group [12], is in
a class by itself, since it has historically focused more on general-purpose software (e.g., enterprise
and commercial software) than on embedded real-time software. Only recently have some
embedded aspects, such as performance and time, been incorporated in UML 2.0 [12,13], and emphasis
has been placed on model-based software generation. However, tool support for UML 2.0 is still limited
(Tau [9], Real Time Studio [14], and Rose RealTime [15] provide some), and UML-based hardware
design is still in its infancy. Furthermore, UML is a collection of notations, some of which (especially
Statecharts) are supported by several of the tools listed above in the control-dominated class.
Simulink and its related tools and toolboxes, both from The MathWorks and from third parties such as
dSPACE [16], are the workhorse of modern model-based embedded system design. In model-based design,
a functional executable model is used for algorithm development. This is made easier in the case of
Simulink by its tight integration with MATLAB, the standard tool in DSP algorithm development. The same
functional model, with added annotations such as bit widths and execution priorities, is then used for
algorithmic refinements, such as floating-point to fixed-point conversion and real-time task generation.
Then automated software generators, such as Real-Time Workshop, Embedded Coder [3], and
TargetLink [16], are used to generate task code and sometimes to customize a real-time operating system
(RTOS) on which the tasks will run. Ascet-SD, for example, automatically generates a customization of
the OSEK automotive RTOS [17] for the tasks that are generated from a functional model. In all these
cases, a task is typically generated from a set of blocks that are executed at the same rate or triggered by
the same event in the functional model.
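The floating- to fixed-point refinement mentioned above can be sketched as follows. This is a minimal illustration, not the algorithm of any tool named here: coefficients are quantized to the Q15 format (1 sign bit, 15 fractional bits) common on 16-bit DSPs, and the coefficient values are invented.

```python
Q = 15  # number of fractional bits in Q15

def to_q15(x: float) -> int:
    """Quantize a float in [-1, 1) to a saturated 16-bit Q15 integer."""
    v = round(x * (1 << Q))
    return max(-(1 << 15), min((1 << 15) - 1, v))  # saturate to int16 range

def from_q15(v: int) -> float:
    """Recover the real value represented by a Q15 integer."""
    return v / (1 << Q)

coeffs = [0.5, -0.25, 0.125]          # illustrative filter coefficients
fixed = [to_q15(c) for c in coeffs]
assert fixed == [16384, -8192, 4096]

# Quantization error stays below one LSB = 2**-15:
assert all(abs(from_q15(v) - c) < 2**-15 for v, c in zip(fixed, coeffs))
```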
Task formation algorithms can use either direct user input (e.g., the execution rate of each block in
discrete-time portions of a Simulink or Ascet-SD design), or static scheduling algorithms for dataflow
models (e.g., based on relative block-to-block rate specifications in SPW or System Studio [18,19]).
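The static scheduling idea for dataflow models can be illustrated on a single edge of a synchronous dataflow graph. The balance equations determine how many times each block must fire per period so that tokens do not accumulate; those firings can then be grouped into one generated task. The two-block example and its rates are invented for illustration.

```python
from math import gcd

def repetitions(produce: int, consume: int) -> tuple[int, int]:
    """Smallest firing counts (r_src, r_dst) satisfying the balance
    equation produce * r_src == consume * r_dst on one dataflow edge."""
    g = gcd(produce, consume)
    return consume // g, produce // g

# A rate converter: the upstream block emits 3 samples per firing,
# the downstream block consumes 2 per firing.
r_src, r_dst = repetitions(3, 2)
assert (r_src, r_dst) == (2, 3)
assert 3 * r_src == 2 * r_dst   # balanced: one task body fires src twice, dst three times
```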
Simulink is also tightly integrated with StateFlow, a design tool for control-dominated applications, in
order to ease the integration of decision-making and computation code. It also allows one to smoothly generate
both hardware and software from the very same specification. This capability, as well as integration
with some sort of Statechart-based finite state machine (FSM) editor, is available in most tools in the first
class above. The difference in market share can be attributed to the availability of Simulink toolboxes for
numerous embedded system design tasks (from fixed-point optimization to FPGA [Field Programmable
Gate Array]-based implementation) and its widespread adoption in undergraduate university courses,
which makes it well known to most of today's engineers.
The second class of tools either plays an ancillary role in the design of embedded control systems (e.g.,
StateFlow and Esterel Studio), or is devoted to inherently control-dominated application areas, such
as telecommunication protocols. In the latter market the clear dominator today is Tau. The underlying
languages, such as the Specification and Description Language (SDL) and Message Sequence Charts, are
standardized by the International Telecommunication Union (ITU). They are commonly used to describe
protocol standards in a tool-independent way; thus modeling in SDL is quite natural in this application
domain, since validation and refinement can proceed formally within a unified environment. Tau also has
code generation capabilities for both application code and customization of the real-time kernels on which
the FSM-generated code will run. The use of Tau for embedded code generation (model-based design)
significantly predates that of Simulink-based code generators, mostly due to the highly complex nature of
telecom protocols and the less demanding memory and computing power constraints of switches and
other networking equipment.
Tau has links to the requirements capture tool Doors [9], also from Telelogic, which allows one to
trace dependencies between multiple requirements written in English, and to connect them to the aspects of the
embedded system design files that implement these requirements. The state of the art of such requirement
tracing, however, is far from satisfactory, since there is no formal means in Doors to automatically check
for violations. Similar capabilities are provided by Reqtify [20].
Techniques for automated functional constraint validation, starting from formal languages, are
described in several books, for example, References 21 and 22. Deadline, latency, and throughput constraints
are special kinds of nonfunctional requirements that have received extensive treatment in the
real-time scheduling community. They are also covered in several books, for example, References 23 to 25.
While model-based functional verification is quite attractive, due to its high abstraction level, it ignores
the cost and performance implications of algorithmic decisions. These are taken into account by the tools
described in the next section.
3.4 Function–Architecture and Hardware–Software Codesign
In this section, we describe some of the tools that are available to help embedded system designers to
optimally architect the implementation of the system, and choose the best solution for each functional
component. After these decisions have been made, detailed design can proceed using the languages, tools,
and methods described in the following chapters in this book.
This step of the design process, whose general structure was outlined in Section 3.2 using
the platform-based design paradigm, has received various names in the past. Early work [26,27] called
it hardware–software codesign (or cosynthesis), because one of the key decisions at this level is which
functionality has to be implemented in software and which in dedicated hardware, and how the two partitions
of the design interact with minimum cost and maximum performance.
Later on, people came to realize that hardware–software was too coarse a granularity, and that more
implementation choices had to be taken into account. For example, one could trade off single versus multiple
processors, general-purpose CPUs versus specialized DSPs and Application-Specific Instruction-set
Processors (ASIPs), dedicated ASICs versus ASSPs (e.g., an MPEG coprocessor or an Ethernet Medium
Access Controller), and standard cells versus FPGAs. Thus the term function–architecture codesign was
coined [1], to refer to the more complex problem of partitioning a given functionality onto a heterogeneous
architecture such as the one in Figure 3.1.
The term "system-level design" has also had some popularity in the industry [6,28], to indicate the level
of design above Register Transfer, at which software and hardware interact. Other terms, such as "timed
functional model," have also been used [29].
The key problems that are tackled by tools acting as a bridge between the system-level application and
the architectural platform are:
1. How to model the performance impact of mapping decisions from a virtually
implementation-independent functional specification to an architectural model.
2. How to efficiently drive downstream code generation, synthesis, and validation tools, to avoid
redoing the modeling effort from scratch at the RTL, C, or assembly code levels, respectively. The
notion of automated implementation generation from a high-level functional model is called
model-based design in the software world.
In both cases, the notion of an implementation-independent functional specification, which
can be retargeted indifferently to hardware and software implementations, must be carefully evaluated and
considered. Taken in its most literal terms, this idea has often been dismissed as a myth. However, current
practice shows that it is already a reality, at least for some application domains (automotive electronics
and telecommunication protocols). It is intuitively very appealing, since it can be considered a high-level
application of the platform-based design principle, using a formal system-level platform. Such
a platform, embodied in one of the several models of computation that are used in embedded system
design, is a perfect candidate to maximize design reuse, and to optimally exploit different implementation
options.
In particular, several of the tools mentioned in the previous section (e.g., Simulink,
TargetLink, StateFlow, SPW, System Studio, Tau, Ascet-SD, StateMate, Esterel Studio) have code generation
capabilities that are considered good enough for implementation, and not just for rapid prototyping
and simulation acceleration. Moreover, several of them (e.g., Simulink, StateFlow, SPW, System Studio,
StateMate, Esterel Studio) can generate both C for software implementation and synthesizable
VHDL or Verilog for hardware implementation. Unfortunately, these code generation capabilities often
require the laborious creation of implementation models for each target platform (e.g., software in C or
assembler for a given DSP, synthesizable VHDL or a macroblock netlist for ASIC or FPGA, etc.). However,
since these careful implementations are instances of the system-level platform mentioned above, their
development cost can be shared among a multitude of designs performed using the tool.
Most block diagram or Statechart-based code generators work in a syntax-directed fashion. A piece
of C or synthesizable VHDL code is generated for each block and connection, or for each hierarchical
state and transition. Thus the designer has tight control over the complexity of the generated software or
hardware. While this is a convenient means to bring manual optimization capabilities within the model-based
design flow, it has a potentially significant disadvantage in terms of cost and performance (comparable
to disabling optimizations in the case of a C compiler). On the other hand, more recent tools, such as
Esterel Studio and System Studio, take a more radical approach to code generation, based on aggressive
optimizations [30]. These optimizations, based on logic synthesis techniques also in the case of software
implementation, destroy the original model structure, and thus make debugging and maintenance much
harder. However, they can result in an order of magnitude improvement in terms of cost (memory size)
and performance (execution speed) with respect to their syntax-directed counterparts [31].
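The syntax-directed style described above can be sketched in a few lines: one C statement is emitted per block and one variable per connection, with no optimization across blocks. The tiny netlist format and the generator are invented for illustration and are not the code generator of any tool named here.

```python
# A hypothetical block netlist: (output variable, operator, input variables).
netlist = [
    ("t0", "*", ["x", "k"]),
    ("t1", "+", ["t0", "bias"]),
    ("y",  "*", ["t1", "gain"]),
]

def emit_c(blocks) -> str:
    """Syntax-directed generation: one C statement per block, in order.
    The structure of the model is preserved one-to-one in the code."""
    body = "\n".join(
        f"    double {out} = {f' {op} '.join(ins)};" for out, op, ins in blocks
    )
    return ("double step(double x, double k, double bias, double gain) {\n"
            + body + "\n    return y;\n}")

code = emit_c(netlist)
assert "double t0 = x * k;" in code
```

A logic-synthesis-style generator would instead collapse the three statements into one expression, losing the block structure (and with it easy debugging) in exchange for smaller, faster code.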
Assuming that good automated code generation, or manual design, is available for each block in the
functional model of the application, we are now faced with the function–architecture codesign problem.
This essentially means tuning the functional decomposition, as well as the algorithms employed by the
overall functional model and each block within it, to the available architecture, and vice versa.
Several design environments help the designer in this task, for example:
• POLIS [1], COSYMA [26], Vulcan [27], COSMOS [32], and Roses [33] in the academic world
• Real Time Studio [14], Foresight [34], and CARDtools [35] in the commercial world
They all rely, in some form, on the notion of independence between the functional specification on one
side, and hardware–software partitioning or architecture mapping choices on the other.
The step of performance evaluation is performed in an abstract, approximate manner by the tools
listed above. Some of them use estimators to evaluate the cost and performance of mapping a functional
block to an architectural block. Others (e.g., POLIS) rely on cycle-approximate simulation to perform the
same task in a manner that better reflects real-life effects, such as burstiness of resource occupation and
so on. Techniques for deriving both abstract static performance models (e.g., the WCET of a software
task) and performance simulation models are discussed below.
In all cases, both the cost of computation and that of communication must be taken into account.
This is because the best implementation, especially in the case of multimedia systems that manipulate
large amounts of image and sound data, is often one that reduces the amount of data transferred between
multiple memory locations, rather than one that finds the absolute best trade-off between software
flexibility and hardware efficiency. In this area, the Atomium project at IMEC [2,36] has focused on finding
the best memory architecture and schedule of memory transfers for data-dominated applications on mixed
hardware–software platforms. By exploiting array access models based on polyhedra, they identify the best
reorganization of the inner loops of DSP kernels and the best embedded memory architecture. The goal is to
reduce memory traffic due to register spills, and to maximize overall performance by accessing several
memories in parallel (many DSPs offer this opportunity even in the embedded software domain). A very
interesting aspect of Atomium, which distinguishes it from most other optimization tools for embedded
systems, is its ability to return a set of Pareto-optimal solutions (i.e., solutions that are not strictly
better than one another in at least one aspect of the cost function), rather than a single solution. This
allows the designer to pick the best point based on the various aspects of cost and performance (e.g.,
silicon area versus power and performance), rather than forcing the abstraction of optimality into a single
number.
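The Pareto-filtering idea can be sketched directly. The candidate cost tuples (say, area, power, latency, all to be minimized) are invented; the filter keeps every point not dominated, that is, not worse or equal in all dimensions compared with some other point. This illustrates the concept only, not Atomium's actual algorithm.

```python
def pareto(points):
    """Return the Pareto-optimal subset of cost tuples (lower is better)."""
    def dominated(p, q):
        # q dominates p if q is no worse in every dimension and differs.
        return all(b <= a for a, b in zip(p, q)) and q != p
    return [p for p in points if not any(dominated(p, q) for q in points)]

# Hypothetical (area, power, latency) tuples for four candidate designs.
candidates = [(4, 10, 7), (5, 9, 7), (6, 12, 9), (4, 11, 6)]
front = pareto(candidates)
assert (6, 12, 9) not in front        # strictly worse than (4, 10, 7)
assert (5, 9, 7) in front             # best power: kept despite larger area
```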
Performance analysis can be based on simulation, as mentioned above, or can rely on automatically
constructed models that reflect the WCET of pieces of software (e.g., RTOS tasks) running on an embedded
processor. Such models, which must be both provably conservative and reasonably accurate, can be
constructed by using an execution model called abstract interpretation [37]. This technique traverses
the software code while building a symbolic model, often in the form of linear inequalities [38,39],
which represents the requests that the software makes to the underlying hardware (e.g., code fetches,
data loads and stores, code execution). A solution to those inequalities then represents the total cost
of one execution of the given task. It can then be combined with processor, bus, cache, and main
memory models that in turn compute the cost of each of these requests in terms of time (clock cycles) or
energy. This finally results in a complete model for the cost of mapping that task to those architectural
resources.
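The core of the WCET computation can be shown on a loop-free control-flow graph. Real tools solve the path inequalities as an integer linear program; for a DAG the bound reduces to a longest-path search, which is enough to show the principle. The per-block cycle costs and the CFG are invented.

```python
# Per-basic-block costs in cycles, as a processor model might provide
# them (illustrative numbers), and a hypothetical if-then-else CFG.
cost = {"entry": 2, "then": 8, "else": 3, "exit": 1}
succ = {"entry": ["then", "else"], "then": ["exit"], "else": ["exit"], "exit": []}

def wcet(node: str) -> int:
    """Conservative bound: longest-path cost from node to program exit."""
    return cost[node] + max((wcet(s) for s in succ[node]), default=0)

# The 'then' branch dominates: 2 (entry) + 8 (then) + 1 (exit) cycles.
assert wcet("entry") == 11
```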
Another technique for software performance analysis, which does not require detailed models of the
hardware, uses an approximate compilation step from the functional model to an executable model
(rather than a set of inequalities as above) annotated with the same set of fetch, load, store, and execute
requests. Then simulation is used, in a more traditional setting, to analyze the cost of implementing
that functionality on a given processor, bus, cache, and memory configuration. Simulation is more
effective than WCET analysis in handling multiprocessor implementations, in which bus conflicts and
cache pollution can be difficult, if not utterly impossible, to predict statically in a manner that is not
too conservative. However, its success in identifying the true worst case depends on the designer's ability
to provide the appropriate simulation scenarios. Coverage enhancement techniques from the hardware
verification world [40,41] can be extended to help in this case as well.
Similar abstract models can be constructed in the case of implementation as dedicated hardware,
by using high-level synthesis techniques. Such techniques are not yet good enough to generate production-
quality RTL code, but can be considered as a reasonable estimator of area, timing, and energy costs for
both ASIC and FPGA implementations [4244].
SystemC [29] and SpecC [45,46], on the other hand, are more traditional modeling and simulation
languages, for which the design flow is based on successive refinement rather than on codesign or mapping.
Finally, OPNET [47] and NS [48] are simulators with rich modeling libraries specialized for wireline and
wireless networking applications. They help the designer in the more abstract task of generic performance
analysis, without the notion of function–architecture separation and codesign.
Communication performance analysis, on the other hand, is generally not done using approximate compilation
or WCET analysis techniques like those outlined above. Communication is generally implemented
not by synthesis but by refinement using patterns and recipes, such as interrupt-based or DMA-based transfers.
Thus several design environments and languages at the function–architecture level, such as POLIS,
COSMOS, Roses, SystemC, and SpecC, as well as N2C [6], provide mechanisms to replace abstract communication,
for example, FIFO-based or discrete-event-based, with detailed protocol stacks using buses,
interrupt controllers, memories, drivers, and so on. These refinements can then be estimated either using
a library-based approach (they are generally part of a library of implementation choices anyway), or
sometimes using the approaches described above for computation. Their cost and performance can thus
be combined in an overall system-level performance analysis.
However, approximate performance analysis is often not good enough, and a more detailed
simulation step is required. This can be achieved by using tools, such as Seamless [49], CoMET [50],
MaxSim [51], and N2C [6]. They work at a lower abstraction level, by cosimulating software running on
Instruction Set Simulators (ISSs) and hardware running in a Verilog or VHDL simulator. While the simulation is often slower than with more abstract models, and dramatically slower than with static estimators,
the precision can now be at the cycle level. Thus it permits close investigation of detailed communication
aspects, such as interrupt handling and cache behavior. These approaches are further discussed in the next
section.
The key advantage of using the mapping-based approach over the traditional design–evaluate–redesign one is the speed with which design space exploration can be performed. This is done by setting up experiments that change either mapping choices or parameters of the architecture (e.g., cache size, processor speed, or bus bandwidth). Key decisions, such as the number of processors and the organization of the bus hierarchy, can thus be based on quantitative application-dependent data, rather than on past experience. If mapping can then be used to drive synthesis, in addition to simulation and formal verification, advantages in terms of time-to-market and reduction of design effort are even more significant. Model-based code generation, as we mentioned in the previous section, is reasonably mature, especially for embedded software in application areas such as avionics, automotive electronics, and telecommunications. In these areas, considerations other than absolute minimum memory footprint and execution time, for example, safety, sheer complexity, and time-to-market, dominate the design criteria.
At the very least, if some form of automated model-based synthesis is available, it can be used to rapidly generate FPGA- and processor-based prototypes of the embedded system. This significantly speeds up verification with respect to workstation-based simulation. It even permits some hardware-in-the-loop validation for cases (e.g., the notion of driveability of a car) in which no formalization or simulation is possible, but a real physical experiment is required.
2006 by Taylor & Francis Group, LLC
ZURA: 2824_C003 2005/6/21 20:01 page 11 #13
Design of Embedded Systems 3-11
3.5 Hardware–Software Coverification and Hardware Simulation
Traditionally, the term hardware–software codesign has been identified with the ability to execute a simulation of the hardware and the software at the same time. We prefer to use the term hardware–software coverification for this task, and leave codesign for the synthesis- and mapping-oriented approaches outlined in the previous section. In the form of simultaneously running an ISS and a Hardware Description Language (HDL) simulator, while keeping the timing of the two synchronized, the area is not new [52]. In recent years, however, we have seen a number of approaches to speeding up the task, in order to tackle platforms with several processors, and the need, for example, to boot an operating system in order to coverify a platform with a processor and its peripherals.
Recent techniques have been devoted to the three main ways in which cosimulation speed can be
increased:
Accelerate the hardware simulator. Coverification generally works at the clock-cycle-accurate level, meaning that both the hardware simulator and the ISS view time as a sequence of discrete clock cycles, ignoring finer aspects of timing (sometimes clock phases are considered, e.g., for DSP systems, in which different memory banks are accessed in different phases of the same cycle). This allows one to speed up simulation with respect to traditional event-driven logic simulation, and yet retain enough precision to identify bottlenecks such as interrupt service latency or bus arbitration overhead.
Native-code hardware simulation (e.g., NCSim [28]) and emulation (e.g., QuickTurn [28] and Mentor Emulation [49]) can be used to further speed up hardware simulation, at the expense of longer compilation times and much higher costs, respectively.
Accelerate the ISS. Compiled-code simulation has been a popular topic in this area as well [53]. The technique compiles a piece of assembler or C code for a target processor into object code that can be run on a host workstation. This code generally also contains annotations counting clock cycles by modeling the processor pipeline. The speed-up that can be achieved with this technique over a traditional ISS, which fetches, decodes, and executes each target instruction individually, is significant (at least one order of magnitude). Unfortunately, this technique is not suitable for self-modifying code, such as that of an RTOS. This means that it is difficult to adapt to modern embedded software, which almost invariably runs under RTOS control, rather than on the bare CPU. However, hybrid techniques involving partial compilation on the fly are reportedly used by companies selling fast ISSs [50,51].
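As a hedged illustration of compiled-code simulation, the sketch below (in C, with an invented three-instruction basic block and assumed per-instruction cycle costs) shows the essential idea: the translated block executes natively on the host, while a single statically computed annotation updates the cycle counter, instead of fetching and decoding each target instruction.

```c
#include <stdint.h>

/* Hypothetical compiled-code ISS fragment: one target basic block
 * (add, load, branch) translated to host C code. The cycle costs
 * (1 for ALU, 3 for load, 2 for branch) are assumptions made for
 * illustration, not data for any real processor. */
typedef struct {
    uint32_t r[4];       /* target register file (subset)        */
    uint32_t mem[256];   /* target data memory (word-addressed)  */
    uint64_t cycles;     /* cycle counter updated by annotations */
} cpu_t;

/* Translated basic block: executes the instructions natively and
 * adds the statically known cycle cost of the whole block. */
void bb_translated(cpu_t *c)
{
    c->r[1] = c->r[1] + c->r[2];          /* add  r1, r1, r2 : 1 cycle  */
    c->r[3] = c->mem[c->r[1] & 0xff];     /* load r3, [r1]   : 3 cycles */
    /* branch handling omitted              branch           : 2 cycles */
    c->cycles += 1 + 3 + 2;               /* one annotation per block   */
}
```

A traditional ISS would pay the fetch–decode–dispatch overhead on every instruction; here the whole block costs one function call plus one addition to the cycle counter.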
Accelerate the interface between the two simulators. This is the area where the earliest work was performed. For example, Seamless [49] uses sophisticated filters to avoid sending requests for memory accesses over the CPU bus. This allows the bus to be used only for peripheral access, while memory data are provided to the processor directly by a memory server, which is a simulation filter sitting between the ISS and the HDL simulator. The filter reduces stimulation of the HDL simulator, and thus can result in speed-ups of one or more orders of magnitude when most of the bus traffic consists of filtered memory accesses. Of course, the precision of analysis also drops: for example, it becomes harder to identify an overload on the processor bus due to a combination of memory and peripheral accesses, since no simulator component sees both.
In the HDL domain, as mentioned above, progress in performance has been achieved essentially by raising the level of abstraction. A cycle-based simulator, that is, one that ignores the timing information within a clock cycle, can be dramatically faster than one that requires the use of a timing queue to manage time-tagged events. This is mainly due to two reasons. The first is that now most of the simulation can be executed always, at every simulation clock cycle. This means that it is much more parallelizable, while event-driven simulators do not fit well on a parallel machine due to the presence of the centralized timing queue. Of course, there is a penalty if most of the hardware is generally idle, since it has to be evaluated anyway, but clock gating techniques developed for low power consumption can obviously be applied here. The second is that the overhead of managing the time queue, which often accounts for 50% to 90% of the event-driven simulation time, can now be completely eliminated.
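The contrast with event-driven simulation can be illustrated by a minimal cycle-based kernel: the whole design, here a toy 8-bit counter with a synchronous enable (an assumption purely for illustration), is evaluated at every clock, with no time-tagged event queue at all.

```c
#include <stdint.h>

/* Minimal cycle-based simulation kernel: every clock cycle the whole
 * combinational function is evaluated and the registers are latched.
 * No event queue and no intra-cycle timing are modeled. */
typedef struct { uint8_t count; } state_t;

/* Combinational logic: next state from current state and inputs */
static state_t next_state(state_t s, int enable)
{
    state_t n = s;
    if (enable)
        n.count = (uint8_t)(s.count + 1);
    return n;
}

/* Run n clock cycles; all logic is evaluated in every cycle,
 * which is what makes the loop trivially parallelizable across
 * independent portions of a large netlist. */
state_t run_cycles(state_t s, int enable, int n)
{
    for (int i = 0; i < n; i++)
        s = next_state(s, enable);   /* evaluate, then latch */
    return s;
}
```

Note the penalty mentioned above: `next_state` is evaluated even in cycles where `enable` is 0 and nothing changes, which an event-driven simulator would skip.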
Modern HDLs either are totally cycle-based (e.g., SystemC 1.0 [29]) or have a synthesizable subset,
which is fully synchronous and thus fully compilable to cycle-based simulation. The same synthesizable
subset, by the way, is also supported by hardware emulation techniques, for obvious reasons.
Another interesting area of cosimulation in embedded system design is analog–digital cosimulation. This is because such systems quite often include analog components (amplifiers, filters, A/D and D/A converters, demodulators, oscillators, phase-locked loops [PLLs], etc.), and models of the environment quite often involve only continuous variables (distance, time, voltage, etc.). Simulink includes a component for simulating continuous-time models, employing a variety of numerical integration methods, which can be freely mixed with discrete-time sampled-data subsystems. This is very useful when modeling and simulating, for example, a control algorithm for automotive electronics, in which the engine dynamics are modeled with differential equations, while the controller is described as a set of blocks implementing a sampled-time subsystem.
Simulink is still mostly used to drive software design, despite good toolkits implementing it in reconfigurable hardware [54,55]. Simulators in the hardware design domain, on the other hand, generally use HDLs as their input languages. Analog extensions of both VHDL [56] and Verilog [57] are available. In both cases, one can represent quantities that satisfy either of Kirchhoff's laws (i.e., conserved over cycles or nodes). Thus one can easily build netlists of analog components interfacing with the digital portion, modeled using traditional Boolean or multivalued signals. The simulation environment will then take care of synchronizing the event-driven portion and the continuous-time portion. A key problem here is to avoid causality errors, when an event that happens later in host workstation time (because the simulator takes care of it later) has an effect on events that preceded it in simulated time. In this case, one of the simulators has to roll back in time, undoing any potential changes in the state of the simulation, and restart with the new information that something has happened in the past (generally the analog simulator does it, since it is easier to reverse time in that case).
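A minimal sketch of this checkpoint-and-rollback scheme, assuming a forward-Euler analog solver on the toy equation dx/dt = −x (the step size and all values below are illustrative): the analog side runs ahead of the digital side, and when a digital event arrives with a timestamp in its past, it restores the last checkpoint and re-integrates up to the event time.

```c
/* Sketch of causality repair in analog-digital cosimulation.
 * The analog solver integrates dx/dt = -x with forward Euler;
 * on a late-arriving digital event it rolls back to a saved
 * checkpoint and replays up to the event time. */
typedef struct { double t, x; } analog_state_t;

static void euler_step(analog_state_t *s, double h)
{
    s->x += h * (-s->x);   /* dx/dt = -x, explicit Euler */
    s->t += h;
}

/* Advance the analog state to event_t with step h; if the event
 * lies in the simulated past, roll back to the checkpoint first.
 * Returns x at the event time and refreshes the checkpoint. */
double cosim_advance(analog_state_t *s, analog_state_t *checkpoint,
                     double event_t, double h)
{
    if (event_t < s->t)       /* causality error: event in the past */
        *s = *checkpoint;     /* roll back: easy for the analog side */
    while (s->t + h <= event_t)
        euler_step(s, h);
    *checkpoint = *s;         /* new checkpoint at the event time */
    return s->x;
}
```

Re-integration is cheap for the analog solver because its whole state is a small vector, which is exactly why the analog simulator, rather than the event-driven one, usually performs the rollback.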
Also in this case, as we have seen for hardware–software cosimulation, execution is much slower than in the pure event-driven or cycle-based case, due to the need to take small simulation steps in the analog part. There is only one case in which the performance of the interface between the two domains, or of the continuous-time simulator, is not problematic: when the continuous-time part is much slower in reality than the digital part. A classical example is automotive electronics, in which mechanical time constants are larger by several orders of magnitude than the clock period of a modern integrated circuit. Thus the performance of continuous-time electronics and mechanical cosimulation may not be the bottleneck, except in the case of extremely complex environment models with huge systems of differential equations (e.g., accurate combustion engine models). In that case, hardware emulation of the differential equation solver is the only option (e.g., see Reference 16).
3.6 Software Implementation
The next two sections provide an overview of traditional design flows for embedded hardware and software. They are meant to be used as a general introduction to the topics described in the rest of the book, and as a source of references to standard design practice.
The software components of an embedded system are generally implemented using the traditional design–code–test–debug cycle, which is often represented using a V-shaped diagram to illustrate the fact that every implementation level of a complex software system must have a corresponding verification level (Figure 3.4). The parts of the V-cycle that relate to system design and partitioning have been described in the previous sections. Here we outline the tools that are available to the embedded software developer.
3.6.1 Compilation, Debugging, and Memory Model
Compilation of mathematical formulas into binary machine-executable code followed almost immediately the invention of the electronic computer. The first Fortran compiler dates back to 1954, and subroutines
FIGURE 3.4 V-cycle for software implementation. [Figure: a V-shaped diagram whose descending branch runs from requirements through system design and partitioning and SW design specification down to implementation, and whose ascending branch runs back up through SW integration, subsystem and communication testing, and function and system analysis to system validation and the final product.]
were introduced in 1958, resulting in the creation of the Fortran II language. Since then, languages have
evolved a little, more structured programming methodologies have been developed, and compilers have
improved quite a bit, but the basic method has remained the same. In particular, the C language, originally designed by Kernighan and Ritchie [58] between 1969 and 1972, and used extensively for programming the Unix operating system, is now dominant in the embedded system world, having almost replaced the more flexible but much more cumbersome and less portable assembler. Its descendants Java and C++ are beginning to make some inroads, but are still viewed as requiring too much memory and computing power for widespread embedded use. Java, although originally designed for embedded applications [59,60], has a memory model based on garbage collection that still defies effective embedded real-time implementation [61].
The first compilation step from a high-level language is the conversion of the human-written or machine-generated code into an internal format, called an Abstract Syntax Tree [62], which is then translated into a representation that is closer to the final output (generally assembler code) and is suitable for a host of optimizations. This representation can take the form of a control/dataflow graph or a sequence of register transfers. The internal format is then mapped, generally via a graph-matching algorithm, to the set of available machine instructions, and written out to a file. A set of assembler files, in which references to data variables and to subroutine names are still based on symbolic labels, is then converted to an absolute binary file, in which all addresses are explicit. This phase is called assembly and loading. Relocatable code generation techniques, which basically permit code and its data to be placed anywhere in memory without requiring recompilation, are now being used also in the embedded system domain, thanks to the availability of index registers and relative addressing modes in modern microprocessors.
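The label-resolution part of assembly and loading can be sketched as a classic two-pass algorithm. The one-word instruction record below is invented for illustration: pass 1 records the address of every label, pass 2 replaces each symbolic branch target with its absolute address.

```c
#include <string.h>

/* Toy two-pass assembler for the symbolic-label resolution step. */
#define MAXSYM 16

typedef struct { const char *name; int addr; } sym_t;

typedef struct {
    const char *label;    /* label defined at this address, or NULL */
    const char *target;   /* symbolic branch target, or NULL        */
    int resolved;         /* absolute target address after pass 2   */
} insn_t;

/* Returns the number of symbols collected. */
int assemble(insn_t *prog, int n)
{
    sym_t tab[MAXSYM];
    int nsym = 0;

    for (int pc = 0; pc < n; pc++) {        /* pass 1: collect labels */
        if (prog[pc].label && nsym < MAXSYM) {
            tab[nsym].name = prog[pc].label;
            tab[nsym].addr = pc;
            nsym++;
        }
    }
    for (int pc = 0; pc < n; pc++)          /* pass 2: patch references */
        if (prog[pc].target)
            for (int s = 0; s < nsym; s++)
                if (strcmp(tab[s].name, prog[pc].target) == 0)
                    prog[pc].resolved = tab[s].addr;
    return nsym;
}
```

A relocating loader performs essentially the same patching a second time, adding the base address at which the code is finally placed.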
Debuggers for modern embedded systems are much more vital than for general-purpose programming, due to the more limited accessibility of the embedded CPU (often no file system, limited display and keyboard, etc.). They must be able to show several concurrent threads of control, as they interact with each other and with the underlying hardware. They must also be able to do so while minimally disrupting the normal operation of the system, since it often has to work in real time, interacting with its environment. Both hardware and operating system support are essential, and the main RTOS vendors, such as Wind River, all provide powerful interactive multitask debuggers. Hardware support takes the form of breakpoint and watchpoint registers, which can be set to interrupt the CPU when a given address is used for fetching or
data load/store, without requiring one to change the code (which may be in ROM) or to continuously
monitor data accesses, which would dramatically slow down execution.
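The mechanism can be sketched as follows. The register layout of the debug unit below is entirely hypothetical (real debug units, such as ARM's watchpoint hardware, differ in detail); it serves only to show that watching an address is a matter of programming two registers, with no change to the code under test and no runtime slowdown.

```c
#include <stdint.h>

/* Hypothetical memory-mapped debug unit: the hardware compares every
 * data-access address against wp_addr and interrupts the CPU on a
 * match. Layout and bit assignments are invented for illustration. */
typedef struct {
    volatile uint32_t wp_addr;   /* address to watch                 */
    volatile uint32_t wp_ctrl;   /* bit 0: enable, bit 1: watch writes */
} dbg_unit_t;

#define WP_ENABLE  (1u << 0)
#define WP_WRITES  (1u << 1)

/* Arm a watchpoint that fires on any write to addr. */
void set_write_watchpoint(dbg_unit_t *dbg, uint32_t addr)
{
    dbg->wp_addr = addr;
    dbg->wp_ctrl = WP_ENABLE | WP_WRITES;
}
```

On a real target, `dbg` would be a fixed MMIO address taken from the device's memory map, and the match would raise a debug exception handled by the debugger stub.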
A key difference between most embedded software and most general-purpose software is the memory model. In the latter case, memory is viewed as an essentially infinite uniform linear array, and the compiler provides a thin layer of abstraction on top of it, by means of arrays, pointers, and records (or structs). The operating system generally provides virtual memory capabilities, in the form of user functions to allocate and deallocate memory, and by swapping less frequently used pages of main memory to disk. This provides the illusion of a memory as large as the disk area allocated to paging, but with the same direct addressability characteristics as main memory. In embedded systems, however, memory is an expensive resource, both in terms of size and speed. Cost, power, and physical size constraints generally forbid the use of virtual memory, and performance constraints force the designer to always carefully lay out data in memory, and match the characteristics of each kind of memory (SRAM, DRAM, Flash, ROM) to those of the data and code.
Scratchpads [63], that is, manually managed areas of small and fast memory, often on-chip SRAM, are still dominant in the embedded world. Caches are frowned upon in the real-time application domain, since the time at which a computation is performed often matters much more than the accuracy of its result. This is because, despite a large body of research devoted to timing analysis of software code in the presence of caches (e.g., see References 64 and 65), their performance must still be assumed to be worst-case, rather than average-case as in general-purpose and scientific computing, thus leading to poor performance at a high cost (large and power-hungry tag arrays).
However, compilers that traditionally focused on code optimizations for various underlying architectural features of the processor [66] now offer more and more support for memory-oriented optimizations, in terms of scheduling data transfers, sizing memories of various types, and allocating data to memory, sometimes moving it back and forth between fast-and-expensive and slow-and-cheap storage¹ [2,63].
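A sketch of such explicit placement and staging is shown below, assuming a GCC-style toolchain and a linker script that maps the hypothetical section names onto the corresponding physical memories; both the section names and the sizes are assumptions for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Explicit, software-managed memory layout: the programmer, not a
 * cache, decides what lives in fast on-chip SRAM. The section names
 * below are hypothetical and must be placed by the linker script. */
__attribute__((section(".scratchpad")))
int16_t window[64];                     /* hot working set: fast SRAM  */

__attribute__((section(".ext_dram")))
int16_t samples[4096];                  /* bulk data: slow, cheap DRAM */

/* Explicitly stage one window of samples into the scratchpad: the
 * software-managed analogue of a cache fill, but fully predictable. */
void stage_window(size_t start)
{
    for (size_t i = 0; i < 64; i++)
        window[i] = samples[start + i];
}
```

Unlike a cache, this transfer happens exactly when and where the code says it does, which is what keeps cost, power, and worst-case timing under tight control.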
3.6.2 Real-Time Scheduling
Another key difference with respect to general-purpose software is the real-time nature of most embedded software, due to its continual interaction with an environment that seldom can wait. In hard real-time applications, results produced after the deadline are totally useless. On the other hand, in soft real-time applications a merit function measures Quality of Service (QoS), allowing one to evaluate trade-offs between missing various deadlines and degrading the precision or resolution with which computations are performed. While the former is often associated with safety-critical (e.g., automotive or avionics) applications and the latter with multimedia and telecommunication applications, algorithm design can make a difference even within the very same domain. Consider, for example, a frame decoding algorithm that generates its result at the end of each execution, and that is scheduled to be executed in real time every 50th of a second. If the CPU load does not allow it to complete each execution before the deadline, the algorithm will not produce any results, and thus behaves as a hard real-time application, without being life-threatening. On the other hand, a smarter algorithm or a smarter scheduler would just reduce the frame size or the frame rate whenever the CPU load due to other tasks increases, and thus produce a result that has lower quality, but is still viewable.
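The "smarter algorithm" above can be sketched as a decoder that degrades its frame size instead of missing its deadline outright; the quadratic cost model and all numbers below are assumptions made purely for illustration.

```c
/* Soft real-time adaptation sketch: shrink the frame until the
 * (assumed) decode cost fits the time budget, down to a floor. */
typedef struct {
    int frame_size;      /* current resolution (pixels per side) */
    int min_size;        /* lowest acceptable quality            */
} decoder_t;

/* budget_us: time available for this frame. Cost model: decode time
 * grows with frame area (illustrative). Returns the size produced. */
int decode_adaptive(decoder_t *d, long budget_us)
{
    long cost_us = (long)d->frame_size * d->frame_size / 10;
    while (cost_us > budget_us && d->frame_size > d->min_size) {
        d->frame_size /= 2;              /* degrade gracefully */
        cost_us = (long)d->frame_size * d->frame_size / 10;
    }
    return d->frame_size;
}
```

A hard real-time decoder would simply miss the deadline and emit nothing; this one always produces some viewable frame.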
A huge amount of research, summarized in excellent books such as References 23–25, has been devoted to solving the problems introduced by real-time constraints on embedded software. Most of this work models the system (application, environment, and platform) in very abstract terms, as a set of tasks, each with a release time (when the task becomes ready), a deadline (by which the task must complete), and a WCET. In most cases tasks are periodic, that is, release times and deadlines of multiple instances of the same task are separated by a fixed period. The job of the scheduler is to find an execution order, if one exists, such that each task can complete by its deadline. The scheduler may or may not, depending on the underlying hardware and software platform (CPU, peripherals, and RTOS), be able to preempt an executing
¹ While this may seem similar to virtual memory techniques, it is generally done explicitly, always keeping cost, power, and performance under tight control.
task in order to execute another one. Generally the scheduler bases its preemption decision, and the choice of which task must be run next, on an integer rank assigned to each task, called its priority. Priorities may be assigned statically, at compile time, or dynamically, at runtime. The trade-off is between the usage of precious CPU resources for runtime (also called online) priority assignment, based on an observation of the current execution conditions, versus the waste of resources inherent in the a priori definition of a priority assignment. A scheduling algorithm is also generally expected to be able to tell conservatively if a set of tasks is unschedulable on a given platform, given a set of modeling assumptions (e.g., availability of preemption, fixed or stochastic execution time, and so on). Unschedulability may occur, for example, because the CPU is not powerful enough and the WCETs are too long to satisfy some deadline. In this case the remedy could be the choice of a faster clock frequency, a change of CPU, the transfer of some functionality to a hardware coprocessor, or the relaxation of some of the constraints (periods, deadlines, etc.).
A key distinction in this domain is between time-triggered and event-triggered scheduling [67]. The former (also called Time-Division Multiple Access in telecommunications) relies on the fact that the start, preemption (if applicable), and end times of all instances of all tasks are decided a priori, based on worst-case analysis. The resulting system implementation is very predictable, easy to debug, and allows one to guarantee some service even under fault hypotheses [68]. The latter decides start and preemption times based on the actual time of occurrence of the release events, and possibly on the actual execution time (shorter than worst-case). It is more efficient than time-triggering in terms of CPU utilization, especially when release and execution times are not known precisely but are subject to jitter. It is, however, more difficult to use in practice, because it requires some form of conservative schedulability analysis a priori, and the dynamic nature of event arrival makes troubleshooting much harder.
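A time-triggered dispatcher can be sketched as a static table indexed by the timer tick; the 4-slot schedule and the three tasks below are illustrative assumptions.

```c
/* Time-triggered dispatcher: the start time of every task instance
 * within the hyperperiod is fixed offline in a dispatch table, and
 * the periodic timer interrupt merely indexes it. */
#define SLOTS 4

typedef void (*task_fn)(void);

int runs[3];                               /* per-task invocation counts */
static void sensor(void)  { runs[0]++; }
static void control(void) { runs[1]++; }
static void actuate(void) { runs[2]++; }

/* Static schedule for one hyperperiod of 4 slots, decided a priori
 * from worst-case analysis: sensor runs twice per hyperperiod. */
static const task_fn schedule[SLOTS] = { sensor, control, sensor, actuate };

/* Called from the periodic timer interrupt. */
void tick(unsigned long tick_count)
{
    schedule[tick_count % SLOTS]();
}
```

The resulting timeline is identical on every run, which is precisely what makes time-triggered systems predictable, easy to debug, and certifiable.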
Some models and languages listed above, such as synchronous languages and dataflow networks, lend themselves well to time-triggered implementations. Some form of time-triggered scheduling is being, or will most likely be, used for both CPUs and communication resources in safety-critical applications. This is already state of the art in avionics (fly-by-wire, as used, e.g., in the Boeing 777 and in all Airbus models), and it is being seriously considered for automotive applications (X-by-wire, where X can stand for brake, drive, or steer). Coupled with certified high-level language compilers and standardized code review and testing processes, it is considered to be the only mechanism able to comply with the rules imposed by various governmental certification agencies. Moving such control functions to embedded hardware and software, thus replacing older mechanical parts, is considered essential in order to both reduce costs and improve safety. Embedded electronic systems can continuously analyze possible wear and faults in the sensors and the actuators, and thus warn drivers or maintenance teams.
The simple task-based model outlined above can also be modified in various ways in order to take into account:

• The cost of various housekeeping operations, such as recomputing priorities, swapping tasks in and out (also called context switching), accessing memory, and so on.
• The availability of multiple resources (processors).
• The fact that a task may need more than one resource (e.g., the CPU, a peripheral, a lock on a given part of memory), and possibly may have different priorities and different preemptability characteristics on each such resource (e.g., CPU access may be preemptable, while disk or serial line access may not).
• Data or control dependencies between tasks.
Most of these refinements of the initial model can be taken into account by appropriately modifying the basic parameters of a task set (release time, execution time, priority, and so on). The only exception is the extension to multiple concurrent CPUs, which makes the problem substantially more complex. We refer the interested reader to References 23–25 for more information about this subject. This sort of real-time schedulability analysis is currently replacing manual trial-and-error and extensive simulation as a means to ensure satisfaction of deadlines or of a given QoS requirement.
3.7 Hardware Implementation
The modern hardware implementation process [69,70] in most cases starts from the so-called RTL. At this
level of abstraction the required functionality of the circuit is modeled with the accuracy of a clock cycle,
that is, it is known in which clock cycle each operation, such as addition or data transfer, occurs, but the
actual delay of each operation, and hence the stabilization time of data on the inputs of the registers, is not
known. At this level the number of registers and their bitwidths are also precisely known. The designer
usually writes the model using an HDL, such as Verilog or VHDL, in which registers are represented using
special kinds of clock-triggered assignments, and combinational logic operations are represented using
the standard arithmetic, relational, and Boolean operators that are familiar to software programmers
using high-level languages.
The target implementation generally is not in terms of individual transistors and wires, but uses the Boolean gate abstraction as a convenient hand-off point between logic designer and technology specialist. Such abstraction can take the form of a standard cell, that is, an interconnection of transistors realized and well characterized on silicon, which implements a given Boolean function, and exhibits a specific propagation delay from inputs to outputs, under given supply, temperature, and load conditions. It can also be a Combinational Logic Block (CLB) in an FPGA. The former, which is the basis of the modern ASIC design flow, is much more efficient than the latter;² however, it requires a very significant investment in terms of EDA³ tools, mask production costs, and engineer training.
The advantage of ASICs over FPGAs in terms of area, power, and performance efficiency comes from two main factors. The first is the broader choice of basic gates: an average standard cell library includes about 100 to 500 gates, with both different logic functions and different drive strengths, while a given FPGA contains only one type of CLB. The second is the use of static interconnection techniques, that is, wires and contact vias, versus the transistor-based dynamic interconnects of FPGAs.
The much higher nonrecurrent engineering cost of ASICs comes first of all from the need to create at least one set of masks for each design (assuming it is correct the first time, that is, there is no need to respin), which can be up to about $1 million for current technologies and is growing very fast, and from the long fabrication times, which can be up to several weeks. Design costs are also higher, again in the million-dollar range, both due to the much greater flexibility, requiring skilled personnel and sophisticated implementation tools, and due to the very high cost of design failure, requiring sophisticated verification tools. Thus ASIC designs are the most economically viable solution only for very high volumes. The rising mask costs and manufacturing risks are making the FPGA option viable for larger and larger production counts as technology evolves. A third alternative, structured ASICs, has been proposed recently. It features fixed layout schemes, similar to FPGAs, but implements interconnect using contact vias. A comparison of the alternatives, for a given design complexity and varying production volumes, is shown in Figure 3.5 (the exact points at which each alternative is best are still subject to debate, and they are moving to the right over time).
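The trade-off of Figure 3.5 reduces to comparing a fixed nonrecurrent (NRE) cost plus a per-unit cost for each option. The sketch below uses invented cost figures purely to illustrate how the cheapest option shifts with volume; it is not real cost data.

```c
/* Total-cost model behind the ASIC/FPGA/structured-ASIC comparison:
 * total = NRE + unit_cost * volume. All figures are hypothetical. */
typedef struct { double nre; double unit; } process_t;

double total_cost(process_t p, long volume)
{
    return p.nre + p.unit * (double)volume;
}

/* Given the three options in the order FPGA (0), structured ASIC (1),
 * standard-cell ASIC (2), return the index of the cheapest at the
 * given production volume. */
int cheapest(const process_t opt[3], long volume)
{
    int best = 0;
    for (int i = 1; i < 3; i++)
        if (total_cost(opt[i], volume) < total_cost(opt[best], volume))
            best = i;
    return best;
}
```

With high-NRE/low-unit-cost ASICs at one extreme and zero-NRE/high-unit-cost FPGAs at the other, the winner changes twice as volume grows, reproducing the crossovers of Figure 3.5.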
3.7.1 Logic Synthesis and Equivalence Checking
The semantics of HDLs and of languages such as C or Java are very different from each other. HDLs were born in the 1970s in order to model highly concurrent hardware systems, built using registers and Boolean gates. They, and the associated simulators that allow one to analyze the behavior of the modeled design in detail, are very efficient in handling fine-grained concurrency and synchronization, which is necessary when simulating huge Boolean netlists. However, they often lack constructs found in modern programming languages, such as recursive functions and complex data types (only recently introduced in Verilog), or objects, methods, and interfaces. An HDL model is essentially meant to be simulated under
² The difference is about one order of magnitude in terms of area, power, and performance for the current fabrication technology, and the ratio is expected to remain constant over future technology generations.
³ The term EDA, which stands for Electronic Design Automation, is often used to distinguish this class of tools from the CAD tools used for mechanical and civil engineering design.
FIGURE 3.5 Comparison between ASIC, FPGA, and Structured ASIC production costs. [Figure: total cost versus production volume for FPGA, structured ASIC (SA), and standard cell implementations; the cheapest option changes first from FPGA to structured ASIC, and then to standard cells, at crossover volumes roughly in the 10,000 to 100,000 range and above.]
a variety of timing models (generally at the register transfer or gate level, even though cosimulation with analog components or continuous-time models is also supported, that is, in Verilog-AMS and AHDL).
Synthesis from an HDL into an interconnection of registers and gates normally consists of two substeps. The first, called RTL synthesis and module generation, transforms high-level operators, such as adders, multiplexers, and so on, into Boolean gates using an appropriate architecture (e.g., ripple carry or carry lookahead). The second, called logic synthesis, optimizes the combinational logic resulting from the above step, under a variety of cost and performance constraints [71,72].
It is well known that, given a function to be implemented (e.g., 32-bit two's-complement addition), one can use the properties of Boolean algebra in order to find alternative implementations with different characteristics in terms of:

1. Area, for example, estimated as the number of gates, or as the number of gate inputs, or as the number of literals in the Boolean expression representing each gate function, or using a specific value for each gate selected from the standard cell library, or even considering an estimate of interconnect area. This sequence of cost functions increases in estimation precision, but is more and more expensive to compute.
2. Delay, for example, estimated as the number of levels, or more precisely as a combination of levels and fanout of each gate, or even more precisely as a table that takes into account gate type, transistor size, input transition slope, output capacitance, and so on.
3. Power, for example, estimated as transition activity times capacitance times voltage squared, using the well-known equation valid for Complementary MOS (CMOS) transistors.

It is also well known that Pareto-optimal solutions to this problem generally exhibit an area-delay product that is approximately constant for a given function.
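The CMOS dynamic power estimate mentioned in item 3 is commonly written P = α · C · V² · f, with switching activity α, switched capacitance C, supply voltage V, and clock frequency f. A one-line sketch, with purely illustrative parameter values used in the usage note:

```c
/* Dynamic power of CMOS logic: P = alpha * C * V^2 * f.
 * alpha: average switching activity (transitions per cycle),
 * cap_f: total switched capacitance in farads,
 * vdd:   supply voltage in volts,
 * freq_hz: clock frequency in hertz. Returns watts. */
double cmos_dynamic_power(double alpha, double cap_f, double vdd, double freq_hz)
{
    return alpha * cap_f * vdd * vdd * freq_hz;
}
```

For example, α = 0.2, C = 1 nF, V = 1.2 V, f = 100 MHz gives about 29 mW; the quadratic dependence on V is why supply-voltage scaling is the most effective power lever.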
Modern EDA tools, such as Design Compiler from Synopsys [8], RTL Compiler from Cadence [28], Leonardo Spectrum from Mentor Graphics [49], Synplify from Synplicity [73], Blast Create from Magma Design Automation [74], and others, perform this task efficiently for designs that today may include a few million gates. Their widespread adoption has enabled designers to tackle huge designs in a matter of months, which would have been unthinkable or extremely inefficient using either manual or purely block-based design techniques. Such logic synthesis systems take into account the required functionality, the target clock cycle, and the set of physical gates that are available for implementation (the standard-cell library or the CLB characteristics, e.g., number of inputs), as well as some estimates of capacitance and resistance of interconnection wires,⁴ and generate efficient netlists of Boolean gates, which can be passed on to the following design steps.
⁴ Some such tools also include rough placement and routing steps, which will be described below, in order to increase the precision of such interconnect estimates for current deep submicron (DSM) technologies.
2006 by Taylor & Francis Group, LLC
ZURA: 2824_C003 2005/6/21 20:01 page 18 #20
3-18 Embedded Systems Handbook
While synthesis is performed using precise algebraic identities, bugs can creep into any program.
Thus, in order to avoid extremely costly respins due to an EDA tool bug, it is essential to verify that the
functionality of the synthesized gate netlist is the same as that of the original RTL model. This verification
step was traditionally performed using a multilevel HDL simulator, comparing responses to designer-written
stimuli in both representations. However, multimillion-gate circuits would require too many very
slow simulation steps (a large circuit today can be simulated at a speed of only a handful of clock cycles
per second). Formal verification is thus used to prove, using algorithms that are based on the same laws
as the synthesis techniques, but which have been written by different people and thus hopefully have different
bugs, that the responses of the two circuit models are indeed identical under all legal input sequences.
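The idea of checking that two combinational circuit models respond identically can be sketched, for purely illustrative purposes, as brute-force enumeration over all input vectors; production equivalence checkers use BDD- or SAT-based algorithms instead, and the full-adder "spec" and XOR-chain "netlist" below are invented examples.

```python
from itertools import product

def equivalent(f, g, n_inputs):
    """Exhaustively compare two combinational functions over all 2^n inputs.
    Real equivalence checkers use BDDs or SAT instead of enumeration."""
    return all(f(*v) == g(*v) for v in product([0, 1], repeat=n_inputs))

# RTL "spec": the sum bit of a 1-bit full adder;
# synthesized "netlist": a chain of XOR gates.
spec = lambda a, b, cin: (a + b + cin) % 2
netlist = lambda a, b, cin: a ^ b ^ cin
print(equivalent(spec, netlist, 3))  # True
```

The enumeration is exponential in the number of inputs, which is precisely why symbolic techniques are indispensable for real circuits.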
This verification, however, solves only half of the problem. One must also check that all combinational
logic computations complete within the required clock cycle. This second check can be performed using
timing simulators; complexity considerations, however, suggest the use of a more static approach.
Static Timing Analysis, based on worst-case longest-path search within combinational logic, is today
a workhorse of any logic synthesis and verification framework. It can be based on purely topological
information, or consider only so-called true paths along which a transition can propagate [75], or even
include the effects of crosstalk on path delay. Crosstalk may alter the delay of a victim wire, due to
simultaneous transitions of temporally and spatially close aggressor wires, as analyzed by tools such
as PrimeTime from Synopsys [8] and CeltIc from Cadence [28]. This kind of coupling between timing and
geometry makes crosstalk-aware timing analysis very hard, and contributes substantially to the breaking of
traditional boundaries between synthesis, placement, and routing.
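The core of static timing analysis, worst-case longest-path search over a combinational DAG, can be sketched as follows; the three-gate netlist and its delay values are hypothetical, and real tools additionally model gate type, fanout, input slopes, and interconnect, as described above.

```python
def longest_path_delay(gates, delays):
    """Static timing analysis as a longest-path search over a combinational
    DAG. `gates` maps each gate to its fanin list; primary inputs have no
    entry. `delays` gives each gate's propagation delay (arbitrary units)."""
    arrival = {}

    def arr(g):
        # Arrival time at a gate output: its own delay plus the latest fanin.
        if g not in arrival:
            fanin = gates.get(g, [])
            arrival[g] = delays.get(g, 0.0) + max((arr(f) for f in fanin), default=0.0)
        return arrival[g]

    return max(arr(g) for g in gates)

# Tiny example: inputs a, b feed g1, g2, which converge on g3.
netlist = {"g1": ["a"], "g2": ["b"], "g3": ["g1", "g2"]}
delays = {"g1": 1.0, "g2": 2.5, "g3": 1.5}
print(longest_path_delay(netlist, delays))  # 4.0 (b -> g2 -> g3)
```

Comparing the reported worst-case arrival time against the clock period is exactly the check described in the text.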
Tools performing these tasks are available from all major EDA vendors (e.g., Synopsys, Cadence)
as well as from a host of startups. Synthesis has become more or less a commodity technology, while
formal verification, even in its simplest form of equivalence checking, as well as in other emerging forms,
such as property checking, which is described below, is still an emerging technology, for which disruptive
innovation occurs mostly in smaller companies.
3.7.2 Placement, Routing, and Extraction
After synthesis (and sometimes during synthesis) gates are placed on silicon, either at fixed locations (the
positions of CLBs) for FPGAs and Structured ASICs, or with a row-based organization for standard cell
ASICs. Placement must avoid overlaps between cells, while at the same time satisfying clock cycle time
constraints by avoiding excessively long wires on critical paths.⁵
Placement, especially for multimillion-gate circuits, is an extremely difficult problem, which requires
complex constrained combinatorial optimization. Modern algorithms [76] drastically simplify the model
in order to ensure reasonable runtimes. For example, the quadratic placement model used in several
modern EDA tools minimizes the sum of squares of net lengths. This permits very efficient derivation
of the cost function and fast identification of a minimum-cost solution. However, this quadratic cost
only approximately correlates with the true objective, which is the minimization of the clock period, due
to parasitic capacitance. The true cost first of all depends also on the actual interconnect, which is designed
only later by the routing step, and second depends on the maximum among a set of sums (one for each
register-to-register path), rather than on the sum over all gate-to-gate interconnects. For this reason,
modern placers interleave steps solved using fast but approximate algorithms with more precise analysis
phases, often involving actual routing, in order to recompute the actual cost function at each step.
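The quadratic placement idea can be illustrated in one dimension: minimizing the sum of squared two-pin net lengths places each movable cell at the average of its neighbors' positions, which simple Gauss–Seidel iteration finds. The two-cell chain between fixed pads below is an invented example; real placers work in two dimensions, with overlap removal and timing-driven net weights.

```python
def quadratic_place_1d(nets, fixed, movable, iters=200):
    """One-dimensional quadratic placement sketch. `nets` is a list of
    two-pin nets (a, b); `fixed` maps pad names to coordinates; each
    movable cell is iteratively moved to the average of its neighbors,
    the minimum of the sum-of-squared-lengths cost."""
    pos = dict(fixed)
    pos.update({c: 0.0 for c in movable})
    adj = {c: [] for c in movable}
    for a, b in nets:
        if a in adj:
            adj[a].append(b)
        if b in adj:
            adj[b].append(a)
    for _ in range(iters):
        for c in movable:
            pos[c] = sum(pos[n] for n in adj[c]) / len(adj[c])
    return pos

# Pads at x=0 and x=10; chain p0 - u - v - p1.
nets = [("p0", "u"), ("u", "v"), ("v", "p1")]
pos = quadratic_place_1d(nets, {"p0": 0.0, "p1": 10.0}, ["u", "v"])
print(round(pos["u"], 3), round(pos["v"], 3))  # 3.333 6.667
```

The closed-form optimum spreads the chain evenly, which also shows the model's weakness noted in the text: it optimizes total squared wirelength, not the critical register-to-register path.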
Routing is the next step, and involves generating (or selecting from the available prelaid-out tracks
in FPGAs) the metal and via geometries that will interconnect the placed cells. It is also extremely difficult
in modern submicron technologies, not only due to the huge number of geometries involved (10 million
gates can easily involve a billion wire segments and contacts), but also due to the complexity of modern
⁵Power density has recently become a prime concern for placement as well, implying the need to avoid hot spots
of very active cells, where power dissipation through the silicon substrate would be too difficult to manage.
Design of Embedded Systems 3-19
interconnect modeling. A wire used to be modeled, in CMOS technology, essentially as a parasitic capacitance.
This (or minor variations also considering resistance) is still the model used by several commercial
logic synthesis tools. However, nowadays a realistic model of a wire, to be used when estimating the cost
of a placement or of a routing solution, must take into account:

Realistic resistance and capacitance, for example, using the Elmore model [77], considering each
wire segment separately, due to the very different resistance and capacitance characteristics of
different metal layers.⁶

Crosstalk noise due to capacitive coupling.⁷
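The Elmore model cited above can be sketched for a simple RC ladder: each segment's resistance multiplies all capacitance downstream of it. The per-segment resistance and capacitance values below are hypothetical.

```python
def elmore_delay(segments):
    """Elmore delay of an RC ladder: each segment i contributes
    R_i times the total capacitance downstream of (and including) itself.
    `segments` is a list of (R_ohm, C_farad) pairs from driver to load."""
    delay = 0.0
    for i, (r, _) in enumerate(segments):
        downstream_c = sum(c for _, c in segments[i:])
        delay += r * downstream_c
    return delay

# Hypothetical 3-segment wire: 100 ohm and 10 fF per segment.
print(elmore_delay([(100, 10e-15)] * 3))  # about 6 ps for this wire
```

Because resistance near the driver sees all downstream capacitance, the model captures why long wires on resistive lower metal layers are so much slower than the same wires on upper layers.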
This means that, exactly as in placement (and sometimes during placement), one needs to alternate
between fast routing using approximate cost functions and detailed analysis steps that refine the value of the
cost function. Again, all major EDA vendors offer solutions to the routing problem, which are generally
tightly integrated with the placement tool, even though in principle the two perform separate functions.
The reason for the tight coupling lies in the above-mentioned need for the placer to accurately estimate
the detailed route taken by a given interconnect, rather than just approximating it with the square of the distance
between its terminals.
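A classical grid-routing method, Lee's breadth-first wavefront algorithm, gives a flavor of what the routing step computes; the 3×3 grid with one blocked row below is an invented example, and production routers must additionally handle multiple metal layers, vias, and design rules.

```python
from collections import deque

def lee_route(grid, src, dst):
    """Lee's maze-routing algorithm: expand a BFS wave from the source
    until the target is reached, then trace back the shortest wire path.
    `grid` cells equal to 1 are blocked."""
    rows, cols = len(grid), len(grid[0])
    prev = {src: None}
    q = deque([src])
    while q:
        r, c = q.popleft()
        if (r, c) == dst:  # backtrace from target to source
            path = []
            while (r, c) != src:
                path.append((r, c))
                r, c = prev[(r, c)]
            return [src] + path[::-1]
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0 \
                    and (nr, nc) not in prev:
                prev[(nr, nc)] = (r, c)
                q.append((nr, nc))
    return None  # unroutable

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(lee_route(grid, (0, 0), (2, 0)))
# [(0, 0), (0, 1), (0, 2), (1, 2), (2, 2), (2, 1), (2, 0)]
```

The wave expansion guarantees a shortest path if one exists, which is why Lee's algorithm, despite its age, still underlies detailed-routing engines.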
Exactly as in the case of synthesis, a verification step must be performed after placement and routing.
This is required in order to verify that:

All design rules are satisfied by the final layout.

All and only the desired interconnects have been realized by placement and routing.

This step is done by extracting electrical and logic models from the layout masks, and comparing these models
with the input netlist (already verified for equivalence with the RTL). Note that within each standard
cell, design rules are verified independently, since the ASIC designer, for reasons of intellectual property
protection, generally does not see the actual layout of the standard cells, but only an external envelope
of active (transistor) and interconnect areas, which is sufficient to perform this kind of verification. The
layout of each cell is known and used only at the foundry, when masks are finally produced.
3.7.3 Simulation, Formal Verification, and Test Pattern Generation
The steps mentioned above create a layout implementation from the RTL, while checking simultaneously that
no errors are introduced, either due to programming errors or due to manual modifications, and that
performance and power constraints are satisfied. However, they ensure neither that the original RTL
model satisfies the customer-defined requirements, nor that the manufactured circuit is free of
flaws compromising either its functionality or its performance.

The former problem is tackled by simulation, prototyping, and formal verification. None of these
techniques is sufficient to ensure that an ill-defined problem has a solution: customer needs are inherently
nonformalizable.⁸ However, they help build up confidence that the final product will
satisfy the requirements. Simulation and prototyping are both trial-and-error procedures, similar to the
compile–debug cycle used for software. Simulation is generally cheaper, since it only requires a general-purpose
workstation (nowadays often a PC running Linux), while prototyping is faster (it is based on
synthesizing the RTL model into one or several FPGAs). Cost and performance of these options differ by
⁶Layers that are farther away from silicon are best for long-distance wires, due to the smaller substrate and mutual
capacitance, as well as the smaller sheet resistance [78].

⁷Inductance fortunately does not yet play a significant role, and many doubt that it ever will for digital integrated
circuits.

⁸For example, what is the definition of a correct phone call? Does this refer to not dropping the communication?
To transferring exactly a certain number of voice samples per second? To setting up a communication path quickly?
Since all these desirable characteristics have a cost, what is the maximum price various classes of customers are willing
to pay for them, and what is the maximum degree of violation that can be admitted by each class?
several orders of magnitude. Prototyping on multi-FPGA platforms, such as those offered by Quickturn,
is thus limited to the most expensive designs, such as microprocessors.⁹
Unfortunately, both simulation and prototyping suffer from a basic capacity problem. It is true that
cost decreases exponentially and performance increases exponentially over technology generations for the
simulation and prototyping platforms (CPUs and FPGAs). However, the complexity of the verification
problem grows as a double or even triple exponential (approximately) with technology. The reason is that
the number of potential states of a digital design grows exponentially with the number of memory-holding
components (flip-flops and latches), and the complexity of the verification problem for a sequential entity
(e.g., an FSM) grows even more than exponentially with its state space. For this reason, the number
of input patterns required to prove, up to a given level of confidence, that a design is
correct grows triply exponentially with each technology generation, while capacity and performance grow
only as a single exponential. This is clearly an untenable situation, given that the number of engineers is
finite, and the size of verification teams is already much larger than that of design teams.
Formal verification, defined as proving semiautomatically that, under a set of assumptions, a given
property holds for a design, is a means of alleviating at least the human aspect of this verification
complexity explosion. Formal verification allows one to state a property, such as, for example,
"this protocol never deadlocks" or "the value of this register is never overwritten before being read,"
using relatively simple mathematical formulas. Then one can automatically check that the property holds
over all possible input sequences. The problem, unfortunately, is inherently extremely complex (the triple
exponential mentioned above affects this formulation as well). However, the complexity is now relegated to
the automated portion of the flow. Thus manual generation and checking of individual pattern sequences
is no longer required. Several EDA companies on the market, such as Cadence, Mentor Graphics, and Synopsys,
as well as several silicon vendors, such as Intel and IBM, currently offer or internally develop and use such
tools. The key barriers to adoption are twofold:
1. The complexity of the task, as mentioned above, is just shifted. While a workstation costs much
less than an engineer, exponential growth is never tenable in the long term, regardless of the
constant factors. This means that significant human intervention is still required in order to keep
the time required to check each individual property within acceptable limits. This involves both
breaking properties into simpler subproperties and abstracting away aspects of the system that
are not relevant for the property at hand. Abstraction, however, hides aspects of the real design
from the automated prover, and thus implies the risk of false positive results, that is, of declaring
a system correct even when it is not.

2. Specification of properties is much more difficult than identification of input patterns. A property
must encompass a variety of possible scenarios and state explicitly all assumptions made (e.g.,
there is no deadlock in the bus access protocol only if no master makes requests at every clock
cycle). The language in which properties are specified is often a form of mathematical logic, and
thus is even less familiar than software languages to a typical design engineer.
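The flavor of such property checking can be conveyed by an explicit-state sketch that explores every reachable state and tests a safety property such as mutual exclusion. The strict-alternation protocol below is an invented toy; real tools use symbolic (BDD- or SAT-based) representations precisely because explicit enumeration hits the exponential wall discussed above.

```python
from collections import deque

def check_invariant(init, step, inputs, prop, max_states=10_000):
    """Explicit-state model checking sketch: explore all states reachable
    under every input choice and check a safety property in each one."""
    seen, q = {init}, deque([init])
    while q:
        s = q.popleft()
        if not prop(s):
            return False  # property violated in a reachable state
        for i in inputs:
            t = step(s, i)
            if t not in seen:
                if len(seen) >= max_states:
                    raise RuntimeError("state space too large for explicit search")
                seen.add(t)
                q.append(t)
    return True

# Toy mutual-exclusion protocol (strict alternation): each process cycles
# idle -> wait -> crit, entering crit only when it holds the turn.
def step(state, pid):
    pcs, turn = list(state[0]), state[1]
    if pcs[pid] == "idle":
        pcs[pid] = "wait"
    elif pcs[pid] == "wait" and turn == pid:
        pcs[pid] = "crit"
    elif pcs[pid] == "crit":
        pcs[pid] = "idle"
        turn = 1 - pid
    return (tuple(pcs), turn)

init = (("idle", "idle"), 0)
mutex = lambda s: not (s[0][0] == "crit" and s[0][1] == "crit")
print(check_invariant(init, step, [0, 1], mutex))  # True
```

A property like "never deadlocks" would be checked the same way, by testing each reachable state for the existence of an enabled transition.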
However, significant progress is being made in this area every year by researchers, and adoption of such
automated formal verification techniques in the specification verification domain is growing.
Testing a manufactured circuit to verify that it operates correctly according to the RTL model is a
closely related problem. In principle, one would need to prove equivalent behavior under all possible
input–output sequences, which is clearly impossible. In practice, test engineers either use a naturally
orthogonal architecture, such as that of a microprocessor, in order to functionally test small sequences of
instructions, or they decompose testing into that of combinational and sequential logic. Combinational
logic testing is a relatively easy task, as compared to the formal verification described above. If one
considers only Boolean functionality (i.e., delay is not tested), its complexity (assuming that no polynomial
⁹Nowadays even microprocessors are mostly designed using a modified ASIC-like flow, except for memories, register
files, and sometimes portions of the ALU, which are still designed by hand down to the polygon level, at least for leading-edge
CPUs.
algorithm exists for NP-complete problems) is just a single exponential in the number of combinational
circuit inputs.
While a priori there is no reason why testing only Boolean equivalence between the specification and the
manufactured circuit should be enough to ensure correct functionality, empirically there is a significant
amount of evidence that fully testing for a relatively small class of Boolean manufacturing faults, namely
stuck-at faults, is sufficient to ensure satisfactory actual yield for ASICs. The stuck-at-fault model assumes
that the only problem that can occur during manufacturing is that some gate inputs are fixed
at logical 0 or 1. This may have been a physically realistic model in the early days of bipolar-based
Transistor–Transistor Logic. However, in DSM CMOS a host of physical defects may short wires together,
increase or decrease their resistance and capacitance, short a transistor gate to its source or drain, and so
on. At the logic level, a combinational function may become sequential (or, even worse, may exhibit dynamic
behavior, that is, slowly change output values over time without changing inputs), or it may become faster
or slower. Still, full checking for stuck-at faults is excellent at ensuring that none of these complex physical
problems has occurred or will affect the operation of the circuit.
For this reason, today testing is mostly accomplished by first of all reducing sequential testing to
combinational testing, using special memory elements, the so-called scan flip-flops and latches. Second,
combinational test pattern generation is performed only at the Boolean level, using the above-mentioned
stuck-at model. Test pattern generation is similar to equivalence checking, because it amounts to proving
that two copies of the same circuit, one with and one without a given fault, are indeed not equivalent. The
witness to this nonequivalence is the pattern to be applied to the circuit inputs to identify the fault.
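The "two copies of the circuit" view of test pattern generation can be sketched by enumeration: search for an input vector on which the fault-free netlist and a copy with one node stuck at a fixed value disagree. The two-gate netlist below is invented, and real ATPG tools use the D-algorithm or SAT-based search rather than enumeration.

```python
from itertools import product

def stuck_at_pattern(circuit, fault_node, stuck_value, n_inputs):
    """Test-pattern generation sketch: find an input vector on which the
    good circuit and a copy with `fault_node` stuck at `stuck_value`
    disagree. `circuit(inputs, fault)` evaluates the netlist, forcing
    the faulty node's value when `fault` is given."""
    for vec in product([0, 1], repeat=n_inputs):
        if circuit(vec, None) != circuit(vec, (fault_node, stuck_value)):
            return vec  # witness pattern that detects the fault
    return None  # fault is undetectable (redundant logic)

# Hypothetical netlist: n1 = a AND b; out = n1 OR c.
def circuit(inputs, fault):
    a, b, c = inputs
    nodes = {"n1": a & b}
    if fault and fault[0] in nodes:
        nodes[fault[0]] = fault[1]  # inject the stuck-at fault
    return nodes["n1"] | c

print(stuck_at_pattern(circuit, "n1", 0, 3))  # (1, 1, 0)
```

The returned vector both *activates* the fault (drives n1 to the opposite of its stuck value) and *propagates* the difference to an observable output, which is exactly the nonequivalence witness described above.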
The problem of actually applying the pattern to the physical fragment of combinational logic, and then
observing its outputs to verify whether the fault is present, is solved by converting all or most of the registers
of the sequential circuit into one (or a handful of) giant shift registers, each including several hundred
thousand bits. The pattern (and several others, used to test several CLBs in parallel) is first loaded serially
through the shift register. Then a multiplexer at the input of each flip-flop is switched, transforming the
serial loading mode into parallel loading mode, using the outputs of each CLB as register inputs. Finally,
serial conversion is performed again, and the outputs of the logic are checked for correctness by the test
equipment. Figure 3.6 shows an example of this sort of arrangement, in which the flip-flop clock is also
switched from normal operation (in which it can be gated) to test mode. The only drawback of this elegant
solution, devised by IBM engineers in the 1970s, is the additional time that the circuit needs to spend on
very expensive testing machines, in order to shift patterns in and out through very long flip-flop chains.
Test pattern generation for combinational circuits is a very well-established area of research, and again the
reader is referred to one of the many books in the area for a more extensive description [79].
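The scan-chain mechanism just described (serial load, one parallel capture cycle, serial unload) can be mimicked in a few lines of Python; the bitwise inverter standing in for the combinational logic under test is an invented example.

```python
def scan_test(pattern, logic):
    """Scan-chain sketch: shift a test pattern serially into the flip-flops,
    switch to parallel mode for one capture cycle so the combinational
    logic's outputs are latched, then shift the response back out."""
    n = len(pattern)
    chain = [0] * n
    # Serial load: one bit enters per test clock, the rest shift along.
    for bit in pattern:
        chain = [bit] + chain[:-1]
    # The pattern sits reversed relative to shift order; reorder it.
    chain = chain[::-1]
    # Capture cycle: flip-flops latch the combinational outputs in parallel.
    chain = logic(chain)
    # Serial unload: shift the captured response out bit by bit.
    response = []
    for _ in range(n):
        response.append(chain[-1])
        chain = [0] + chain[:-1]
    return response[::-1]

# Toy combinational block under test: a bitwise inverter.
print(scan_test([1, 0, 1, 1], lambda bits: [b ^ 1 for b in bits]))
# [0, 1, 0, 0]
```

The 2n shift cycles surrounding the single capture cycle are exactly the tester-time overhead lamented in the text.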
Note that memories are not tested using this mechanism, both because it would be too expensive to
convert each cell into a scan register, and because the stuck-at-fault model does not apply to this kind
of circuit. Memories are tested using appropriate input–output pattern sequences, which are generated,
applied, and verified on-chip, using either self-test software running on the embedded processor, or some
FIGURE 3.6 Two scan flip-flops with combinational logic. (Signals shown: Test_Data, Test_Mode, Test_Clk, User_Clk, Q, Sout.)
form of Built-In Self-Test (BIST) logic circuitry. Modern RAM generators, which directly produce the layout
in a given process based on the requested number of rows and columns, often directly produce the
BIST circuitry as well.
3.8 Conclusions
This chapter discussed several aspects of embedded system design, including both methodologies that
allow one to make judicious algorithmic and architectural decisions, and tools supporting various
steps of these methodologies. One must not forget, however, that embedded systems are often complex
compositions of parts that have been implemented by various parties, and thus the task of physical board
or chip integration can be as difficult as, and much more expensive than, the initial architectural decisions.
In order to support the integration and system testing tasks one must use formal models throughout
the design process, and if possible perform early evaluation of the difficulties of integration, by virtual
integration and rapid prototyping techniques. These allow one to find, or avoid completely, subtle bugs
and inconsistencies earlier in the design cycle, and thus reduce overall design time and cost.
Thus the flow and tools that we described in this chapter help not only with the initial design, but
also with the final integration. This is because they are based on executable specifications of the whole
system (including models of its environment), early virtual integration, and systematic (often automated)
refinement toward implementation.
The last part of the chapter summarized the main characteristics of the current hardware and software
implementation flows. While complete coverage of this huge topic is beyond its scope, a lightweight introduction
can hopefully serve to direct the interested reader, who has only a general electrical engineering
or computer science background, toward the most appropriate sources of information.
References
[1] F. Balarin, E. Sentovich, M. Chiodo, P. Giusto, H. Hsieh, B. Tabbara, A. Jurecska, L. Lavagno,
C. Passerone, K. Suzuki, and A. Sangiovanni-Vincentelli. Hardware–Software Co-design of
Embedded Systems: The POLIS Approach. Kluwer Academic Publishers, Dordrecht, 1997.
[2] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A. Vandecapelle. Custom
Memory Management Methodology: Exploration of Memory Organisation for Embedded Multimedia
System Design. Kluwer Academic Publishers, Dordrecht, 1998.
[3] The Mathworks Simulink and StateFlow. http://www.mathworks.com.
[4] National Instruments MATRIXx. http://www.ni.com/matrixx/.
[5] ETAS Ascet-SD. http://www.etas.de.
[6] CoWare N2C, SPW, and LISATek. http://www.coware.com.
[7] Esterel Technologies Esterel Studio. http://www.esterel-technologies.com.
[8] Synopsys Design Compiler, System Studio, and PrimeTime. http://www.synopsys.com.
[9] Telelogic Tau and Doors. http://www.telelogic.com.
[10] I-Logix Statemate and Rhapsody. http://www.ilogix.com.
[11] D. Harel, H. Lachover, A. Naamad, A. Pnueli, M. Politi, R. Sherman, A. Shtull-Trauring, and
M.B. Trakhtenbrot. STATEMATE: a working environment for the development of complex reactive
systems. IEEE Transactions on Software Engineering, 16:403–414, 1990.
[12] The Object Management Group UML. http://www.omg.org/uml/.
[13] L. Lavagno, G. Martin, and B. Selic, Eds. UML for Real: Design of Embedded Real-Time Systems.
Kluwer Academic Publishers, Dordrecht, 2003.
[14] Artisan Software Real Time Studio. http://www.artisansw.com/.
[15] IBM Rational Rose RealTime. http://www.rational.com/products/rosert/.
[16] dSPACE TargetLink and Prototyper. http://www.dspace.de.
[17] OSEK/VDX. http://www.osek-vdx.org/.
[18] E.A. Lee and D.G. Messerschmitt. Synchronous data flow. Proceedings of the IEEE, 75(9):1235–1245,
1987.
[19] J. Buck and R. Vaidyanathan. Heterogeneous modeling and simulation of embedded systems in
El Greco. In Proceedings of the International Conference on Hardware Software Codesign, May 2000.
[20] TNI Valiosys Reqtify. http://www.tni-valiosys.com.
[21] R.P. Kurshan. Automata-Theoretic Verification of Coordinating Processes. Princeton University Press,
Princeton, NJ, 1994.
[22] K. McMillan. Symbolic Model Checking. Kluwer Academic Publishers, Dordrecht, 1993.
[23] G. Buttazzo. Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and Applica-
tions. Kluwer Academic Publishers, Dordrecht, 1997.
[24] H. Gomaa. Software Design Methods for Concurrent and Real-Time Systems. Addison-Wesley,
Reading, MA, 1993.
[25] W.A. Halang and A.D. Stoyenko. Constructing Predictable Real Time Systems. Kluwer Academic
Publishers, Dordrecht, 1991.
[26] R. Ernst, J. Henkel, and T. Benner. Hardware–software codesign for micro-controllers. IEEE Design
and Test of Computers, 10:64–75, 1993.
[27] R.K. Gupta and G. De Micheli. Hardware–software cosynthesis for digital systems. IEEE Design
and Test of Computers, 10:29–41, 1993.
[28] Cadence Design Systems CeltIc, RTL Compiler, and Quickturn. http://www.cadence.com.
[29] Open SystemC Initiative. http://www.systemc.org.
[30] G. Berry. The foundations of Esterel. In Plotkin, Stirling, and Tofte, Eds., Proof, Language and
Interaction: Essays in Honour of Robin Milner. MIT Press, Cambridge, MA, 2000.
[31] S.A. Edwards. Compiling Esterel into sequential code. In International Workshop on
Hardware/Software Codesign. ACM Press, May 1999.
[32] T.B. Ismail, M. Abid, and A.A. Jerraya. COSMOS: a codesign approach for communicating systems.
In International Workshop on Hardware/Software Codesign. ACM Press, 1994.
[33] W. Cesario, A. Baghdadi, L. Gauthier, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, A.A. Jerraya,
and M. Diaz-Nava. Component-based design approach for multicore SoCs. In Proceedings of the
Design Automation Conference, June 2002.
[34] Foresight Systems. http://www.foresight-systems.com.
[35] CARDtools. http://www.cardtools.com.
[36] IMEC ATOMIUM. http://www.imec.be/design/atomium/.
[37] P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model for static analysis of
programs by construction or approximation of fixpoints. In Proceedings of the ACM Symposium
on Principles of Programming Languages. ACM Press, 1977.
[38] AbsInt Worst-Case Execution Time Analyzers. http://www.absint.com.
[39] Y.T.S. Li and S. Malik. Performance analysis of embedded software using implicit path
enumeration. In Proceedings of the Design Automation Conference, June 1995.
[40] 0-In Design Automation. http://www.0-in.com/.
[41] C. Norris Ip. Simulation coverage enhancement using test stimulus transformation. In Proceedings
of the International Conference on Computer Aided Design, November 2000.
[42] Forte Design Systems Cynthesizer. http://www.forteds.com.
[43] Celoxica DK Design suite. http://www.celoxica.com.
[44] K. Wakabayashi. Cyber: high level synthesis system from software into ASIC. In R. Camposano
and W. Wolf, Eds., High Level VLSI Synthesis. Kluwer Academic Publishers, Dordrecht, 1991.
[45] D. Gajski, J. Zhu, and R. Domer. The SpecC Language. Kluwer Academic Publishers, Dordrecht,
1997.
[46] D. Gajski, J. Zhu, R. Domer, A. Gerstlauer, and S. Zhao. SpecC: Specification Language and
Methodology. Kluwer Academic Publishers, Dordrecht, 2000.
[47] OPNET. http://www.opnet.com.
[48] Network Simulator NS-2. http://www.isi.edu/nsnam/ns/.
[49] Mentor Graphics Seamless and Emulation. http://www.mentor.com.
[50] VAST Systems CoMET. http://www.vastsystems.com/.
[51] Axys Design Automation MaxSim and MaxCore. http://www.axysdesign.com/.
[52] J. Rowson. Hardware/software co-simulation. In Proceedings of the Design Automation Conference,
1994, pp. 439–440.
[53] V. Zivojnovic and H. Meyr. Compiled HW/SW co-simulation. In Proceedings of the Design
Automation Conference, 1996.
[54] Altera DSP Builder. http://www.altera.com.
[55] Xilinx System Generator. http://www.xilinx.com.
[56] IEEE. Standard 1076.1, VHDL-AMS. http://www.eda.org/vhdl-ams.
[57] OVI. Verilog-A standard. http://www.ovi.org.
[58] B. Kernighan and D. Ritchie. The C Programming Language. Prentice-Hall, New York, 1988.
[59] K. Arnold and J. Gosling. The Java Programming Language. Addison-Wesley, Reading, MA, 1996.
[60] Sun Microsystems, Inc. Embedded Java Specification. Available at http://java.sun.com, 1998.
[61] Real-Time for Java Expert Group. The Real-Time Specification for Java. Available at http://
rtsj.dev.java.net/, 1998.
[62] A.V. Aho, J.E. Hopcroft, and J.D. Ullman. The Design and Analysis of Computer Algorithms.
Addison-Wesley, Reading, MA, 1974.
[63] P. Panda, N. Dutt, and A. Nicolau. Efficient utilization of scratch-pad memory in embedded
processor applications. In Proceedings of Design Automation and Test in Europe (DATE), February
1997.
[64] Y.T.S. Li, S. Malik, and A. Wolfe. Performance estimation of embedded software with instruc-
tion cache modeling. In Proceedings of the International Conference on Computer-Aided Design,
November 1995.
[65] F. Mueller and D.
…may mean that data is sent from CPU1 to CPU2 with a higher
frequency, at least for a limited amount of time. This means that the bus is more loaded by this traffic,
which may slow down the communication from CPU3 to CPU4. If this communication performance has
a direct influence on the system performance, we will see a decreased overall system performance.
Over-synchronization. Assume that the upper and lower branches in Figure 4.1 have no mutual functional
dependence, as the dataflow arrows indicate. Assume further that process B is blocked when it tries to send
data to C1 or D1 but the receiver is not ready to accept the data. Then, a delay or deadlock in branch D
will propagate back through process B to both A and the entire C branch.
These examples are not limited to situations where different MoCs interact. They show that, when
separate, seemingly unrelated subsystems interact via a nonobvious mechanism, which is often a shared
resource, the effects can be hard to analyze. When the different subsystems are modeled in different
MoCs the problem is even more pronounced, due to different communication semantics, synchronization
mechanisms, and time representations.
4.1.5 Time
The treatment of time will serve as the most important dimension along which we distinguish MoCs. We can
identify at least four levels of accuracy: continuous time, discrete time, clocked time, and
causality. In the sequel, we cover only the last three levels.

When time is not modeled explicitly, events are only partially ordered with respect to their causal
dependences. In one approach, taken for instance in deterministic dataflow networks [14, 15], the system
FIGURE 4.1 Over-synchronization between functionally independent subsystems. (Processes shown: A, B, C1–C3, D1–D3.)
Models of Embedded Computation 4-7
behavior is independent of the delays and timing behavior of computation elements and communication
channels. These models are robust with respect to time variations, in that any implementation, no matter
how slow or fast, will exhibit the same behavior as the model. Alternatively, different delays may affect
the system's behavior, and we obtain an inherently nondeterministic model, since timing behavior that is
not modeled explicitly is allowed to influence the observable behavior. This approach has been taken both
in the context of dataflow models [16–19] and process algebras [20, 21]. In this chapter we follow the
deterministic approach, which can be generalized to approximate nondeterministic behavior by means of
stochastic processes, as shown in Reference 22.
To exploit the very regular timing of some applications, the synchronous dataflow (SDF) model [23] has been
developed. Every process consumes and emits a statically fixed number of events in each evaluation cycle.
The evaluation cycle is the reference time. The regularity of the application is translated into a restriction
of the model, which in turn allows efficient analysis and synthesis techniques that are not applicable to
more general models. Scheduling, buffer size optimization, and synthesis techniques have been successfully
developed for SDF.
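The restriction that every SDF process produces and consumes statically fixed token counts is what makes these analyses possible: the so-called balance equations can be solved for the smallest integer repetition vector, as the following sketch shows. The two-actor graph is an invented example.

```python
from fractions import Fraction
from math import lcm

def repetition_vector(arcs, actors):
    """SDF balance equations sketch: for each arc (src, produced, dst,
    consumed), a consistent periodic schedule needs
    r[src] * produced == r[dst] * consumed. Propagate rational firing
    rates from an arbitrary root actor, then scale to the smallest
    integer vector."""
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:
        changed = False
        for src, p, dst, c in arcs:
            if src in rate and dst not in rate:
                rate[dst] = rate[src] * p / c
                changed = True
            elif dst in rate and src not in rate:
                rate[src] = rate[dst] * c / p
                changed = True
    scale = lcm(*(r.denominator for r in rate.values()))
    return {a: int(rate[a] * scale) for a in actors}

# A produces 2 tokens per firing, B consumes 3: fire A 3 times per 2 of B.
print(repetition_vector([("A", 2, "B", 3)], ["A", "B"]))  # {'A': 3, 'B': 2}
```

From the repetition vector, a static schedule and exact buffer bounds follow, which is exactly what compile-time SDF tools exploit.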
One facet related to the representation of time is the dichotomy between dataflow-dominated and control-flow-dominated
applications. Dataflow-dominated applications tend to have events that occur at very regular
intervals. Thus, explicit representation of time is not necessary, and is in fact often inefficient. In contrast,
control-dominated applications deal with events occurring at very irregular time instants. Consequently,
explicit representation of time is a necessity, because the timing of events cannot be inferred. Difficulties
arise in systems that contain both elements. Unfortunately, these kinds of systems are becoming more common
as the average system complexity steadily increases. As a consequence, several attempts to integrate
dataflow and control-dominated modeling concepts have emerged.
In the synchronous piggybacked dataflow model [24], control events are transported on dataflow
streams to represent a global state without breaking the locality principle of dataflow models.
The composite signal flow [25] distinguishes between control and dataflow processes and puts significant
effort into maintaining the frame-oriented processing that is so common in dataflow and signal
processing applications for efficiency reasons. However, conflicts occur when irregular control events
must be synchronized with dataflow events inside frames. The composite signal flow addresses this problem
by allowing an approximation of the synchronization, and defines conditions under which approximations
are safe and do not lead to erroneous behavior.
Time is divided into time slots or clock cycles by the various synchronous models. According to the
perfect synchrony assumption [26, 27], neither communication nor computation takes any noticeable
time, and the time slots or evaluation cycles are completely determined by the arrival of input events. This
assumption is useful because designers and tools can concentrate solely on the functionality of the system
without mixing this activity with timing considerations. Optimization of performance can be done in
a separate step by means of static timing analysis and local retiming techniques. Even though timing
does not appear explicitly in synchronous models, the behavior is not independent of time. The model
constrains all implementations such that they must be fast enough to process input events properly and
to complete an evaluation cycle before the next events arrive. When no events occur in an evaluation
cycle, a special token called the absent event is used to communicate the advance of time. In our framework
we use the same technique in Sections 4.2.4 and 4.2.5 for both the synchronous MoC and the fully
timed MoC.
Discrete timed models use a discrete set, usually the integers or natural numbers, to assign a time stamp to each event. Many discrete event models fall into this category [28–30], as do most popular hardware description languages, such as VHDL and Verilog. Timing behavior can be modeled most accurately, which makes this the most general model we consider here and makes it applicable to problems such as detailed performance simulation where synchronous and untimed models cannot be used. The price for this is the intimate dependence of functional behavior on timing details and significantly higher computation costs for analysis, simulation, and synthesis problems. Discrete timed models may be nondeterministic, as mainly used in performance analysis and simulation (see, e.g., Reference 30), or deterministic, as is more desirable for hardware description languages such as VHDL.
2006 by Taylor & Francis Group, LLC
4-8 Embedded Systems Handbook
The integration of these different timing models into a single framework is a difficult task. Many attempts have been made on a practical level with a concrete design task, mostly simulation, in mind [31–35]. On a conceptual level Lee and Sangiovanni-Vincentelli [36] have proposed a tagged time model in which every event is assigned a time tag. Depending on the tag domain we obtain different MoCs. If the tag domain is a partially ordered set, it results in an untimed model according to our definition. Discrete, totally ordered sets lead to timed MoCs, and continuous sets result in continuous time MoCs. There are two main differences between the tagged time model and our proposed framework. First, in the tagged time model processes do not know how much time has progressed when no events are received, since global time is only communicated via the time stamps of ordinary events. For instance, a process cannot trigger a time-out if it has not received events for a particular amount of time. Our timed model in Section 4.2.5 does not use time tags but absent events to globally order events. Since absent events are communicated between processes whenever no other event occurs, processes are always informed about the advance of global time. We chose this approach because it better resembles the situation in design languages, such as VHDL, C, or SDL (Specification and Description Language), where processes can always experience time-outs. Second, one of our main motivations was the separation of communication and synchronization issues from the computation part of processes. Hence, we strictly distinguish between process interfaces and process functionality. Only the interfaces determine to which MoC a process belongs, while the core functionality is independent of the MoC. This feature is absent from the tagged time model. This separation of concerns has been inspired by the concept of firing cycles in dataflow process networks [37]. Our mechanism for consuming and emitting events based on signal partitionings, as described in Sections 4.2.2 and 4.2.3.1, is only slightly more general than the firing rules described by Lee [37], but it allows a useful definition of process signatures based on the way processes consume and emit events.
4.1.6 The Purpose of an MoC
As mentioned several times, the purpose of a computational model determines how it is designed, what properties it exposes, and what properties it suppresses.
We argue that MoCs for embedded systems should not address principal questions of computability or feasibility, but should rather aid the design and validation of concrete systems. How this is best accomplished remains a subject of debate, but for this chapter we assume that an MoC should support the following properties:
Implementation independence. An abstract model should not expose too many details of a possible implementation, for example, which kind of processor is used, how many parallel resources are available, what kind of hardware implementation technology is used, details of the memory architecture, etc. Since an MoC is a machine abstraction, it should, by definition, avoid unnecessary machine details. Practically speaking, the benefits of an abstract model include that analysis and processing are faster and more efficient, that analysis results are relevant for a larger set of implementations, and that the same abstract model can be directed to different architectures and implementations. On the downside we note diminished analysis accuracy and a lack of knowledge of the target architecture that could be exploited for modeling and design. Hence, the right abstraction level is a fine line that is also changing over time. While many embedded system designers could for long safely assume a purely sequential implementation, current and future computational models should avoid such an assumption. Resource sharing and scheduling strategies are becoming more complex, and an MoC should thus either allow the explicit modeling of such a strategy or restrict the implementations to follow a particular, well-defined strategy.
Composability. Since many parts and components are typically developed independently and integrated into a system, it is important to avoid unexpected interferences. Thus, some kind of composability property [38] is desirable. One step in this direction is to have a deterministic computational model, such as Kahn process networks, that guarantees a particular behavior independent of the timing of individual activities and independent of the amount of available resources in general.
Models of Embedded Computation 4-9
This is of course only a first step since, as argued earlier, time behavior is often an integral part of the functional behavior. Thus, resource sharing strategies, which greatly influence timing, will still have a major impact on the system behavior even for fully deterministic models. We can reconcile good system composability with shared resources by allocating a minimum but guaranteed amount of resources for each subsystem or task. For instance, two tasks get a fixed share of the communication bandwidth of a bus. This approach allows for ideal composability but has to be based on worst-case behavior. It is very conservative and hence does not utilize resources efficiently.
We can relax this approach by allocating abstract resource budgets as part of the computational model. We then require the implementation to provide the requested resources and, at the same time, to minimize the abstract budgets and thus the required resources. As an example, consider two tasks that have a particular communication need per abstract time slot, where the communication need may be different for different slots. The implementation has to fulfill the communication requirements of all tasks by providing the necessary bandwidth in each time slot, tuning the length of the individual time slots, or by moving communication from one slot to another. These optimizations will also have to consider global timing and resource constraints. In any case, in the abstract model we can deal with abstract budgets and assume that they will be provided by any valid implementation.
Analyzability. A general tradeoff exists between the expressiveness of a model and its analyzability. By restricting models in clever ways, one can apply powerful and efficient analysis and synthesis methods. For instance, the SDF model allows each actor only a constant number of input and output tokens in each activation cycle. While this restricts the expressiveness of the model, it allows static schedules to be computed efficiently when they exist. For general dataflow graphs this may not be possible because it could be impossible to ensure that the number of input and output tokens is always constant for all actors, even if it is in a particular case. Since SDF covers a fairly large and important application domain, it has become a very useful MoC. The key is to understand what the important properties are (finding static schedules, finding memory bounds, finding maximum delays, etc.) and to devise an MoC that allows these properties to be handled efficiently and does not restrict the modeling power too much.
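To make the SDF case concrete, the following Python sketch solves the SDF balance equations (tokens produced per firing of the source actor times its repetition count must equal tokens consumed times the destination's repetition count) for the smallest positive integer repetition vector, from which a static schedule can be derived. The edge and actor encoding is our own illustration, not notation from this chapter, and the graph is assumed to be connected.

```python
from fractions import Fraction
from math import gcd, lcm

def repetition_vector(edges, actors):
    """Solve the SDF balance equations r[src]*prod = r[dst]*cons for the
    smallest positive integer repetition vector (assumes a connected graph)."""
    rates = {a: None for a in actors}
    rates[actors[0]] = Fraction(1)
    # Propagate relative firing rates along edges until all actors are assigned.
    changed = True
    while changed:
        changed = False
        for src, dst, prod, cons in edges:
            if rates[src] is not None and rates[dst] is None:
                rates[dst] = rates[src] * prod / cons
                changed = True
            elif rates[dst] is not None and rates[src] is None:
                rates[src] = rates[dst] * cons / prod
                changed = True
    # Consistency check: every edge must balance, otherwise no static schedule.
    for src, dst, prod, cons in edges:
        if rates[src] * prod != rates[dst] * cons:
            raise ValueError("inconsistent SDF graph: no static schedule")
    # Scale the rational rates to the smallest integer vector.
    denom = lcm(*(r.denominator for r in rates.values()))
    ints = {a: int(rates[a] * denom) for a in actors}
    g = 0
    for v in ints.values():
        g = gcd(g, v)
    return {a: v // g for a, v in ints.items()}

# Actor A produces 2 tokens per firing; B consumes 3 per firing.
print(repetition_vector([("A", "B", 2, 3)], ["A", "B"]))  # {'A': 3, 'B': 2}
```

For a general dataflow graph, where token rates may vary from firing to firing, the consistency check above is exactly what cannot be carried out statically.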
In the following sections we discuss a framework to study different MoCs. The idea is to use different types of process constructors to instantiate processes of different MoCs. Thus, one type of process constructor would yield only untimed processes, while another type results in timed processes. The elements for process construction are simple functions and are in principle independent of a particular MoC. However, the independence is not complete since some MoCs put specific constraints on the functions. But still the separation of the process interfaces from the internal process behavior is fairly far reaching. The interfaces determine the time representation, synchronization, and communication, and hence the MoC.
In this chapter we will not elaborate all interesting and desirable properties of computational models. Rather, we will use the framework to introduce four different MoCs that differ only in their timing abstraction. Since time plays a very prominent role in embedded systems, we focus on this aspect and show how different time abstractions can serve different purposes and needs. Another defining aspect of embedded systems is heterogeneity, which we address by allowing different MoCs to coexist in a model. The common framework makes this integration semantically clean and simple. We study two particular aspects of this coexistence, namely the interfaces between two different MoCs and the refinement of one MoC into another.
Other central issues of embedded systems, such as power consumption and global analysis and optimization, are not covered, mostly because they are not very well understood in this context and few advanced proposals exist on how to deal with them from an MoC perspective.
4.2 The MoC Framework
In the remainder of this chapter we discuss a framework that accommodates MoCs with different timing abstractions. It is based on process constructors, a mechanism to instantiate processes. A process constructor takes one or more pure functions as arguments and creates a process. The functions represent
the process behavior and have no notion of time or concurrency. They simply take arguments and produce results. The process constructor is responsible for establishing communication with other processes. It defines the time representation, the communication, and the synchronization semantics. A set of process constructors determines a particular MoC. This leads to a systematic and clean separation of computation and communication. A function that defines the computation of a process can in principle be used to instantiate processes in different computational models. However, a computational model may put constraints on functions. For instance, the synchronous MoC requires a function to take exactly one event on each input and produce exactly one event for each output. The untimed MoC does not have a similar requirement.
After some preliminary definitions in this section, we introduce the untimed processes, give a formal definition of an MoC, and define the untimed MoC (Section 4.2.3), the perfectly synchronous and the clocked synchronous MoC (Section 4.2.4), and the discrete time MoC (Section 4.2.5). Based on this we introduce interfaces between MoCs and present an interface refinement procedure in the next section. Furthermore, we discuss the refinement from an untimed MoC to a synchronous MoC and to a timed MoC.
4.2.1 Processes and Signals
Processes communicate with each other by writing to and reading from signals. Given is a set of values V, which represents the data communicated over the signals. Events, which are the basic elements of signals, are or contain values. We distinguish among three different kinds of events. Untimed events Ė are just values without further information, Ė = V. Synchronous events Ē include a pseudo-value ⊥ in addition to the normal values, hence Ē = V ∪ {⊥}. Timed events Ê are identical to synchronous events, Ê = Ē. However, since it is often useful to distinguish them, we use different symbols. Intuitively, timed events occur at a much finer granularity than synchronous events and they would usually represent physical time units, such as a nanosecond. In contrast, synchronous events represent abstract time slots or clock cycles. This model of events and time can only accommodate discrete time models. Continuous time would require a different representation of time and events. We use the symbols ė, ē, and ê to denote individual untimed, synchronous, and timed events, respectively. We use E = Ė ∪ Ē ∪ Ê and e ∈ E to denote any kind of event.
Signals are sequences of events. Sequences are ordered and we use subscripts, as in e_i, to denote the ith event in a signal. For example, a signal may be written as ⟨e_0, e_1, e_2⟩. In general, signals can be finite or infinite sequences of events, and S is the set of all signals. We also distinguish among three kinds of signals: Ṡ, S̄, and Ŝ denote the untimed, synchronous, and timed signal sets, respectively, and ṡ, s̄, and ŝ designate individual untimed, synchronous, and timed signals.
⟨⟩ is the empty signal and ⊕ concatenates two signals. Concatenation is associative and has the empty signal as its neutral element: s_1 ⊕ (s_2 ⊕ s_3) = (s_1 ⊕ s_2) ⊕ s_3, ⟨⟩ ⊕ s = s ⊕ ⟨⟩ = s. To keep the notation simple we often treat individual events as one-event sequences; for example, we may write e ⊕ s to denote ⟨e⟩ ⊕ s.
We use angle brackets ⟨ and ⟩ not only to denote ordered sets or sequences of events, but also to denote sequences of signals if we impose an order on a set of signals.
#s gives the length of signal s. Infinite signals have infinite length and #⟨⟩ = 0.
[ ] is an index operation to extract the event at a particular position from a signal. For example, s[2] = e_2 if s = ⟨e_1, e_2, e_3⟩.
Processes are defined as functions on signals
p : S → S.
Processes are functions in the sense that for a given input signal we always get the same output signal, that is, s = s′ ⇒ p(s) = p(s′). Note that this still allows processes to have an internal state. Thus, a process does not necessarily react identically to the same event applied at different times. But it will
[Figure 4.2 shows a process p with input signal s = ⟨r_0, r_1, ...⟩ = ⟨⟨e_0, e_1, e_2⟩, ⟨e_3, e_4, e_5⟩, ...⟩, partitioned with ν(i) = 3 for all i, and output signal s′ = ⟨r′_0, r′_1, ...⟩ = ⟨⟨e′_0, e′_1⟩, ⟨e′_2, e′_3⟩, ...⟩, partitioned with ν′(i) = 2 for all i.]
FIGURE 4.2 The input signal of process p is partitioned into an infinite sequence of subsignals each of which contains three events, while the output signal is partitioned into subsignals of length 2.
produce the same, possibly infinite, output signal when confronted with identical, possibly infinite, input signals, provided it starts with the same initial state.
4.2.2 Signal Partitioning
We shall use the partitioning of signals into subsequences to define the portions of a signal that are consumed or emitted by a process in each evaluation cycle.
A partition π(ν, s) of a signal s defines an ordered set of signals, ⟨r_i⟩, which, when concatenated together, form almost the original signal s. The function ν : N_0 → N_0 defines the lengths of all elements in the partition. ν(0) = #r_0 gives the length of the first element in the partition, ν(1) = #r_1 gives the length of the second element, etc.
Example 4.1 Let s_1 = ⟨1, 2, 3, 4, 5, 6, 7, 8, 9, 10⟩ and ν_1(0) = ν_1(1) = 3, ν_1(2) = 4. Then we get the partition π(ν_1, s_1) = ⟨⟨1, 2, 3⟩, ⟨4, 5, 6⟩, ⟨7, 8, 9, 10⟩⟩.
Let s_2 = ⟨1, 2, 3, ...⟩ be the infinite signal with ascending integers. Let ν_2(i) = 2 for all i ≥ 0. The resulting partition is infinite: π(ν_2, s_2) = ⟨⟨1, 2⟩, ⟨3, 4⟩, ...⟩.
The function ν(i) defines the length of the subsignals r_i. If it is constant for all i we usually omit the argument and write ν. Figure 4.2 illustrates a process with an input signal s and an output signal s′; s is partitioned into subsignals of length 3 and s′ into subsignals of length 2.
Definition 4.2 Let p_1, p_2 : S → S be two processes with one input and one output each, and let s_1, s_2 ∈ S be two signals. Their parallel composition, denoted as p_1 ∥ p_2, is defined as follows.
(p_1 ∥ p_2)(⟨s_1, s_2⟩) = ⟨p_1(s_1), p_2(s_2)⟩.
Since processes are functions we can easily define sequential composition in terms of functional composition.
Definition 4.3 Let again p_1, p_2 : S → S be two processes and let s ∈ S be a signal. The sequential composition, denoted as p_1 ∘ p_2, is defined as follows.
(p_2 ∘ p_1)(s) = p_2(p_1(s)).
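Modeling processes as plain functions on signals, the two composition operators can be sketched in a few lines of Python; the encoding of signals as lists is our own illustration:

```python
def par(p1, p2):
    """Parallel composition p1 || p2: apply p1 and p2 to a pair of signals."""
    return lambda s1, s2: (p1(s1), p2(s2))

def seq(p1, p2):
    """Sequential composition: feed p1's output signal into p2."""
    return lambda s: p2(p1(s))

double = lambda s: [2 * e for e in s]   # an illustrative process
inc = lambda s: [e + 1 for e in s]      # another illustrative process

print(par(double, inc)([1, 2], [3, 4]))  # ([2, 4], [4, 5])
print(seq(double, inc)([1, 2]))          # [3, 5]
```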
Definition 4.4 Given a process p : (S × S) → (S × S) with two input signals and two output signals, we define the process μp : S → S by the equation
(μp)(s_1) = s_2 where p(s_1, s_3) = (s_2, s_3).
The behavior of the process μp is defined by the least fixed point semantics based on the prefix order of signals.
The μ operator gives feedback loops (Figure 4.3) a well-defined semantics. Moreover, the value of the feedback signal can be constructed by repeatedly simulating the process network, starting with the empty signal, until the values on all feedback signals stabilize and do not change any more [39].
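The iterative construction of the feedback signal can be sketched as follows. This Python illustration is our own encoding (signals as finite lists); the example process computes a running sum whose definition refers to its own output through the feedback loop:

```python
def mu(p):
    """Fixed point of a feedback loop: simulate repeatedly, starting
    from the empty feedback signal, until it stabilizes."""
    def looped(s1):
        fb = []
        while True:
            out, new_fb = p(s1, fb)
            if new_fb == fb:        # feedback signal has stabilized
                return out
            fb = new_fb
    return looped

def acc(s1, s3):
    """Running-sum process: out[i] = s1[i] + out[i-1], with the second
    output fed back as s3. zip truncates to the shorter signal, so each
    simulation round extends the output by at most one event."""
    out = [a + b for a, b in zip(s1, [0] + s3)]
    return out, out

print(mu(acc)([1, 2, 3]))  # [1, 3, 6]
```

Each round of the while loop corresponds to one simulation of the process network; for the three-event input the feedback signal grows ⟨⟩, ⟨1⟩, ⟨1, 3⟩, ⟨1, 3, 6⟩ and then stabilizes.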
Now we are in a position to define precisely what we mean by an MoC.
Definition 4.5 An MoC is a 2-tuple MoC = (C, O), where C is a set of process constructors, each of which, when given constructor-specific parameters, instantiates a process. O is a set of process composition operators, each of which, when given processes as arguments, instantiates a new process.
[Figure 4.3 shows a process p with external input s_1 and output s_2; its second output s_3 is fed back to its second input, forming the process μp.]
FIGURE 4.3 Feedback composition of a process.
Definition 4.6 The untimed MoC is defined as untimed MoC = (C, O), where
C = {mealyU, zipU, unzipU}
O = {∥, ∘, μ}.
In other words, a process or a process network belongs to the untimed MoC domain iff all its processes and process compositions are constructed either by one of the named process constructors or by one of the composition operators. We call such processes U-MoC processes.
Because the process interface is separated from the functionality of the process, interesting transformations can be done. For instance, a process can be mechanically transformed into a process that consumes and produces a multiple of the number of events of the original process. Processes can be easily merged into more complex processes. Moreover, there may be the opportunity to move functionality from one process to another. For more details on this kind of transformation see Reference 39.
4.2.4 The Synchronous MoC
The synchronous languages StateCharts [40], Esterel [41], Signal [42], Argos, Lustre [43], and some others have been developed on the basis of the perfect synchrony assumption.
Perfect synchrony hypothesis. Neither computation nor communication takes time.
Timing is entirely determined by the arrival of input events because the system processes input samples in zero time and then waits until the next input arrives. If the implementation of the system is fast enough to process all inputs before the next sample arrives, it will behave exactly as the specification in the synchronous language.
4.2.4.1 Process Constructors
Formally, we develop synchronous processes as a special case of untimed processes. This will allow us later to easily connect different domains.
Synchronous processes have two specific characteristics. First, all synchronous processes consume and produce exactly one event on each input or output in each evaluation cycle, that is, the signature is always ⟨1, . . .⟩, ⟨1, . . .⟩. Second, in addition to the value set V, events can carry the special value ⊥, which denotes the absence of an event; this is the way we defined synchronous events Ē and signals S̄ in Section 4.2.1. Both the processes and their contained functions must be able to deal with these events. All synchronous process constructors and processes operate exclusively on synchronous signals.
Definition 4.7 Let V be an arbitrary set of values, Ē = V ∪ {⊥}, let g, f : (Ē × S̄) → S̄, and let w_0 ∈ V be an initial state. mealyS is a process constructor which, given f, g, and w_0 as arguments, instantiates a process p : S̄ → S̄. p repeatedly applies g on the current state and the input event to compute the next state. Further it
applies f repeatedly on the current state and the input event to compute the output event. p consumes exactly one input event in each evaluation cycle and emits exactly one output event.
We only require that g and f are defined for absent input events and that the output signal partitioning is the constant 1.
When we merge two signals into one, we have to decide how to represent the absence of an event in one input signal in the compound signal. We choose to use the symbol ⊥ for this purpose also, which has the consequence that ⊥ also appears in tuples together with normal values. Thus, it is essentially used for two different purposes. Having clarified this, the definition of zipS and unzipS is straightforward. zipS-based processes pack two events from the two inputs into an event pair at the output, while unzipS performs the inverse operation.
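A minimal sketch of the mealyS constructor in Python may clarify the role of the absent event. We model ⊥ as None; the counter example below, and the whole encoding, are our own illustration rather than the chapter's formal definition:

```python
ABSENT = None   # stands for the absent event (bottom)

def mealy_s(g, f, w0):
    """mealyS sketch: one event consumed and one event emitted per
    evaluation cycle. g computes the next state, f the output event;
    both must also be defined for the absent event (None)."""
    def process(signal):
        state, out = w0, []
        for e in signal:
            out.append(f(state, e))
            state = g(state, e)
        return out
    return process

# Count nonabsent events; emit the current count, or absent on absent input.
g = lambda w, e: w if e is ABSENT else w + 1
f = lambda w, e: ABSENT if e is ABSENT else w + 1
counter = mealy_s(g, f, 0)
print(counter([7, ABSENT, 9, 4]))  # [1, None, 2, 3]
```

Note that the output has exactly as many slots as the input: even when nothing happens, the absent event keeps the time slots of both signals aligned.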
4.2.4.2 The Perfectly Synchronous MoC
Again, we can now make precise what we mean by the synchronous MoC.
Definition 4.8 The synchronous MoC is defined as synchronous MoC = (C, O), where
C = {mealyS, zipS, unzipS}
O = {∥, ∘, μS}.
In other words, a process or a process network belongs to the synchronous MoC domain iff all its processes and process compositions are constructed either by one of the named process constructors or by one of the composition operators. We call such processes S-MoC processes.
Note that we do not use the same feedback operator for the synchronous MoC. μS defines the semantics of the feedback loop based on the Scott order of the values in Ē. It is also based on a fixed point semantics, but it is resolved for each event and not over a complete signal. We have adopted μS to be consistent with the zero-delay feedback loop semantics of most synchronous languages. For our purpose here this is not significant and we do not need to go into more detail. For precise definitions and a thorough motivation the reader is referred to Reference 39.
Merging of processes and other related transformations are very simple in the synchronous MoC because all processes have essentially identical interfaces. For instance, the merge of two mealyS-based processes can be formulated as follows.
mealyS(g_1, f_1, v_0) ∘ mealyS(g_2, f_2, w_0) = mealyS(g, f, (v_0, w_0))
where g((v, w), e) = (g_1(v, f_2(w, e)), g_2(w, e))
f((v, w), e) = f_1(v, f_2(w, e)).
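The merge equation can be checked operationally. The sketch below uses an illustrative Python encoding of mealyS (our own, with example functions that are likewise hypothetical) and verifies that the merged process and the composition produce the same output signal:

```python
def mealy_s(g, f, w0):
    """Minimal mealyS sketch: one event in, one event out per cycle."""
    def process(signal):
        state, out = w0, []
        for e in signal:
            out.append(f(state, e))
            state = g(state, e)
        return out
    return process

# Two simple synchronous processes (illustrative functions, no absent events).
g1 = lambda v, e: e          # next state := last input (a unit delay)
f1 = lambda v, e: v          # output := previous input
g2 = lambda w, e: w + e      # accumulate
f2 = lambda w, e: w + e      # output the running sum

p1 = mealy_s(g1, f1, 0)
p2 = mealy_s(g2, f2, 0)

# Merged process, following the formula above (p2 is applied first).
g = lambda vw, e: (g1(vw[0], f2(vw[1], e)), g2(vw[1], e))
f = lambda vw, e: f1(vw[0], f2(vw[1], e))
merged = mealy_s(g, f, (0, 0))

s = [1, 2, 3, 4]
print(p1(p2(s)))  # [0, 1, 3, 6]
print(merged(s))  # [0, 1, 3, 6]
```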
4.2.4.3 The Clocked Synchronous MoC
It is useful to define a variant of the perfectly synchronous MoC, the clocked synchronous MoC, which is based on the following hypothesis.
Clocked synchronous hypothesis. There is a global clock signal controlling the start of each computation in the system. Communication takes no time and computation takes one clock cycle.
First, we define a delay process Δ that delays all inputs by one evaluation cycle.
Δ = mealyS(g, f, ⊥)
where g(w, e) = e, f(w, e) = w.
Based on this delay process we define the constructors for the clocked synchronous model.
Definition 4.9
mealyCS(g, f, w_0) = mealyS(g, f, w_0) ∘ Δ
zipCS()(s̄_1, s̄_2) = zipS()(Δ(s̄_1), Δ(s̄_2))
unzipCS() = unzipS() ∘ Δ.
(4.1)
Thus, elementary processes are composed of a combinatorial function and a delay function that essentially
represents a latch at the inputs.
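The delay process Δ and the mealyCS constructor can be sketched in Python; the list-based encoding and the example functions are our own illustration, with ⊥ modeled as None:

```python
ABSENT = None  # the absent event; also the initial latch content

def mealy_s(g, f, w0):
    """Minimal mealyS sketch (one event in, one event out per cycle)."""
    def process(signal):
        state, out = w0, []
        for e in signal:
            out.append(f(state, e))
            state = g(state, e)
        return out
    return process

# Delay process: next state is the input event, output is the stored state.
delta = mealy_s(lambda w, e: e, lambda w, e: w, ABSENT)

def mealy_cs(g, f, w0):
    """mealyCS sketch: a mealyS process preceded by the delay (a latch
    at the input, as stated in the text)."""
    p = mealy_s(g, f, w0)
    return lambda s: p(delta(s))

print(delta([1, 2, 3]))  # [None, 1, 2]
```

Every input thus reaches the combinatorial function one clock cycle late, which is exactly the clocked synchronous hypothesis of one-cycle computation.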
Definition 4.10 The clocked synchronous MoC is defined as clocked synchronous MoC = (C, O), where
C = {mealyCS, zipCS, unzipCS}
O = {∥, ∘, μ}.
In other words, a process or a process network belongs to the clocked synchronous MoC domain iff all its processes and process compositions are constructed either by one of the named process constructors or by one of the composition operators. We call such processes CS-MoC processes.
4.2.5 Discrete Timed MoCs
Timed processes are a blend of untimed and synchronous processes in that they can consume and produce more than one event per cycle and they also deal with absent events. In addition, they have to comply with the constraint that output events cannot occur before the input events of the same evaluation cycle. This is achieved by enforcing an equal number of input and output events for each evaluation cycle, and by prepending an initial sequence of absent events. Since the signals also represent the progression of time, the prefix of absent events at the outputs corresponds to an initial delay of the process in reacting to the inputs. Moreover, the partitioning of input and output signals corresponds to the duration of each evaluation cycle.
Definition 4.11 mealyT is a process constructor which, given γ, f, g, and w_0 as arguments, instantiates a process p : Ŝ → Ŝ. Again, γ is a function of the current state and determines the number of input events consumed in a particular evaluation cycle. Function g computes the next state and f computes the output events with the constraint that the output events do not occur earlier than the input events on which they depend.
This constraint is necessary because in the timed MoC each event corresponds to a time stamp and we have a globally total order of time, relating all events in all signals to each other. To avoid causality flaws every process has to abide by this constraint.
Similarly, zipT-based processes consume events from their two inputs and pack them into tuples of events emitted at the output. unzipT performs the inverse operation. Both also have to comply with the causality constraint.
Again, we can now make precise what we mean by the timed MoC.
Definition 4.12 The timed MoC is defined as timed MoC = (C, O), where
C = {mealyT, zipT, unzipT}
O = {∥, ∘, μ}.
In other words, a process or a process network belongs to the timed MoC domain iff all its processes and process compositions are constructed either by one of the named process constructors or by one of the composition operators. We call such processes T-MoC processes.
Merging, other transformations, as well as analysis of timed process networks are more complicated than for synchronous or untimed MoCs, because the timing may interfere with the pure functional behavior. However, we can further restrict the functions used in constructing the processes, to more or less separate behavior from timing also in the timed MoC. To illustrate this we discuss a few variants of the Mealy process constructor.
mealyPT. In mealyPT(γ, f, g, w_0) based processes the functions f and g are not exposed to absent events and they are only defined on untimed sequences. The interface of the process strips off all absent events of the input signal, hands over the result to f and g, and inserts absent events at the output as appropriate to provide proper timing for the output signal. The function γ, which may depend on the process state as usual, defines how many events are consumed. Essentially, it represents a timer and determines when the input should be checked the next time.
mealyST. In mealyST(γ, f, g, w_0) based processes γ determines the number of nonabsent events that should be handed over to f and g for processing. Again, f and g never see or produce absent events, and the process interface is responsible for providing them with the appropriate input data and for synchronization and timing issues on inputs and outputs. Unlike mealyPT processes, functions f and g in mealyST processes have no influence on when they are invoked. They only control how many nonabsent events must have appeared before their invocation. f and g in mealyPT processes, on the other hand, determine the time instant of their next invocation independent of the number of nonabsent events.
mealyTT. A combination of these two process constructors is mealyTT, which allows control over both the number of nonabsent input events and a maximum time period, after which the process is activated in any case, independent of the number of nonabsent input events received. This allows us to model processes that wait for input events but can set internal timers to provide time-outs.
These examples illustrate that process constructors and MoCs can be defined which allow us to define precisely to which extent communication issues are separated from the purely functional behavior of the processes. Obviously, a stricter separation greatly facilitates verification and synthesis but may restrict expressiveness.
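As an illustration of the mealyST idea, the following Python sketch is our own simplified encoding (one output slot per input event, γ(state) ≥ 1 assumed, ⊥ modeled as None): the interface collects nonabsent events and activates f and g only after γ(state) of them have arrived, so f and g never see absent events.

```python
ABSENT = None  # the absent event

def mealy_st(gamma, f, g, w0):
    """mealyST sketch: gamma(state) gives the number of nonabsent events
    per evaluation cycle; f and g operate on the stripped (untimed)
    sequence. Simplification: one output slot per input slot, with the
    result emitted in the slot where the cycle completes."""
    def process(signal):
        state, buf, out = w0, [], []
        for e in signal:
            if e is not ABSENT:
                buf.append(e)
            if len(buf) == gamma(state):
                out.append(f(state, buf))   # f never sees absent events
                state = g(state, buf)
                buf = []
            else:
                out.append(ABSENT)
        return out
    return process

# Sum every two nonabsent events, whenever they happen to arrive.
pairsum = mealy_st(lambda w: 2, lambda w, evs: sum(evs), lambda w, evs: w, 0)
print(pairsum([1, ABSENT, 2, 3, ABSENT, 4]))  # [None, None, 3, None, None, 7]
```

The invocation instants depend on when nonabsent events arrive, which is precisely what distinguishes mealyST from the timer-driven mealyPT.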
4.3 Integration of MoCs
4.3.1 MoC Interfaces
Interfaces between different MoCs determine the relation of the time structure in the different domains and they influence the way a domain is triggered to evaluate inputs and produce outputs. If an MoC domain is time triggered, the time signal is made available through the interface. Other domains are triggered when input data is available. Again, the input data appears through the interfaces.
We introduce a few simple interfaces for the MoCs of the previous sections, in order to be able to discuss concrete examples.
Definition 4.13 A stripS2U process constructor takes no arguments and instantiates a process p : S̄ → Ṡ, which takes a synchronous signal as input and generates an untimed signal as output. It reproduces all data from the input in the output in the same order, with the exception of the absent event, which is translated into the value 0.
Definition 4.14 An insertU2S process constructor takes no arguments and instantiates a process p : Ṡ → S̄, which takes an untimed signal as input and generates a synchronous signal as output. It reproduces all data from the input in the output in the same order without any change.
These interface processes between the synchronous and the untimed MoCs are very simple. However, they establish a strict and explicit time relation between two connected domains.
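Both interface processes are simple enough to sketch directly in Python; the list-based encoding is our own illustration, with ⊥ modeled as None:

```python
ABSENT = None  # the absent event in the synchronous domain

def strip_s2u(s):
    """stripS2U sketch: synchronous -> untimed. The absent event is
    translated into the value 0; all other data pass through in order."""
    return [0 if e is ABSENT else e for e in s]

def insert_u2s(s):
    """insertU2S sketch: untimed -> synchronous. Data pass through
    unchanged, one event per synchronous time slot."""
    return list(s)

print(strip_s2u([5, ABSENT, 7]))  # [5, 0, 7]
print(insert_u2s([1, 2, 3]))      # [1, 2, 3]
```

Although the data transformations are near-trivial, each event of the untimed signal is now pinned to one synchronous time slot, which is how the interfaces impose the time structure of one domain on the other.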
Connecting processes from different MoCs also requires a proper semantic basis, which we provide by defining a hierarchical MoC.
Definition 4.15 A hierarchical model of computation (HMoC) is a 3-tuple HMoC = (M, C, O), where M is a set of HMoCs or simple MoCs, each capable of instantiating processes or process networks; C is a set of process constructors; O is a set of process composition operators that governs the process composition at the highest hierarchy level but not inside process networks instantiated by any of the HMoCs of M.
In the following examples and discussion we will use a specific but rather simple HMoC.
Definition 4.16 H = (M, C, O) with
M = {U-MoC, S-MoC}
C = {stripS2U, insertU2S}
O = {∥, ∘, μ}.
Example 4.2 As an example, consider the equalizer system of Figure 4.4 [39]. The control part consists of two synchronous MoC processes, while the dataflow part, modeled as untimed MoC processes, filters and analyzes an audio stream. Depending on the analysis results of the Analyzer process, the Distortion control will modify the filter parameters. The Button control also takes user input into account to steer the filter. The purpose of Analyzer and Distortion control is to avoid dangerously strong signals that could jeopardize the loudspeakers.
Control and dataflow parts are connected via two interface processes. The dataflow processes can be developed and verified separately in the untimed MoC domain, but as soon as they are connected to the synchronous MoC control part, the time structure of the synchronous MoC domain gets imposed on all the untimed MoC processes. With the simple interfaces of Figure 4.4, the Filter process consumes 4096 data tokens from the primary input and 1 token from the stripS2U process, and it emits 4096 tokens in every synchronous MoC time slot. Similarly, the activity of the Analyzer is precisely defined for every synchronous MoC time slot. Also, the activities of the two control processes are related precisely to the activities of the dataflow processes in every time slot. Moreover, the timing of the two primary inputs and the primary outputs are now related timewise. Their timing must be consistent because the timing of the primary input data determines the timing of the entire system. For example, if the input signal to
FIGURE 4.4 A digital equalizer consisting of a dataflow part and control. The numbers annotating process inputs and outputs denote the number of tokens consumed and produced in each evaluation cycle. (From A. Jantsch. Modeling Embedded Systems and SoCs. Morgan Kaufmann Publishers, San Francisco, CA, 2004. With permission.) (Figure content: the Filter and Analyzer processes form the untimed U-MoC domain; the Button control and Distortion control processes form the synchronous S-MoC domain; stripS2U and insertU2S connect the two domains. The Filter port from the primary input and its output carry 4096 tokens per cycle; all other ports carry 1 token per cycle.)
the Button control process assumes that each time slot has the same duration, the 4096 data samples of the Filter input in each evaluation cycle must correspond to the same constant time period. It is the responsibility of the domain interfaces to relate the timing of the different domains to each other correctly. The time relations established by all interfaces must be consistent with each other and with the timing of the primary inputs. For instance, if stripS2U takes 1 token as input and emits 1 token as output in each evaluation cycle, the insertU2S process cannot take 1 token as input and produce 2 tokens as output.
The interfaces in Figure 4.4 are very simple and lead to a strict coupling between the two MoC domains. Could more sophisticated or nondeterministic interfaces avoid this coupling effect? The answer is no, because even if the input and output tokens of the interfaces vary from evaluation cycle to evaluation cycle in complex or nondeterministic ways, we still have a very precise timing relation in each and every time slot. Since in every evaluation cycle all interface processes must consume and produce a particular number of tokens, this determines the time relation in that particular cycle. Even though this relation may vary from cycle to cycle, it is still well defined for all cycles and hence for the entire execution of the system.
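The consistency requirement can be made concrete with a small sketch: each interface's token counts per evaluation cycle imply a time relation between the two domains, and all interfaces between the same pair of domains must imply the same relation. The helper names below are illustrative, not from the book.

```python
from fractions import Fraction

def time_relation(tokens_in, tokens_out):
    """Time relation implied by one interface process: tokens consumed
    from the source domain per token delivered to the target domain,
    in one evaluation cycle."""
    return Fraction(tokens_in, tokens_out)

def consistent(interfaces):
    """All interfaces between the same two domains must imply one and
    the same time relation."""
    relations = {time_relation(i, o) for i, o in interfaces}
    return len(relations) == 1

# The equalizer's stripS2U and insertU2S both move 1 token per slot:
equalizer_ok = consistent([(1, 1), (1, 1)])      # True
# An insertU2S producing 2 tokens per input token would clash:
broken = consistent([(1, 1), (1, 2)])            # False
```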
The possibly nondeterministic communication delay between MoC domains, as well as between any
other processes, can be modeled, but this should not be confused with establishing a time relation between
two MoC domains.
4.3.2 Interface Refinement
To show this difference and to illustrate how abstract interfaces can be gradually refined to accommodate channel delay information and detailed protocols, we propose the following interface refinement procedure:
1. Add a time interface. When we connect two different MoC domains, we always have to define the time relation between the two. This is the case even if the two domains are of the same type, for example, both synchronous MoC domains, because the basic time unit may or may not be identical in the two domains.
In our MoC framework, the occurrence of events also represents time in both the synchronous MoC and timed MoC domains. Thus, setting the time relation means determining the number of events in one domain that correspond to one event in the other domain. For example, in Figure 4.4 the interfaces establish a one-to-one relation, while the interface in Figure 4.5 represents a 3/2 relation.
FIGURE 4.5 Determining the time relation between two MoC domains. (From A. Jantsch. Modeling Embedded Systems and SoCs. Morgan Kaufmann Publishers, San Francisco, CA, 2004. With permission.) (Figure content: processes P and Q in domains MoC A and MoC B, first connected directly, then via an interface I1 that relates 3 events in MoC A to 2 events in MoC B.)
In other frameworks, establishing a time relation will take a different form. For instance, if languages such as SystemC or VHDL are used, the times of the different domains have to be related to the common time base of the simulator.
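A time interface with an m:n relation can be sketched as a process that consumes m events from one domain and emits n events to the other in each evaluation cycle. How the n outputs are derived from the m inputs is application specific; keeping the first n of each group, as below, is only an assumption for illustration.

```python
def rate_interface(m, n):
    """Sketch of a time interface fixing an m:n time relation: each
    evaluation cycle consumes m events from domain A and emits n
    events to domain B. Keeping the first n of every group of m is
    an illustrative policy, not the book's definition."""
    def process(signal_a):
        out = []
        for i in range(0, len(signal_a) - m + 1, m):
            out.extend(signal_a[i:i + m][:n])
        return out
    return process

i1 = rate_interface(3, 2)   # the 3/2 relation of Figure 4.5
```

Here i1([1, 2, 3, 4, 5, 6]) delivers [1, 2, 4, 5]: two MoC B events for every three MoC A events.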
2. Refine the protocol. Once the time relation between the two domains is established, we have to provide a protocol that is able to communicate over the final interface at that point. The two domains may represent different clocking regimes on the same chip, or one may end up as software while the other is implemented as hardware, or both may be implemented as software on different chips or cores, etc. Depending on the final implementations, we have to develop a protocol fulfilling the requirements of the interface, such as buffering and error control.
In our example in Figure 4.6 we have selected a simple handshake protocol with limited buffering capability. Note, however, that this assumes that for every three events arriving from MoC A there are only two useful events to be delivered to MoC B. The interface processes I1 and I2, and the protocol processes P1, P2, Q1, and Q2, must be designed carefully to avoid both losing data and deadlock.
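The essence of such a handshake can be illustrated by collapsing the protocol into a single loop: the sender may only issue a request when the bounded buffer has room, and each item is acknowledged before the next request. This abstracts the processes I1, I2, P1, P2, Q1, and Q2 into one function and is not the book's construction.

```python
def handshake_channel(items, capacity=1):
    """Minimal handshake sketch: a request places an item into a
    bounded buffer, the receiver consumes it and acknowledges, and
    only then may the next request be issued. With the capacity check
    respected, no data can be lost."""
    in_flight, delivered, trace = [], [], []
    for item in items:
        # sender side: must not overrun the bounded buffer
        assert len(in_flight) < capacity
        in_flight.append(item)
        trace.append(('req', item))
        # receiver side: consume the item and acknowledge it
        delivered.append(in_flight.pop(0))
        trace.append(('ack', item))
    return delivered, trace
```

Running handshake_channel(['a', 'b']) delivers ['a', 'b'] with the trace req a, ack a, req b, ack b; any attempt to send without an acknowledgment would trip the capacity assertion, which is where a real protocol must block instead.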
3. Model the channel delay. To obtain realistic channel behavior, the delay can be modeled deterministically or stochastically. In Figure 4.7 we have added a stochastic delay varying between 2 and 5 MoC B cycles. The protocol will require more buffering to accommodate the varying delays. To dimension the buffers correctly, we have to identify the average and the worst-case behavior that we should be able to handle.
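A stochastic delay process like the D[2,5] of Figure 4.7 can be sketched as follows, with events represented as (cycle, value) pairs and an order-preserving channel assumed; both the representation and the uniform distribution are illustrative choices.

```python
import random

def stochastic_delay(events, lo=2, hi=5, seed=0):
    """Each event is delayed by a number of MoC B cycles drawn
    uniformly from [lo, hi]. Delivery times are made monotone, since
    the channel is assumed to preserve event order."""
    rng = random.Random(seed)
    delivered, earliest = [], 0
    for cycle, value in events:
        t = max(cycle + rng.randint(lo, hi), earliest)  # keep ordering
        delivered.append((t, value))
        earliest = t
    return delivered
```

Feeding it worst-case traffic and inspecting the spread of delivery times is one simple way to estimate the extra buffering the protocol needs.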
The refinement procedure proposed here is consistent with, and complementary to, other techniques proposed, for example, in the context of SystemC [44]. We only want to emphasize here that establishing the time relation between domains has to be separated from channel delay modeling and protocol design. Often these issues
FIGURE 4.6 A simple handshake protocol. (From A. Jantsch. Modeling Embedded Systems and SoCs. Morgan Kaufmann Publishers, San Francisco, CA, 2004. With permission.) (Figure content: interface processes I1 and I2 between MoC A and MoC B, with handshake processes P1 and P2 on one side and Q1 and Q2 on the other.)
FIGURE 4.7 The channel delay can vary between 2 and 5 cycles measured in MoC B cycles. (From A. Jantsch. Modeling Embedded Systems and SoCs. Morgan Kaufmann Publishers, San Francisco, CA, 2004. With permission.) (Figure content: as Figure 4.6, with delay processes D[2,5] inserted into both directions of the channel between the interface processes.)
are not separated clearly, making interface design more complicated than necessary. More details about this procedure and the example can be found in Reference 39.
4.3.3 MoC Refinement
The three introduced MoCs represent three time abstractions, and, naturally, design often starts at higher time abstractions and gradually moves to lower ones. It is not always appropriate to start with an untimed MoC: when timing properties are an inherent and crucial part of the functionality, a synchronous model is a more appropriate starting point. But if we start with an untimed model, we need to map it onto an architecture with concrete timing properties. Frequently, resource sharing makes the consideration of time functionally relevant, because of deadlock problems and complicated interaction patterns. All three phenomena discussed in Section 4.1.4, priority inversion, performance inversion, and over-synchronization, emerge due to resource sharing.
Example 4.3 We therefore discuss an example of MoC refinement from the untimed through the synchronous to the timed MoC, driven by resource sharing. In Figure 4.8 we have two untimed MoC process pairs, which are functionally independent of each other. At this level, under the assumption of infinite buffers and unlimited resources, we can analyze and develop the core functionality embodied by the process-internal functions f and g.
In the first refinement step, shown in Figure 4.9, we introduce finite buffers between the processes. B_n,2 and B_m,2 represent buffers of size n and m, respectively. Since the untimed MoC implicitly assumes infinite buffers between two communicating processes, there is no point in modeling finite buffers in the untimed MoC domain; we just would not see any effect. In the synchronous MoC domain, however, we can analyze
FIGURE 4.8 Two independent process pairs. (Figure content: process pairs P1, Q1 and R1, S1, with
P1 = mealyU(1, f_P1, g_P1, w_P1), Q1 = mealyU(1, f_Q1, g_Q1, w_Q1),
R1 = mealyU(1, f_R1, g_R1, w_R1), S1 = mealyU(1, f_S1, g_S1, w_S1).)
FIGURE 4.9 Two independent process pairs with explicit buffers. (Figure content: refined processes and buffers, with
P2 = mealyS:2:1(f_P2, g_P2, w_P2), Q2 = mealyS(f_Q2, g_Q2, w_Q2), B_n,2 = mealyS(f_Bn,2, g_Bn,2, w_Bn,2),
R2 = mealyS:2:1(f_R2, g_R2, w_R2), S2 = mealyS(f_S2, g_S2, w_S2), B_m,2 = mealyS(f_Bm,2, g_Bm,2, w_Bm,2).)
the consequences of finite buffers. The processes need to be refined. Processes P2 and R2 have to be able to handle full buffers, while processes Q2 and S2 have to handle empty buffers. In the untimed MoC, processes always block on empty input buffers. This behavior can also be modeled easily in synchronous MoC processes. In addition, more complicated behavior, such as time-outs, can be modeled and analyzed. Finding the minimum buffer sizes while avoiding deadlock and preserving the original system behavior is by itself a challenging task. Basten and Hoogerbrugge [45] propose a technique to address this. More frequently, the buffer minimization problem is formulated as part of the process scheduling problem [46, 47].
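The behavior the refined processes must cope with can be sketched as a synchronous bounded-buffer process: per time slot it accepts at most one token (rejecting input when full, which the producer side must handle) and offers at most one token (delivering nothing when empty, which the consumer side must handle). A sketch under these assumptions, with None standing for the absent event:

```python
def bounded_buffer(n):
    """Sketch of a finite buffer B_n as a synchronous process. Each
    time slot: accept one input token unless full, and emit one
    output token if the consumer is ready and the buffer is nonempty.
    Returns (accepted, token_out) per slot."""
    store = []
    def step(token_in, consumer_ready):
        accepted = False
        if token_in is not None and len(store) < n:
            store.append(token_in)
            accepted = True
        token_out = store.pop(0) if consumer_ready and store else None
        return accepted, token_out
    return step
```

With n = 1, a second producer token in a row is rejected until the consumer drains the buffer, which is exactly the condition the refined P2 and R2 must detect and react to.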
The communication infrastructure is typically shared among many communicating actors.
In Figure 4.10 we map the communication links onto one bus, represented as process I3. It contains an arbiter that resolves conflicts when both processes B_n,3 and B_m,3 try to access the bus at the same time. It also implements a bus access protocol that has to be followed by the connecting processes. The synchronous MoC model in Figure 4.10 is cycle true, and the effect of bus sharing on system behavior and performance can be analyzed. A model checker can prove the soundness and fairness of the arbitration algorithm, and performance requirements on the individual processes can be derived to achieve a desirable system performance.
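The arbiter inside I3 can be sketched cycle by cycle; granting pending requests in arrival order, as below, is an assumption for illustration, since the book does not fix the arbitration policy.

```python
def bus_arbiter(requests_per_cycle):
    """Sketch of a bus arbiter: in each cycle at most one requester is
    granted the bus; ungranted requests stay pending and are served
    in arrival order in later cycles."""
    grants, pending = [], []
    for new_requests in requests_per_cycle:
        pending.extend(new_requests)
        grants.append(pending.pop(0) if pending else None)
    return grants
```

If B_n,3 and B_m,3 request in the same cycle, one is granted immediately and the other in the following cycle, which is precisely the contention a cycle-true model makes visible.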
Sometimes it is a feasible option to synthesize the model of Figure 4.10 directly into a hardware or software implementation, provided we can use standard templates for the process interfaces. Alternatively, we can refine the model into a fully timed model. However, we still have various options depending on what exactly we would like to model and analyze. For each process we can decide how much of the timing and synchronization details should be handled explicitly by the process and how much can be handled implicitly by the process interfaces. For instance, in Section 4.2.5 we introduced the constructors mealyST and mealyPT. The first provides a process interface that strips off all absent events at the input and inserts absent events at the output as needed. The internal functions only deal with the functional events and have no access to timing information. This means that an untimed mealyU process can be directly refined into a timed mealyST process with exactly the same functions f and g. Alternatively, the constructor mealyPT provides an interface that invokes the internal functions at regular time intervals. If this interval corresponds to a synchronous time slot, a synchronous MoC process can easily be mapped onto a mealyPT-type process, with the only difference that the functions in a mealyPT process may receive several nonabsent events in each cycle. In both cases the processes experience a notion of time based on cycles.
In Figure 4.11 we have chosen to refine processes P, Q, R, and S into mealyST-based processes to keep them as similar as possible to the original untimed processes. Thus, the original f and g functions can be used without major modification. The process interfaces are responsible for collecting the inputs, presenting them to the f and g functions, and emitting properly synchronized output.
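The effect of a mealyST-style interface can be sketched as a wrapper that strips absent events before invoking the untimed next-state function f and output function g, and re-inserts absent events in the output so the timed signal keeps its length. The real constructor also carries a partitioning argument (the 1 in mealyST(1, ...)); one event per invocation is assumed here, and the names are illustrative.

```python
ABSENT = None

def mealy_st(w0, f, g):
    """Sketch of a mealyST-style process: the interface hides absent
    events from the untimed functions f (next state) and g (output),
    so they can be reused unchanged from the untimed model."""
    def process(timed_signal):
        state, out = w0, []
        for e in timed_signal:
            if e is ABSENT:
                out.append(ABSENT)        # interface handles timing
            else:
                out.append(g(state, e))   # untimed functions see only
                state = f(state, e)       # functional events
        return out
    return process
```

For example, a running-sum process built from untimed f and g passes absent events straight through: mealy_st(0, lambda s, e: s + e, lambda s, e: s + e)([1, ABSENT, 2, ABSENT]) gives [1, ABSENT, 3, ABSENT].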
The buffer and the bus processes, however, have been mapped onto mealyPT processes. The constants λ and λ/2 represent the cycle times for the processes. Process B_m,4 operates with half the cycle time of the other processes, which illustrates that the modeling accuracy can be selected arbitrarily. We can also choose other process constructors, and hence interfaces, where desirable. For instance, some processes can be mapped onto mealyT-type processes in a further refinement step to expose them to even more timing information.
FIGURE 4.10 Two independent process pairs with explicit buffers. (Figure content: processes P3, Q3, R3, S3, buffers B_n,3 and B_m,3, and the shared bus process I3, with
I3 = mealyS:4:2(f_I3, g_I3, w_I3), P3 = mealyS(f_P3, g_P3, w_P3), Q3 = mealyS(f_Q3, g_Q3, w_Q3),
B_n,3 = mealyS:2:1(f_Bn,3, g_Bn,3, w_Bn,3), R3 = mealyS(f_R3, g_R3, w_R3), S3 = mealyS(f_S3, g_S3, w_S3),
B_m,3 = mealyS:2:1(f_Bm,3, g_Bm,3, w_Bm,3).)
FIGURE 4.11 All processes are refined into the timed MoC but with different synchronization interfaces. (Figure content: processes P4, Q4, R4, S4, buffers B_n,4 and B_m,4, and bus process I4, with
I4 = mealyPT:4:2(λ, f_I4, g_I4, w_I4), P4 = mealyST(1, f_P4, g_P4, w_P4), Q4 = mealyST(1, f_Q4, g_Q4, w_Q4),
B_n,4 = mealyPT:2:1(λ, f_Bn,4, g_Bn,4, w_Bn,4), R4 = mealyST(1, f_R4, g_R4, w_R4), S4 = mealyST(1, f_S4, g_S4, w_S4),
B_m,4 = mealyPT:2:1(λ/2, f_Bm,4, g_Bm,4, w_Bm,4).)
4.4 Conclusion
We have tried to motivate why MoCs for embedded systems should be different from the many computational models developed in the past. The purpose of a model of embedded computation should be to support the analysis and design of concrete systems. Thus, it needs to deal with the salient and critical features of embedded systems in a systematic way. These features include real-time requirements, power consumption, architecture heterogeneity, application heterogeneity, and real-world interaction.
We have proposed a framework to study different MoCs that allows us to appropriately capture some, but unfortunately not all, of these features. In particular, power consumption and other nonfunctional properties are not covered. Time is the central focus of the framework, but continuous time models are not included, in spite of their relevance for the sensors and actuators in embedded systems.
Despite the deficiencies of this framework, we hope that we were able to argue well for a few important points:
• Different computational models should and will continue to coexist, for a variety of technical and nontechnical reasons.
• Using the right computational model in a design, and for a particular design task, can greatly facilitate the design process and the quality of the result. What the right model is depends on the purpose and objectives of a design task.
• Time is of central importance, and computational models with different timing abstractions should be used during system development.
From an MoC perspective, several important issues are open research topics and should be addressed urgently to improve the design process for embedded systems:
• We need to identify efficient ways to capture a few important nonfunctional properties in MoCs. At least power and energy consumption, and perhaps signal noise issues, should be attended to.
• The effective integration of different MoCs will require (1) the systematic manipulation and refinement of MoC interfaces and interdomain protocols; (2) the cross-domain analysis of functionality, performance, and power consumption; and (3) global optimization and synthesis, including the migration of tasks and processes across MoC domain boundaries.
• To make the benefits and the potential of well-defined MoCs available in practical design work, we need to project MoCs into design languages, such as VHDL, Verilog, SystemC, C++, etc. This should be done by properly subsetting a language and by developing pragmatics to restrict the use of the language. If accompanied by tools that enforce the restrictions and exploit the properties of the underlying MoC, this will be accepted quickly by designers.
In the future we foresee a continuous and steady further development of MoCs to match future
theoretical objectives and practical design purposes. But we also hope that they become better accepted
as practically useful devices for supporting the design process just like design languages, tools, and
methodologies.
References
[1] Ralph Gregory Taylor. Models of Computation and Formal Language. Oxford University Press, New York, 1998.
[2] Peter van Emde Boas. Machine models and simulation. In J. van Leeuwen, Ed., Handbook of Theoretical Computer Science, Vol. A: Algorithms and Complexity. Elsevier Science Publishers B.V., Amsterdam, 1990, chap. 1, pp. 1–66.
[3] S. Cook and R. Reckhow. Time bounded random access machines. Journal of Computer and System Sciences, 7:354–375, 1973.
[4] B.M. Maggs, L.R. Matheson, and R.E. Tarjan. Models of parallel computation: a survey and synthesis. In Proceedings of the 28th Hawaii International Conference on System Sciences (HICSS), Vol. 2, 1995, pp. 61–70.
[5] S. Fortune and J. Wyllie. Parallelism in random access machines. In Proceedings of the 10th Annual Symposium on Theory of Computing, San Diego, CA, 1978.
[6] Alok Aggarwal, Ashok K. Chandra, and Marc Snir. Communication complexity of PRAMs. Theoretical Computer Science, 71:3–28, 1990.
[7] Phillip B. Gibbons, Yossi Matias, and Vijaya Ramachandran. The QRQW PRAM: accounting for contention in parallel algorithms. In Proceedings of the 5th Annual ACM-SIAM Symposium on Discrete Algorithms, Arlington, VA, January 1994, pp. 638–648.
[8] Eli Upfal. Efficient schemes for parallel communication. Journal of the ACM, 31:507–517, 1984.
[9] A. Aggarwal, B. Alpern, A.K. Chandra, and M. Snir. A model for hierarchical memory. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, May 1987, pp. 305–314.
[10] Bowen Alpern, Larry Carter, Ephraim Feig, and Ted Selker. The uniform memory hierarchy model of computation. Algorithmica, 12:72–109, 1994.
[11] Thomas Lengauer. VLSI theory. In J. van Leeuwen, Ed., Handbook of Theoretical Computer Science, Vol. A: Algorithms and Complexity, 2nd ed., Elsevier Science Publishers, Amsterdam, 1990, chap. 16, pp. 835–868.
[12] Johan Eker, Jörn W. Janneck, Edward A. Lee, Jie Liu, Xiaojun Liu, Jozsef Ludvig, Stephen Neuendorffer, Sonia Sachs, and Yuhong Xiong. Taming heterogeneity: the Ptolemy approach. Proceedings of the IEEE, 91:127–144, 2003.
[13] Rolf Ernst. MPSOC Performance Modeling and Analysis. Paper presented at the 3rd International Seminar on Application-Specific Multi-Processor SoC, Chamonix, France, 2003.
[14] Gilles Kahn. The semantics of a simple language for parallel programming. In Proceedings of the IFIP Congress 74. North-Holland, Amsterdam, 1974.
[15] Edward A. Lee and T.M. Parks. Dataflow process networks. Proceedings of the IEEE, 83:773–801, 1995.
[16] Jarvis Dean Brock. A Formal Model for Non-Deterministic Dataflow Computation. Ph.D. thesis, Massachusetts Institute of Technology, Cambridge, MA, 1983.
[17] J. Dean Brock and William B. Ackerman. Scenarios: a model of nondeterminate computation. In J. Diaz and I. Ramos, Eds., Formalization of Programming Concepts, Vol. 107 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 1981, pp. 252–259.
[18] Paul R. Kosinski. A straightforward denotational semantics for nondeterminate data flow programs. In Proceedings of the 5th ACM Symposium on Principles of Programming Languages, 1978, pp. 214–219.
[19] David Park. The fairness problem and nondeterministic computing networks. In J.W. de Bakker and J. van Leeuwen, Eds., Foundations of Computer Science IV, Part 2: Semantics and Logic. Mathematical Centre Tracts, Amsterdam, The Netherlands, 1983, Vol. 159, pp. 133–161.
[20] Robin Milner. Communication and Concurrency. International Series in Computer Science. Prentice Hall, New York, 1989.
[21] C.A.R. Hoare. Communicating sequential processes. Communications of the ACM, 21:666–677, 1978.
[22] Axel Jantsch, Ingo Sander, and Wenbiao Wu. The usage of stochastic processes in embedded system specifications. In Proceedings of the Ninth International Symposium on Hardware/Software Codesign, April 2001.
[23] Edward Ashford Lee and David G. Messerschmitt. Static scheduling of synchronous data flow programs for digital signal processing. IEEE Transactions on Computers, C-36:24–35, 1987.
[24] Chanik Park, Jaewoong Jung, and Soonhoi Ha. Extended synchronous dataflow for efficient DSP system prototyping. Design Automation for Embedded Systems, 6:295–322, 2002.
[25] Axel Jantsch and Per Bjuréus. Composite signal flow: a computational model combining events, sampled streams, and vectors. In Proceedings of the Design and Test Europe Conference (DATE), 2000.
[26] Nicolas Halbwachs. Synchronous programming of reactive systems. In Proceedings of Computer Aided Verification (CAV), 2000.
[27] Albert Benveniste and Gérard Berry. The synchronous approach to reactive and real-time systems. Proceedings of the IEEE, 79:1270–1282, 1991.
[28] Frank L. Severance. System Modeling and Simulation. John Wiley & Sons, New York, 2001.
[29] Averill M. Law and W. David Kelton. Simulation Modeling and Analysis, 3rd ed., Industrial Engineering Series. McGraw-Hill, New York, 2000.
[30] Christos G. Cassandras. Discrete Event Systems. Aksen Associates, Boston, MA, 1993.
[31] Per Bjuréus and Axel Jantsch. Modeling of mixed control and dataflow systems in MASCOT. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 9:690–704, 2001.
[32] Peeter Ellervee, Shashi Kumar, Axel Jantsch, Bengt Svantesson, Thomas Meincke, and Ahmed Hemani. IRSYD: an internal representation for heterogeneous embedded systems. In Proceedings of the 16th NORCHIP Conference, 1998.
[33] P. Le Marrec, C.A. Valderrama, F. Hessel, A.A. Jerraya, M. Attia, and O. Cayrol. Hardware, software and mechanical cosimulation for automotive applications. In Proceedings of the Ninth International Workshop on Rapid System Prototyping, 1998, pp. 202–206.
[34] Ahmed A. Jerraya and K. O'Brien. Solar: an intermediate format for system-level modeling and synthesis. In Jerzy Rozenblit and Klaus Buchenrieder, Eds., Codesign: Computer-Aided Software/Hardware Engineering. IEEE Press, Piscataway, NJ, 1995, chap. 7, pp. 145–175.
[35] Edward A. Lee and David G. Messerschmitt. An Overview of the Ptolemy Project. Report from the Department of Electrical Engineering and Computer Science, University of California, Berkeley, January 1993.
[36] Edward A. Lee and Alberto Sangiovanni-Vincentelli. A framework for comparing models of computation. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17:1217–1229, 1998.
[37] Edward A. Lee. A Denotational Semantics for Dataflow with Firing. Technical report UCB/ERL M97/3, Department of Electrical Engineering and Computer Science, University of California, Berkeley, January 1997.
[38] Axel Jantsch and Hannu Tenhunen. Will networks on chip close the productivity gap? In Axel Jantsch and Hannu Tenhunen, Eds., Networks on Chip. Kluwer Academic Publishers, Dordrecht, 2003, chap. 1, pp. 3–18.
[39] Axel Jantsch. Modeling Embedded Systems and SoCs: Concurrency and Time in Models of Computation. Systems on Silicon. Morgan Kaufmann Publishers, San Francisco, CA, 2003.
[40] D. Harel. Statecharts: a visual formalism for complex systems. Science of Computer Programming, 8:231–274, 1987.
[41] G. Berry, P. Couronné, and G. Gonthier. Synchronous programming of reactive systems: an introduction to Esterel. In Kazuhiro Fuchi and M. Nivat, Eds., Programming of Future Generation Computers. Elsevier, New York, 1988, pp. 35–55.
[42] Paul le Guernic, Thierry Gautier, Michel le Borgne, and Claude le Maire. Programming real-time applications with SIGNAL. Proceedings of the IEEE, 79:1321–1336, 1991.
[43] N. Halbwachs, P. Caspi, P. Raymond, and D. Pilaud. The synchronous data flow programming language LUSTRE. Proceedings of the IEEE, 79:1305–1320, 1991.
[44] Thorsten Grötker, Stan Liao, Grant Martin, and Stuart Swan. System Design with SystemC. Kluwer Academic Publishers, Dordrecht, 2002.
[45] Twan Basten and Jan Hoogerbrugge. Efficient execution of process networks. In Alan Chalmers, Majid Mirmehdi, and Henk Muller, Eds., Communicating Process Architectures. IOS Press, Amsterdam, 2001.
[46] Sundararajan Sriram and Shuvra S. Bhattacharyya. Embedded Multiprocessors: Scheduling and Synchronization. Marcel Dekker, New York, 2000.
[47] Shuvra S. Bhattacharyya, Praveen K. Murthy, and Edward A. Lee. Software Synthesis from Dataflow Graphs. Kluwer Academic Publishers, Dordrecht, 1996.
5
Modeling Formalisms for Embedded System Design
Luís Gomes
Universidade Nova de Lisboa and UNINOVA
João Paulo Barros
Instituto Politécnico de Beja and UNINOVA
Anikó Costa
Universidade Nova de Lisboa and UNINOVA
5.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-1
5.2 Notions of Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-3
5.3 Communication Support . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-4
5.4 Common Modeling Formalisms. . . . . . . . . . . . . . . . . . . . . . . . 5-5
Finite State Machines • Finite State Machines with Datapath • Statecharts and Hierarchical/Concurrent Finite State Machines • Program-State Machines • Codesign Finite State Machines • Specification and Description Language • Message Sequence Charts • Petri Nets • Discrete Event • Synchronous/Reactive Models • Dataflow Models
5.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-31
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5-32
5.1 Introduction
The importance of the system specification phase is directly proportional to the respective system complexity. Embedded systems have become more and more complex, not only due to increasing system dimensions, but also due to the interactions among the different system design aspects. These include, among others, correctness, platform heterogeneity, performance, power consumption, cost, and time-to-market.
Therefore, a multitude of modeling formalisms have been applied to embedded system design. Typically, these formalisms strive for a maximum of preciseness, as they rely on a mathematical (formal) model.
Modeling formalisms are often referred to as models of computation (MoC) [1–5]. An MoC is composed of a notation and the rules for computing the behavior: the notation constitutes the syntax of the model, while the rules define the model semantics.
Usage of formal models in embedded system design allows (at least) one of the following [2]:
• Unambiguously capturing the required system functionality.
• Verification of the functional specification's correctness with respect to its desired properties.
• Support for synthesis onto specific architecture and communication resources.
• Use of different tools based on the same model (supporting communication among the teams involved in designing, producing, and maintaining the system).
It has to be stressed that model-based verification of properties is a subject of major importance in embedded system design (and in system design in general), as it allows one to verify model correctness even if the system does not (physically) exist, or if it is difficult, dangerous, or costly to analyze the system directly. The construction of a system model brings several advantages, as it forces a more complete comprehension of the system and allows the comparison of distinct approaches. Hence, it becomes easier to identify desired and undesired system properties, as the requirements become more precise and complete.
Most modeling formalisms for embedded system design are based on a particular diagrammatic (or graphical) language. Despite known arguments against diagrammatic languages (e.g., Reference 6), they are presently widely acknowledged as extremely useful and popular for software development, and also for embedded system development in general. The history of the Unified Modeling Language (UML), the Specification and Description Language (SDL), and Message Sequence Charts (MSCs) certainly proves it. Even though diagrammatic languages are often seen as inherently less precise than textual languages, this is certainly not true (see, e.g., References 7 and 8).
These diagrammatic representations are usually graph-based. Finite state machines (FSMs), in their different forms (Moore, Mealy) and extensions (hierarchical and concurrent, Statecharts, etc.), are a well-known example. The same is true for dataflows and Petri nets. These formalisms offer a variety of semantics for the modeling of time, communication, and concurrency.
Besides distinct graphical syntaxes and semantics, different formalisms also have different analysis and verification capabilities.
The plethora of MoCs ready to be used by embedded system designers means that choosing the "best" formalism is a very difficult task for the modeler. Different embedded systems can, and often do, emphasize different aspects, namely the reactive nature associated with their behavior, real-time constraints, or data processing capabilities. The same happens with the available MoCs. For example, some MoCs for embedded systems are control dominated (data processing and computation are minimal), emphasizing the reactiveness of system behavior. Others emphasize data processing, containing complex data transformations, normally described by dataflows. Reactive control systems are in the first group, and digital signal-processing applications are in the second. For example, digital signal-processing applications emphasize the usage of dataflow models, whereas FSMs explicitly emphasize reactiveness. Unfortunately, other aspects are also important to consider when producing the model for the system; for example, the need to model specific notions of time or different modes of communication among components may further complicate the search for the right MoC.
So, in some embedded system designs, heterogeneity in terms of the implementation platforms has to be faced, and it is not possible to find a unique formalism to model the whole system. In those situations, the goal is to decompose the system's model into submodels and to pick the right formalism for each of the different submodels; although, at the end, the designer has to be able to integrate all those models in a coherent way [9]. Several formalisms allow the modeler to partition the system's model and describe it as a collection of communicating modules (components). In this sense, behavior modeling and communication among components are often interdependent. Yet, separating behavior and communication is a sound attitude, as it allows handling system design complexity and the reusability of components. In fact, it is very difficult to reuse components if behavior and communication are intertwined, as the behavior is then dependent on the communication mechanisms with the other components of the system design [2].
Modeling formalisms for embedded system design have been widely studied, and several reviews and textbooks about MoCs can be found in the literature [1–5]. This chapter surveys several modeling formalisms for embedded system design, taking Reference 5 as the main reference and expanding it to encompass a set of additional modeling formalisms widely used by embedded system designers in several application areas.
The following sections address aspects of time representation and communication support. Afterwards, several selected modeling formalisms are presented.
2006 by Taylor & Francis Group, LLC
Modeling Formalisms 5-3
5.2 Notions of Time
Embedded systems are often characterized as real-time systems. Thus the notion of time is extremely
important in many of the modeling formalisms for embedded system design.
Generally speaking, we may identify three approaches to time modeling:
1. Continuous time and differential equations.
2. Discrete time and difference equations.
3. Discrete events.
The first approach (see Figure 5.1[a]) uses differential equations to model continuous time functions.
This approach is mostly used for the modeling of specific interface components, where the continuous
nature of signal evolution is present, such as analog circuit modeling and physical system modeling in
a broad sense.
In the second approach (Figure 5.1[b]), it is assumed that time is discrete; in this sense, difference
equations replace differential equations. A global clock (the tick) defines the specific points in time
where signals have values. For some applications involving heterogeneous components, it is also useful
to consider multirate difference equations (which means that several clock signals are available). Digital
signal processing is one of the main application areas.
In the third approach, a signal is seen as a sequence of events (see Figure 5.1[c]). This concept of
events can be associated with physical signal evolution, as presented in Figure 5.2 for a Boolean signal; there,
event a is associated with the rising edge of signal x, while event ā is generated at all falling edges
of signal x. Extension to other useful types of signals is straightforward, namely for signals that can hold
multivalued, enumerated, or integer values. Each event has a value and a time tag. The events are processed
in chronological order, based on a predefined precedence.
If the time tags are totally ordered [10], we are in the presence of a timed system: for any distinct t1 and t2,
either t1 < t2 or t2 < t1 (this is called a total order). It is possible to define an associated metric, for instance
f(t1, t2) = |t1 − t2|. If the metric is a continuum we have a continuous time system. A discrete-event system
is a timed system where the time tags are totally ordered.
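The discrete-event view can be sketched in code. The following Python fragment (all names are illustrative, not from the chapter) derives rising- and falling-edge events from a sampled Boolean signal, as in Figure 5.2, and checks that the resulting time tags are totally ordered:

```python
# Sketch: discrete-event signals as chronologically ordered (tag, value) events.
# All names are illustrative; the chapter itself defines no concrete API.

def events_from_boolean(samples):
    """Derive rising/falling-edge events from (time, level) samples of a
    Boolean signal, in the spirit of Figure 5.2."""
    events = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v0 == 0 and v1 == 1:
            events.append((t1, "rising"))
        elif v0 == 1 and v1 == 0:
            events.append((t1, "falling"))
    return events

def is_totally_ordered(tags):
    """Time tags are totally ordered iff all tags are distinct: any two
    distinct tags then compare with < one way or the other."""
    return len(tags) == len(set(tags))

samples = [(0, 0), (1, 1), (2, 1), (3, 0), (4, 1)]
evs = events_from_boolean(samples)
print(evs)                      # [(1, 'rising'), (3, 'falling'), (4, 'rising')]
tags = [t for t, _ in evs]
print(is_totally_ordered(tags))  # True
print(abs(tags[0] - tags[1]))    # metric f(t1, t2) = |t1 - t2| -> 2
```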
FIGURE 5.1 Time representations. (From Luís Gomes and João Paulo Barros, Models of Computation for Embedded
Systems. In The Industrial Information Technology Handbook, Richard Zurawski, Ed., Section VI Real Time and
Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005. With permission.)
FIGURE 5.2 From signals to events and conditions.
Two events are synchronous if they have the same time tag attached (they occur simultaneously).
Similarly, two signals are synchronous if for each event in one signal there is a synchronous event in the
other signal, and vice versa. A system is synchronous if every signal in the system is synchronous with every
other signal in the system. In this sense, a discrete-time system is a synchronous discrete-event system.
Totally ordered events are used with digital hardware simulators, namely the ones associated with the
VHDL and Verilog hardware description languages. Any two events are either simultaneous, which means
that they have the same time tag, or one of them precedes the other. Events can be considered
partially ordered in the sense that the order does not include all the events in the system. When tags are
partially ordered instead of totally ordered, the system is untimed. This means that we can build several
event sequences that do not contain all the system events. These missing events are included in other
completely ordered event sequences. It is known [11] that a total order of events cannot be maintained
in distributed systems, where a partial order is sufficient to analyze system behavior. Partial orders have
also been used to analyze Petri nets [12].
An asynchronous system is a system in which no two events can have the same tag [1]. The system
is asynchronous interleaved if tags are totally ordered, and asynchronous concurrent if tags are partially
ordered.
As time is intrinsically continuous, real systems are asynchronous by nature. Yet, synchronicity is a very
convenient abstraction, allowing efficient and robust implementations through the use of a reference
clock signal.
5.3 Communication Support
The complexity of embedded systems usually motivates their decomposition into several interacting
components. These can be more or less independent; for example, they can be executed in true concurrency
or in an interleaved way, but probably all will have to communicate with some other components.
Therefore, communication is of topmost importance. It can be classified as implicit or explicit [3]:
Implicit communication generally requires totally ordered tag events, normally associated with
physical time. In order to support this form of communication it is necessary to have a physically
shared signal (for instance, a clock signal), whose availability may be difficult or infeasible in a large
number of embedded system applications.
Explicit communication imposes an order on the events: the sender process will guarantee that all
the receiver processes are informed about some part of its internal state.
The following models of communication are normally considered:
Handshake: using a synchronization mechanism; all intervening components are blocked, waiting
for conclusion.
Message passing: using a send-receive pattern where the receiver will wait for the message.
Shared variables: the blocking is decided by the control part of the memory where the shared
variable is stored.
The referred communication modes are supported by a set of communication primitives (or by some
combination), namely [3]:
Unsynchronized: producer and consumer(s) are not synchronized; there are no guarantees that
the producer does not overwrite previously produced data, or that the consumer(s) will get all
produced data.
Read-modify-write: this is the common way to get access to shared data structures from different
processes in software; access to the data structure is locked during a data access (either read-write
or read-modify-write); it is an atomic action (indivisible, and thus uninterruptible).
Unbounded FIFO (first in, first out) buffered: the producer generates a sequence of data tokens and the
consumer will get those tokens using a FIFO discipline.
Bounded FIFO buffered: as in the latter, but the buffer size is limited, so the difference between
writes and reads will be bounded by some value. This means that writes can be blocked if
the buffer is full.
Petri net places: producers generate sequences of data tokens and consumers will read those tokens.
Rendezvous: the writing process and the reading process must simultaneously be at the point where
the write and the read occur.
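As one concrete illustration of these primitives, the bounded-FIFO mode can be sketched with Python's standard queue module, whose Queue(maxsize=n) blocks put() exactly when the buffer is full; the producer/consumer structure here is illustrative:

```python
# Sketch: bounded-FIFO communication between a producer and a consumer thread.
# queue.Queue(maxsize=2) blocks put() when the buffer already holds 2 tokens,
# which is exactly the blocking-on-full behavior of a bounded FIFO.
import queue
import threading

buf = queue.Queue(maxsize=2)
received = []

def producer():
    for token in range(5):
        buf.put(token)        # blocks while the FIFO is full
    buf.put(None)             # sentinel: end of stream

def consumer():
    while True:
        token = buf.get()     # blocks while the FIFO is empty
        if token is None:
            break
        received.append(token)

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print(received)               # [0, 1, 2, 3, 4] -- FIFO order preserved
```

An unbounded FIFO is obtained by dropping maxsize; a rendezvous would additionally require the producer to block until the consumer has taken each token.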
5.4 Common Modeling Formalisms
Most modeling formalisms are control dominated or data dominated. However, as already referred,
embedded systems are composed of a mixture of reactive behavior, control functions, and data processing,
especially those targeted at networking and multimedia applications. In the following sections, a set of
selected formalisms is presented, taking FSMs as the starting point, since they have proved adequate for low-
to medium-complexity control-dominated system modeling.
We can find in the literature numerous proposals extending FSMs in several directions. Each extension
tries to overcome one or more intrinsic FSM shortcomings, from the inability to model concurrency and the
associated state-space explosion problem, to data processing modeling and the absence of hierarchical
structuring mechanisms (supporting specification at different levels of abstraction).
After control-dominated formalisms (emphasizing the reactive nature of embedded systems),
dataflow-dominated formalisms will be presented.
5.4.1 Finite State Machines
Finite state machines are common computational models that have been used by system designers for
decades. It is common to represent FSMs in different ways: from graphical representations (like
state diagrams and flowcharts) to textual representations. In this chapter, state diagrams are used.
The modeling attitude is based on the characterization of the system in terms of the global states that the
system can exhibit, and also in terms of the conditions that can cause a change in those states (transitions
between states). A basic FSM consists of a finite set of states S (with a specified initial state, s_is), a set of input
signals I, a set of output signals O, an output function f, and a next-state function h. The next-state and output
functions map a cross-product of S and I into S and O, respectively (h: S × I → S,
f: S × I → O). Two basic models can be considered for output modeling: the Moore-type machine [13],
also called state-based FSM, where outputs are associated with state activation (and where the output
function f only maps states S into outputs O), and the Mealy-type machine [14], also called transition-based
FSM, where outputs are associated with transitions between states. It is important to note that both models
have the same modeling capabilities. The referred FSM model can be limited or extended to accommodate
different needs (specific modeling capabilities or target architectures), as analyzed in some of the following
sections.
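The definition above translates almost directly into code. A minimal Moore-type sketch follows (illustrative names; h and f are given as lookup tables):

```python
# Sketch: a basic Moore-type FSM as the tuple (S, s_is, I, O, h, f).
# h: S x I -> S is the next-state function; f: S -> O the output function.

class MooreFSM:
    def __init__(self, h, f, initial):
        self.h = h            # next-state function, h[(state, input)] -> state
        self.f = f            # output function, f[state] -> output
        self.state = initial  # specified initial state s_is

    def step(self, i):
        self.state = self.h[(self.state, i)]
        return self.f[self.state]

# Toggle machine: output z is active only while state S1 is active.
h = {("S0", "a"): "S1", ("S1", "a"): "S0"}
f = {"S0": "", "S1": "z"}
fsm = MooreFSM(h, f, "S0")
print([fsm.step("a") for _ in range(4)])   # ['z', '', 'z', '']
```

A Mealy-type machine would instead key the output table on (state, input) pairs, attaching outputs to transitions rather than states.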
Figure 5.3 illustrates a basic notation for a state diagram. Circles or ellipses represent states; transitions
between states use a directed arc. Each arc has an attached expression, potentially containing a reference to
the input event and/or to an external condition that will cause the change of state. Outputs can be modeled
as Moore-type output actions (associated with states, such as z in state S2), or as Mealy-type output events
(associated with transitions, such as x in the presented transition expression).
FSMs are a control-dominated MoC, and so intrinsically adequate to model the reactive
component of an embedded system.
We will introduce a running example, adapted from Reference 15, and we will start using an FSM to
model the system controller. The system to be modeled is the controller of an electric car installed
in an industrial plant. The electric car has to carry goods from one point to another, and come back. The
controller receives commands from the operator, namely actuation on key GO to start the movement
from home position, and actuation on key BACK to force the car to return to home position after
FIGURE 5.3 State diagram basic notation.
FIGURE 5.4 Electric car plant running example.
FIGURE 5.5 State diagram models of an electric car plant controller. (From Luís Gomes and João Paulo
Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook,
Richard Zurawski, Ed., Section VI Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL,
2005. With permission.)
the end position is reached. After receiving an order, the car motor is activated accordingly, while the initial, or
the final, position is not yet reached. There are two sensors available for detecting that the home and end positions
have been reached, A and B, respectively. Figure 5.4(a) represents the external view of the controller in terms of
inputs and outputs, and Figure 5.4(b) illustrates the layout of the plant.
Figure 5.5(a) and (b) present two possible (and equivalent) models for the control of the referred
system. The first relies on the evaluation of external conditions (signal values are explicitly checked),
while the second relies on external events (obtained through the preprocessing of external signals). It is
clear that the use of events produces a lighter model, with fewer arcs and inscriptions (it is assumed in this
representation that an event associated with a signal is generated when the signal changes its state from
0 to 1).
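The event-based model of Figure 5.5(b) can be sketched as follows; the encoding is hypothetical, with the Moore outputs M and DIR attached to states and transitions fired by the events GO, B, BACK, and A:

```python
# Sketch of the event-driven controller of Figure 5.5(b): states S0..S3,
# Moore outputs (M, DIR) per state, transitions fired by input events.

TRANSITIONS = {
    ("S0", "GO"): "S1",    # start moving toward the end position
    ("S1", "B"): "S2",     # end position reached, stop
    ("S2", "BACK"): "S3",  # return toward the home position
    ("S3", "A"): "S0",     # home position reached, stop
}
OUTPUTS = {                # (motor on?, direction)
    "S0": (0, None),
    "S1": (1, "right"),
    "S2": (0, None),
    "S3": (1, "left"),
}

def run(events, state="S0"):
    trace = [(state, OUTPUTS[state])]
    for ev in events:
        state = TRANSITIONS.get((state, ev), state)  # ignore irrelevant events
        trace.append((state, OUTPUTS[state]))
    return trace

trace = run(["GO", "B", "BACK", "A"])
print([s for s, _ in trace])   # ['S0', 'S1', 'S2', 'S3', 'S0']
```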
FIGURE 5.6 FSM implementation reference model.
From the point of view of the implementation model, it is common to decompose the system into a set
of functions to compute the next state and the outputs, and a set of state variables, as presented in Figure 5.6.
From the execution semantics point of view, one of two reference approaches can be chosen (which
correspond to different MoCs) [2]:
1. Synchronous FSMs.
2. Asynchronous FSMs.
In synchronous FSMs, both computation and communication happen instantaneously at discrete-time
instants (under the control of clock ticks). In this sense, from the point of view of active state changes, each
transition arc expression is implicitly ANDed with a rising (or falling) edge event of the clock signal.
Referring to Figure 5.6, the clock signal will be connected to the state variables block (for hardware
implementations, a register will be used to implement this block, while for software implementations the
clock will be used to trigger the execution cycle). One strong argument for the use of synchronous FSMs is their
implementation robustness, especially when using synchronous hardware. However, when heterogeneous
implementations are foreseen, some difficulties or inefficiencies may arise (namely in synchronous clock
signal distribution). For distributed heterogeneous systems, it is also of interest to consider a globally
asynchronous locally synchronous approach (GALS systems), where the interaction between components is
asynchronous, although the implementation of each component is synchronous. So, within a synchronous
implementation island (a component), it is possible to rely on robust compilation techniques, either to
optimally map FSMs into Boolean and sequential circuits (hardware) or into software code (supported
by specific tools).
In asynchronous FSMs, process behavior is similar to that of synchronous FSMs, but without
dependency on a clock tick. An asynchronous system is a system in which two events cannot have the same
time tag. In this sense, two asynchronous FSMs never execute a transition at the same time (asynchronous
interleaving). For heterogeneous architectures or for multirate specifications, implementation can be easier
than in the synchronous case. The difficulties come from the need to synchronize communicating transitions,
and to assure that they occur at the same instant, which is essential for a correct implementation of
rendezvous on a distributed architecture.
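A clock-triggered software execution cycle for synchronous FSMs, in the spirit of Figure 5.6, can be sketched as follows (illustrative names): one next-state/output computation runs per clock tick.

```python
# Sketch: clock-triggered execution cycle of a synchronous FSM.
# Each loop iteration plays the role of the clock edge that loads the
# state register, then the Moore output is computed from the new state.

def synchronous_run(next_state, output, state, input_stream):
    outputs = []
    for i in input_stream:            # one iteration per clock tick
        state = next_state(state, i)  # next-state function h: S x I -> S
        outputs.append(output(state)) # Moore output function f: S -> O
    return outputs

# A 2-bit counter that resets on input "r"; h and f given as plain functions.
h = lambda s, i: 0 if i == "r" else (s + 1) % 4
f = lambda s: s
print(synchronous_run(h, f, 0, ["t", "t", "t", "r", "t"]))  # [1, 2, 3, 0, 1]
```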
FSMs have well-known strengths and weaknesses. Among the strengths, we should mention that they
are simple and intuitive to understand, and that they benefit from the availability of robust compilation
tools. These are some of the reasons why designers have extensively used them in the past, and continue
to use them. Unfortunately, several weaknesses prevent their usage for complex systems modeling. Namely,
FSMs do not provide data processing capabilities, support for concurrency modeling, (practical) support
FIGURE 5.7 Control and datapath decomposition.
for data memory, or hierarchical constructs. Several of the modeling formalisms to be presented try to
overcome one, some, or all of the referred weaknesses.
5.4.2 Finite State Machines with Datapath
One common extension to FSMs, trying to cope with the lack of support for data memory and data
processing capabilities, is the Finite State Machine with Datapath (FSMD) [16].
For instance, to model an 8-bit variable with 256 possible values through an FSM, it is necessary to use
256 states; the model loses its expressiveness and the designer cannot manage the specification.
An FSMD adds to a basic FSM a set of variables and redefines the next-state and output functions.
So, an FSMD consists of a finite set of states S (with a specified initial state, s_is), a set of input signals I,
a set of output signals O, a set of variables V, an output function f, and a next-state function h. The
next-state function h maps a cross-product of S, I, and V into S (h: S × I × V → S). The output function f
maps current states to outputs and variables (f: S → O + V). As defined, the output function f only supports
Moore-type outputs; it can also be easily extended to accommodate Mealy-type outputs.
From the implementation point of view, an FSMD model is decomposed as presented in Figure 5.7,
where the control part can be represented by a simple FSM model, and the datapath part can be characterized
through a register transfer architecture. So, the datapath is decomposed into a set of variables to
store operands and results, and a set of processing blocks to perform computation on those values. It has
to be stressed that this is the common reference architecture for single-purpose processor and simple
microprocessor designs.
As a simple example, Figure 5.8 presents the decomposition associated with the modeling of a multiplier
of two numbers, A and B, producing result C through successive additions. Figure 5.8(a) presents the top-level
decomposition and interconnections of control and data blocks, while Figure 5.8(c) presents a simple FSM
to model the control part and Figure 5.8(d) shows the register transfer architecture to support the required
computations (the left-hand side is responsible for counting B times, while the right-hand side is responsible for
the successive additions of A into C).
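A sketch of the Figure 5.8 multiplier as an FSMD follows. The control part steps through states that select datapath operations on the variables RA, RB, CB, and RC; the coding is illustrative and folds the figure's counting/adding states into a single loop state:

```python
# Sketch: FSMD multiplier C = A * B by successive additions.
# Control part: FSM states S0..S4; datapath: variables RA, RB, CB, RC.

def multiply(A, B):
    state = "S0"
    RA = RB = CB = RC = 0
    while True:
        if state == "S0":            # wait for GO (assumed already given)
            state = "S1"
        elif state == "S1":          # LOAD_A, LOAD_B, CLEAR_B, CLEAR_C
            RA, RB, CB, RC = A, B, 0, 0
            state = "S2"
        elif state == "S2":          # STOP when CB == RB, else add and count
            if CB == RB:
                state = "S4"
            else:
                RC += RA             # LOAD_C: successive addition of A into C
                CB += 1              # INC_B: count B times
        elif state == "S4":          # done, assert OK
            return RC

print(multiply(7, 6))   # 42
print(multiply(5, 0))   # 0
```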
5.4.3 Statecharts and Hierarchical/Concurrent Finite State Machines
A second common extension to FSMs tries to cope with the lack of support for concurrency and
hierarchical structuring mechanisms in the model (still emphasizing the reactive part of the model).
FIGURE 5.8 Decomposition of a multiplier into control and datapath.
Several formalisms can be included in the group of hierarchical/concurrent finite state machines
(HCFSMs), all of them including mechanisms for concurrency and hierarchy support, but having different
execution semantics. Among them, Statecharts [7,17] are the most well-known modeling formalism
providing a MoC to specify complex reactive systems. One main advantage of Statecharts over FSMs is the
structuring of the specification, improving readability and easing system maintenance. Those
characteristics were key points that supported their adoption as one of the specification formalisms within
the UML [18-20].
Statecharts are based on state diagrams, plus the notions of hierarchy, parallelism, and communication
between parallel components. Statecharts were informally defined in [17] as Statecharts =
state-diagrams + depth + orthogonality + broadcast-communication.
The depth concept encapsulates the multilevel hierarchical structuring mechanism and is supported by the
XOR refinement mechanism, while the orthogonality concept allows concurrency modeling and is supported
by the AND refinement mechanism. Unfortunately, the semantics of the broadcast-communication mechanism
is not the same in all Statecharts variants, as it was defined in different ways by several authors. This fact had
a strong impact on possible Statecharts operational semantics, as discussed later in this section.
Statecharts define three types of state instances: the set (implementing the AND refinement mechanism),
the cluster (implementing the XOR refinement mechanism), and the simple state. The cluster supports the
hierarchy concept, through encapsulation of state machines. The set supports the concurrency concept,
through parallel execution of clusters.
Figure 5.9 illustrates the usage of the cluster mechanism, adopting a bottom-up approach. Starting with
the SYS_C model, the state diagram composed of states C and D, and associated arcs, can be encapsulated
by the state A, as represented in the SYS_B model. This provides us with a top-level view of the model
composed only of the states A and B, complemented by the inner level if one wants to get further details
about the system behavior, as represented in the SYS_A model. In this sense, the designer has the possibility
to describe the system at different levels of abstraction. The designer is free to follow a top-down or
FIGURE 5.9 Usage of XOR refinement in Statecharts basic models.
FIGURE 5.10 Usage of AND refinement in Statecharts basic models. (From Luís Gomes and João
Paulo Barros, Models of Computation for Embedded Systems. In The Industrial Information Technology Handbook,
Richard Zurawski, Ed., Section VI Real Time and Embedded Systems, chapter 83, CRC Press, Boca Raton, FL, 2005.
With permission.)
a bottom-up approach while producing the system's model by applying the hierarchical decomposition
constructs available through the XOR refinement mechanism.
Figure 5.10 presents a simple model containing a set A composed of three AND components
(B, C, and D); whenever A is activated/deactivated, the associated components B, C, and D will also
be activated/deactivated.
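The XOR refinement of Figure 5.9 can be sketched by letting a cluster own an inner machine, so that the active configuration is a path of nested states (an illustrative encoding, not a full Statecharts semantics):

```python
# Sketch: XOR refinement as nesting. A cluster state "A" encapsulates an
# inner machine over {C, D}; the active configuration is a (top, leaf) path.

CHILDREN = {"A": {"default": "C"}}           # A is a cluster; C its default state
INNER = {("C", "x"): "D", ("D", "y"): "C"}   # transitions inside cluster A
OUTER = {("A", "z"): "B", ("B", "w"): "A"}   # top-level transitions

def step(config, event):
    top, leaf = config
    if top == "A" and (leaf, event) in INNER:
        return (top, INNER[(leaf, event)])   # inner transition, cluster stays
    if (top, event) in OUTER:
        new_top = OUTER[(top, event)]
        # entering a cluster activates its default state (no history here)
        new_leaf = CHILDREN.get(new_top, {}).get("default")
        return (new_top, new_leaf)
    return config                            # event not enabled: no change

cfg = ("B", None)                            # initial state is B (as in SYS_B)
for ev in ["w", "x", "z"]:
    cfg = step(cfg, ev)
    print(cfg)                               # ('A','C'), ('A','D'), ('B',None)
```

Supporting the simple-history property would only require remembering the last active leaf of a cluster instead of re-entering its default state.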
Apart from the referred main characteristics, the Statecharts formalism presents some interesting
features, such as:
The default state, which defines the state that will take control when a transition reaches
a cluster state. In Figure 5.9, SYS_B model, the system initial state is state B, and, after the occurrence
of w, states A and C will become active.
The notion of history, simple or deep, can be associated with cluster state instances. When the system
enters a cluster with the history property, the state that will be active upon entrance will be the one that
was active upon the last exit from that cluster. In the case of the first entrance in the cluster, the
active state will be the default one. This is the case for cluster C of Figure 5.10, which holds
the H attribute inside a circle. The history property can also be deep history, meaning that all
the clusters inside the cluster with the deep-history property also have that property; this is the
case for cluster B in Figure 5.10, which holds the H* attribute.
while performing the action a. (Sometimes the term event is used instead of action.)
An automaton is a more special case, where the set of actions is the set of input/output values. Continuity
of time, if necessary, can be introduced by supplying actions with a duration, that is, by considering
complex actions (a, t), where a is a discrete component of an action (its content) and t is a real number
representing the duration of a. In timed automata, duration is defined nondeterministically and intervals
for possible durations are used instead of specific moments in time.
Transition systems separate the observable part of a system, which is represented by actions, from
the hidden part, which is represented by states. Actions performed by a system are observable by an
external observer and by other systems, which can communicate with the given system, synchronizing their
actions and combining their behaviors. The internal states of a system are not observable; they are
hidden. Therefore, the representation of states can be ignored when considering the external behavior of
a system.
The activity of a system can be described by its history, which is a sequence of transitions beginning
from an initial state:
s0 -a1-> s1 -a2-> ... -> sn -a(n+1)-> s(n+1) -> ...
A history can be finite or infinite. Each history has an observable part (a sequence of actions
a1, a2, ..., an, ...) and a hidden part (a sequence of states). The former is called a trace generated by
the initial state s0 (in Reference 5, the term behavior is used instead of trace). Two states are said to be
trace-equivalent if the sets of all traces generated by these states coincide.
A final history cannot be continued: either it is infinite, or for the last state sn in the sequence there are
no transitions sn -a-> s(n+1) from this state; such a state is called a final state. We distinguish a final state
representing successful termination from deadlock states (states where one part of a system is waiting for an
event caused by another part and the latter is waiting for an event caused by the former) and divergent or
undefined states. Such states can be defined later or constitute livelocks (states that contain hidden infinite
loops or infinite recursive unfolding without observable actions).
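A labeled transition system and the traces it generates can be sketched as follows (a toy encoding; the chapter prescribes no concrete representation):

```python
# Sketch: a labeled transition system as a dict state -> {action -> set(states)},
# with bounded trace generation from an initial state.

LTS = {
    "s0": {"a": {"s1", "s2"}},       # nondeterministic on action a
    "s1": {"b": {"s3"}},
    "s2": {"c": {"s3"}},
    "s3": {},                        # no outgoing transitions: a final state
}

def traces(lts, state, depth):
    """All action sequences of length <= depth generated from `state`."""
    result = {()}
    if depth == 0:
        return result
    for action, targets in lts[state].items():
        for t in targets:
            result |= {(action,) + tr for tr in traces(lts, t, depth - 1)}
    return result

print(sorted(traces(LTS, "s0", 2)))
# [(), ('a',), ('a', 'b'), ('a', 'c')]
```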
Transition systems can be nondeterministic, in which case a system can move from a given state s into
different states performing the same action a. A labeled transition system (without hidden transitions) is
deterministic if for arbitrary transitions s -a-> s' and s -a-> s'', it follows that s' = s''.
6.2.2.1 Behaviors
Agents with the same behavior (i.e., agents which cannot be distinguished by observing their interaction
with other agents and environments) are considered equivalent. We characterize the equivalence of agents
in terms of the complete continuous algebra of behaviors F(A). This algebra has two sorts of elements:
behaviors u ∈ F(A), represented as finite or infinite trees, and actions a ∈ A, and two operations:
prefixing and nondeterministic choice. If a is an action and u is a behavior, prefixing results in a new
behavior denoted as a.u. Nondeterministic choice is an associative, commutative, and idempotent binary
operation over behaviors denoted as u + v, where u, v ∈ F(A). The neutral element of nondeterministic
choice is the deadlock element (impossible behavior) 0. The empty behavior Δ performs no actions and
denotes the successful termination of an agent. The generating relations for the algebra of behaviors are
as follows:
u + v = v + u
(u + v) + w = u + (v + w)
u + u = u
u + 0 = u
δ.u = 0
where δ is the impossible action.
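A finite fragment of this algebra can be sketched by normalizing behaviors into sets of summands, so that commutativity, associativity, and idempotence of + hold by construction (an illustrative encoding; 0 is the empty set of summands):

```python
# Sketch: finite behaviors as frozensets of summands. A summand is either a
# termination constant ("DELTA" for successful termination) or a pair
# (action, behavior) representing the prefixing a.u. Choice u + v is set
# union, so u + v = v + u, (u+v)+w = u+(v+w), u + u = u, and u + 0 = u
# all hold automatically.

ZERO = frozenset()                 # deadlock 0: no summands
DELTA = frozenset({"DELTA"})       # successful termination

def prefix(action, u):
    return frozenset({(action, u)})

def choice(u, v):
    return u | v

u = choice(prefix("a", DELTA), prefix("b", ZERO))
assert choice(u, u) == u           # idempotence: u + u = u
assert choice(u, ZERO) == u        # neutral element: u + 0 = u
assert choice(u, DELTA) != u       # adding termination does change u
print(len(u))                      # 2
```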
Both operations are continuous functions on the set of all behaviors over A. The approximation relation
⊑ is a partial order with minimal element ⊥. Both prefixing and nondeterministic choice are monotonic
with respect to this approximation:
⊥ ⊑ u
u ⊑ v ⇒ u + w ⊑ v + w
u ⊑ v ⇒ a.u ⊑ a.v
The algebra F(A) is constructed so that prefixing and nondeterministic choice are also continuous with
respect to the approximation, and it is closed relative to the limits (least upper bounds) of the directed sets of
finite behaviors. Thus, we can use the fixed point theorem to give a recursive definition of behaviors starting
System Validation 6-5
from the given behaviors. Finite elements are generated by three termination constants: Δ (successful
termination), ⊥ (the minimal element of the approximation relation), and 0 (deadlock).
F(A) can be considered as a transition system with the transition relation defined by u -a-> v if u can be
represented in the form u = a.v + u'. The terminal states are those that can be represented in the form
u + Δ; divergent states are those which can be represented in the form u + ⊥. In algebraic terms, we can
say that u is terminal (divergent) iff u = u + Δ (u = u + ⊥), which follows from the idempotence of
nondeterministic choice. Thus, behaviors can be considered as states of a transition system. Let beh(s)
denote the behavior of an agent in a state s; then the behavior of an agent in state s can be represented as
the solution u_s ∈ F(A) of the system

u_s = Σ_{s -a-> t} a.u_t + ε_s    (6.1)
where ε_s = 0 if s is neither terminal nor divergent, ε_s = Δ if s is terminal but not divergent, ε_s = ⊥ for
divergent but not terminal states, and ε_s = Δ + ⊥ for states which are both terminal and divergent. If all
summands in the representation (6.1) are different, then this representation is unique up to associativity
and commutativity of nondeterministic choice.
As an example, consider the behavior u defined as u = tick.u. This behavior models a clock that
never terminates. It can be represented by a transition system with only one state u, which generates the
infinite history

u -tick-> u -tick-> ...

The infinite tree with only one path representing this behavior can be obtained as the limit of the sequence
of finite approximations u^(0) = ⊥, u^(1) = tick.⊥, u^(2) = tick.tick.⊥, .... Now consider,
u = tick.u + stop.Δ

This is a model of a clock which can terminate by performing the action stop, but where the number of steps
to be done before terminating is not known in advance. The transition system representing this clock
has two states, one of which is a terminal state. The first two approximations of this behavior are

u^(1) = tick.⊥ + stop.Δ
u^(2) = tick.(tick.⊥ + stop.Δ) + stop.Δ

Note that the second approximation cannot be written in the form tick.tick.⊥ + tick.stop.Δ + stop.Δ,
because distributivity of choice does not hold in behavior algebra.

u = tick.u + tick.0

describes a similar behavior, but one that is terminated by deadlock rather than successfully.
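The finite approximations of the recursive clock u = tick.u + stop.Δ can be computed by iterated unfolding. The sketch below uses a hypothetical frozenset encoding, with BOT standing for ⊥ and DELTA for Δ:

```python
# Sketch: finite approximations u^(n) of u = tick.u + stop.DELTA, obtained by
# unfolding the recursion n times starting from BOT (the minimal element).

BOT = frozenset({"BOT"})           # bottom: the undefined behavior
DELTA = frozenset({"DELTA"})       # successful termination

def clock(u):
    """One unfolding step: F(u) = tick.u + stop.DELTA."""
    return frozenset({("tick", u), ("stop", DELTA)})

def approximation(n):
    u = BOT                        # u^(0) = BOT
    for _ in range(n):
        u = clock(u)
    return u

u1 = approximation(1)              # tick.BOT + stop.DELTA
u2 = approximation(2)              # tick.(tick.BOT + stop.DELTA) + stop.DELTA
assert ("stop", DELTA) in u1 and ("tick", BOT) in u1
assert ("tick", u1) in u2          # built by nesting, never by distributing
print(len(u1), len(u2))            # 2 2
```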
6.2.2.2 Bisimilarity
Trace equivalence is too weak to capture the notion of the behavior of a transition system. Consider the
systems shown in Figure 6.1.
Both systems in Figure 6.1 start by performing the action a. But the system on the left-hand side has
a choice at the second step to perform either action b or c. The system on the right can only perform
an action b and can never perform c, or it can only perform c and never perform b, depending on what
decision was made at the first step. The notion of bisimilarity [7] captures the difference between these
two systems.
FIGURE 6.1 Two systems which are trace equivalent but have different behaviors.
A binary relation R ⊆ S × S on the set of states S of a transition system without terminal and divergent
states is called a bisimulation if for each s and t such that (s, t) ∈ R and for each a ∈ A:
1. If s -a-> s', then there exists t' ∈ S such that t -a-> t' and (s', t') ∈ R.
2. If t -a-> t', then there exists s' ∈ S such that s -a-> s' and (s', t') ∈ R.
Two states s and t are called bisimilar if there exists a bisimulation relation R such that (s, t) ∈ R.
Bisimilarity is an equivalence relation whose definition is easily extended to the case when R is defined as
a relation between the states of two different systems, by considering the disjoint union of their sets of states.
Two transition systems are bisimilar if each state of one of them is bisimilar to some state of the other.
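On a finite system, the largest bisimulation can be computed as a greatest fixed point: start from the full relation and delete pairs violating the two transfer conditions until the relation stabilizes. The sketch below is deliberately naive (not an efficient partition-refinement algorithm) and applies it to the two systems of Figure 6.1:

```python
# Sketch: naive greatest-fixpoint computation of the largest bisimulation on a
# finite labeled transition system (dict: state -> {action -> set of states}).

def bisimulation(lts):
    states = list(lts)
    R = {(s, t) for s in states for t in states}
    changed = True
    while changed:
        changed = False
        for (s, t) in list(R):
            ok = all(
                any((s2, t2) in R for t2 in lts[t].get(a, ()))
                for a, targets in lts[s].items() for s2 in targets
            ) and all(
                any((s2, t2) in R for s2 in lts[s].get(a, ()))
                for a, targets in lts[t].items() for t2 in targets
            )
            if not ok:
                R.discard((s, t))
                changed = True
    return R

# The two systems of Figure 6.1, taken as one disjoint system:
lts = {
    "p0": {"a": {"p1"}}, "p1": {"b": {"p2"}, "c": {"p3"}}, "p2": {}, "p3": {},
    "q0": {"a": {"q1", "q2"}}, "q1": {"b": {"q3"}}, "q2": {"c": {"q4"}},
    "q3": {}, "q4": {},
}
R = bisimulation(lts)
print(("p0", "q0") in R)   # False: trace-equivalent but not bisimilar
print(("p2", "q3") in R)   # True: both final, with no transitions
```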
For systems with nontrivial sets of terminal states S_Δ, partial bisimulation is
considered instead of bisimulation. A binary relation R ⊆ S × S is a partial bisimulation if for all s and t
such that (s, t) ∈ R and for all a ∈ A:
1. If s ∈ S_Δ then t ∈ S_Δ, and if s ∉ S_Δ then t ∉ S_Δ.
2. If s -a-> s', then there exists t' such that t -a-> t' and (s', t') ∈ R.
3. If t -a-> t', then there exists s' such that s -a-> s' and (s', t') ∈ R.
A state s of a transition system S is called a bisimilar approximation of t, denoted by s ⊑_B t, if there
exists a partial bisimulation R such that (s, t) ∈ R. Bisimilarity s ~_B t can then be introduced as the
relation s ⊑_B t ∧ t ⊑_B s. For attributed transition systems, the additional requirement is that if (s, t) ∈ R,
then s and t have the same attributes.
A divergent state without transitions approximates arbitrary other states that are not terminal. If s
approximates t and s is convergent (not divergent), then t is also convergent, s and t have transitions for
the same sets of actions, and satisfy the same conditions as for bisimulation without divergence. Otherwise,
if s is divergent, the set of actions for which s has transitions is only included in the set of actions for
which t has transitions, that is, s is less defined than t. For the states of a transition system it can be
proved that

s ⊑_B t ⇔ beh(s) ⊑ beh(t)
s ~_B t ⇔ beh(s) = beh(t)

and, therefore, the states of an agent considered up to bisimilarity can be identified with the corresponding
behaviors. If S is the set of states of an agent, then U = {beh(s) | s ∈ S} is the set of all its behaviors.
This set is transition closed, which means that u ∈ U and u -a-> v implies v ∈ U. Therefore, U is also a
transition system equivalent to S and can be used as a standard behavior representation of an agent.
For many applications, a weaker equivalence, such as weak bisimilarity introduced by Milner [8] or
insertion equivalence as discussed in Section 6.2.3, has been considered. Note that, for deterministic
systems, if two systems are trace equivalent, they are also bisimilar.
6.2.2.3 Composition of Behaviors
Composition of behaviors is defined as an operation over agents and is expected to preserve equivalence;
it can, therefore, also be defined as an operation on behaviors.
The sequential composition of behaviors u and v is a new behavior denoted as (u; v) and defined by
means of the following inference rules and equations:
u -a-> u'  ⟹  (u; v) -a-> (u'; v)    (6.2)

((u + Δ); v) = (u; v) + v    (6.3)
((u + ⊥); v) = (u; v) + ⊥    (6.4)
(0; u) = 0    (6.5)

We consider a transition system with states built from arbitrary behaviors over the set of actions A by
means of the operations of the behavior algebra F(A) and a new operation denoted as (u; v). Expressions
are considered up to the equivalence defined by the above equations (thus, the extension of a behavior
algebra by this operation is conservative). The inference rule (6.2) defines a transition relation on the set of
equivalence classes.

From rule (6.2) and equations (6.3) and (6.4) it follows that (Δ; v) = v and (⊥; v) = ⊥. One can prove that
(u; Δ) = u and that sequential composition is associative and distributes to the left:

((u + v); w) = (u; w) + (v; w)

Sequential composition can also be defined explicitly by the following recursive definition:

(u; v) = Σ_{u -a-> u'} a.(u'; v) + Σ_{u = u + ε} (ε; v)

where the second sum is over the termination constants ε that are summands of u.
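For finite behaviors, this recursive definition translates directly into code. In the sketch below (Python; the representation of a behavior as a pair of a transition set and a set of termination constants is our own encoding, not the handbook's), seq implements equations (6.2) through (6.5):

```python
# Our own encoding of finite behaviors as (transitions, terminations):
# transitions is a frozenset of (action, behavior) pairs; terminations is
# a frozenset drawn from {"delta", "bot"}. Deadlock 0 has neither.
def beh(trans=(), term=()):
    return (frozenset(trans), frozenset(term))

DELTA = beh(term=["delta"])     # successful termination
BOT = beh(term=["bot"])         # divergence
ZERO = beh()                    # deadlock

def plus(u, v):                 # nondeterministic choice u + v
    return beh(u[0] | v[0], u[1] | v[1])

def seq(u, v):
    """Sequential composition (u; v) following equations (6.2)-(6.5)."""
    trans, term = u
    out_trans = {(a, seq(u1, v)) for a, u1 in trans}
    out_term = set()
    if "delta" in term:         # ((u + delta); v) = (u; v) + v
        out_trans |= v[0]
        out_term |= v[1]
    if "bot" in term:           # ((u + bot); v) = (u; v) + bot
        out_term.add("bot")
    return beh(out_trans, out_term)

a_beh = beh([("a", DELTA)])
b_beh = beh([("b", DELTA)])
```

On these small examples the stated laws, such as associativity and left distributivity, can be checked by structural equality.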
6.2.2.3.1 Parallel Composition of Behaviors
We define an algebraic structure on the set of actions A by introducing the combination a × b of actions
a and b. This operation is commutative and associative with the impossible action ∅ as the zero element
(a × ∅ = ∅). As ∅.u = 0, there are no transitions labeled ∅. The inference rules and equations defining
the parallel composition u ∥ v of behaviors u and v are:

u -a-> u', v -b-> v', a × b ≠ ∅
--------------------------------
u ∥ v -(a × b)-> u' ∥ v'

u -a-> u'
------------------
u ∥ v -a-> u' ∥ v

v -b-> v'
------------------
u ∥ v -b-> u ∥ v'

(u + Δ) ∥ (v + Δ) = (u + Δ) ∥ (v + Δ) + Δ
(u + ⊥) ∥ v = (u + ⊥) ∥ v + ⊥
u ∥ (v + ⊥) = u ∥ (v + ⊥) + ⊥

The following equations for termination constants are direct consequences of these definitions:

Δ ∥ Δ = Δ,  ⊥ ∥ Δ = Δ ∥ ⊥ = ⊥ ∥ ⊥ = ⊥
0 ∥ ε = ε ∥ 0 = 0  if ε ≠ ε + ⊥
0 ∥ ε = ε ∥ 0 = ⊥  if ε = ε + ⊥
Parallel composition is commutative and associative.
Parallel composition is the primary means for describing the interaction of agents. The simplest interaction
is interleaving, which trivially defines the combination as a × b = ∅ for arbitrary actions. Agents
in a parallel composition interact with each other and can synchronize via combined actions. Parallel
composition can also be defined explicitly by the following recursive definition:
composition can also be dened explicitly by the following recursive denition:
(u v ) =
u
a
u
v
b
v
(a b ) (u
) +
u
a
u
a (u
v ) +
v
b
v
b (u v
) +
u
v
where
u
is a termination constant in the equational representation of behavior u.
6.2.3 Environments
An environment E is an agent over an action algebra C with an insertion function. All states of the
environment are initial states. The insertion function, denoted by e[u], takes the behavior e of an
environment and the behavior u of an agent over an action algebra A (the action algebra
of agents may be a parameter of the environment) and yields a new behavior of the same environment.
The insertion function is continuous in both of its arguments.
We consider agents up to a weaker equivalence than bisimilarity. Consider the example in Figure 6.2.
Clearly, these systems are not bisimilar. However, if a represents the transmission of a message, and b
represents the reception of that message, the second trace of the system on the left-hand side of the figure
would not be possible within an environment that supports asynchronous message passing. Consequently,
both systems would always behave the same. Insertion equivalence captures this situation: the environment
can impose constraints on the inserted agent, such as disallowing the behavior b.a in this example. In
such an environment, both behaviors shown in Figure 6.2 are considered equivalent.
Insertion equivalence depends on the environment and its insertion function. Two agents u and v are
insertion equivalent with respect to an environment E, written as u ~_E v, if for all e ∈ E, e[u] = e[v].
Each agent u defines a transformation on the set of environment states; two agents are equivalent with
respect to a given environment if they define the same transformation of the environment.
FIGURE 6.2 Two systems which are not bisimilar, but may be insertion equivalent.
FIGURE 6.3 Agents in environment.
After insertion of an agent into an environment, the new environment is ready to accept new agents to
be inserted. Since insertion of several agents is a common operation, we shall use the notation

e[u₁, …, uₙ] = e[u₁] ⋯ [uₙ]

as a convenient shortcut for insertion of several agents.
In this expression, u₁, …, uₙ are agents inserted into the environment simultaneously, but the order of
insertion may be essential for some environments. If we want an agent v to be inserted after an agent u,
we must find some transition e[u] -a-> s and consider the expression s[v]. Some environments can move
independently, suspending the actions of an agent inserted into them. In this case, if e[u] -a-> e'[u], then
e'[u, v] describes the simultaneous insertion of u and v into the environment in state e' as well as the
insertion of u when the environment is in state e, followed by the insertion of v.
An agent can be inserted into the environment e[u₁, u₂, …, uₙ], or that environment can itself be
considered as an agent which can be inserted into a new external environment e'.

An insertion function can be defined by means of rewriting rules of the form

F(x)[G(y)] → F'(z)[G'(z)]

where x = (x₁, …, xₙ), y = (y₁, …, yₙ), z = (x₁, x₂, …, y₁, y₂, …); x₁, x₂, …, y₁, y₂, … are action or
behavior variables, and F, G, F', G' are expressions in the behavior algebra, that is, expressions built by
nondeterministic choice and prefixing. More complex rules allow arbitrary expressions on the right-hand
side in the behavior algebra extended by insertion as a two-sorted operation. The first type of rule defines
observable transitions

F(x)[G(y)] -d-> F'(z)[G'(z)]
The second type of rule defines unlabeled transitions, which can be used as auxiliary rules. They are not
observable outside the environment and can be reduced by the rule

e[u] →* e'[u'],  e'[u'] -d-> e''[u'']
--------------------------------------
e[u] -d-> e''[u'']

where →* means the transitive closure of unlabeled transitions. Special rules or equations must be added
for termination constants. Rewriting rules must be left linear with respect to the behavior variables, that is,
none of the behavior variables can occur more than once in the left-hand side. Additional completeness
conditions must be present to ensure all possible states of the environment are covered by the left-hand
sides of the rules. Under these conditions, the insertion function will be continuous even if there
are infinitely many rules. This is because, to compute the function e[u], one needs to know only some
finite approximations of e and u. If e and u are defined by means of a system of fixed point equations,
these approximations can be easily constructed by unfolding these equations sufficiently many times.
Insertion functions that are defined by means of rewriting rules can be classified on the basis of the
height of the terms F(x) and G(y) in the left-hand sides of the rules. The simplest case is when this height
is no more than 1, that is, terms are sums of variables and expressions of the form c.z, where c is
an action and z is a variable. Such insertion functions are called one-step insertions; other important
classes are head insertion and look-ahead insertion functions. For head insertion, the restriction that the
height should not exceed 1 refers only to the agent behavior term G(y). The term F(x) can be
of arbitrary height. Head insertion can be reduced to one-step insertion by changing the structure of
the environment while preserving the insertion equivalence of agents. In head insertion, the interaction
between the environment and agent is similar to the interaction between a server and a client: a server
has information only about the next step in the behavior of the client but knows everything about its own
behavior. In a look-ahead insertion environment, the behavior of an agent can be analyzed for arbitrarily
long (but finite) future steps. We can liken such an environment to the interaction between an interpreter
and a program.
We consider one-step insertion, which applies in many practical cases, restricting ourselves to
purely additive insertion functions, that is, those satisfying the conditions

(Σᵢ eᵢ)[u] = Σᵢ eᵢ[u],   e[Σᵢ uᵢ] = Σᵢ e[uᵢ]

Given two functions D₁ : A × C → 2^C and D₂ : C → 2^C, the transition rules for insertion functions are

u -a-> u', e -c-> e', d ∈ D₁(a, c)
-----------------------------------
e[u] -d-> e'[u']

e -c-> e', d ∈ D₂(c)
---------------------
e[u] -d-> e'[u]
We refer to D₁ and D₂ as residual functions. The first rule (interaction rule) defines the interaction between
the agent and the environment, which consists of choosing a matching pair of actions a ∈ A and c ∈ C.
Note that the environment and the agent move independently. If the choice of action is made first by
the environment, then the choice of action c by the environment defines a set of actions that the agent
may take: a can be chosen only so that D₁(a, c) ≠ ∅. The observable action d must be selected from the
set D₁(a, c). This selection can be restricted by the external environment, if e[u] considered as an agent
is inserted into another environment, or by other agents inserted into environment e[u] after u. This rule can be
combined with rules for unobservable transitions if some action, say τ (as in Milner's CCS), is selected in C
to hide the transition. For this case we formulate the interaction rule to account for hidden interactions:
u -a-> u', e -c-> e', τ ∈ D₁(a, c)
-----------------------------------
e[u] → e'[u']
The second rule (environment move rule) describes the case when the environment transitions independently
of the inserted agent, and the agent waits until the environment allows it to move.
Unobservable transitions can also be combined with environment moves. Some equations should be
added for the case when e or u are termination constants. We shall assume that ⊥[u] = ⊥, 0[u] = 0,
e[Δ] = e, e[⊥] = ⊥, and e[0] = 0. There are no specific assumptions about Δ[u], but usually neither ⊥
nor 0 belongs to E. Note that, in the case when Δ ∈ E and Δ[u] = u, insertion equivalence coincides
with bisimulation. The definition of the insertion function for one-step insertion discussed earlier will be
complete if we assume that there are no transitions other than those defined by the rules.
The definition above can be expressed in the form of rewriting rules as follows:

d ∈ D₁(a, c)  ⟹  (c.x)[a.y] → d.x[y]
d ∈ D₂(c)  ⟹  (c.x)[y] → d.x[y]

and in the form of an explicit recursive definition as

e[u] = Σ_{e -c-> e'} Σ_{u -a-> u'} Σ_{d ∈ D₁(a, c)} d.e'[u'] + Σ_{e -c-> e'} Σ_{d ∈ D₂(c)} d.e'[u] + ε_e[u]
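These rules are directly executable for finite environments and agents. In the sketch below (Python), the LTS dictionaries, the state names, and the concrete residual functions are all invented for illustration; only the two rule shapes come from the text:

```python
# One-step insertion: compute the transitions of e[u] from the
# transitions of e and u and the residual functions d1, d2.
def insertion_transitions(env_trans, agent_trans, e, u, d1, d2):
    """d1(a, c) and d2(c) each return a set of observable actions."""
    result = set()
    for c, e2 in env_trans.get(e, ()):
        # interaction rule: agent and environment move together
        for a, u2 in agent_trans.get(u, ()):
            for d in d1(a, c):
                result.add((d, (e2, u2)))
        # environment move rule: the agent waits
        for d in d2(c):
            result.add((d, (e2, u)))
    return result

# Invented example: an environment action is a pair
# (expected agent action, residual action offered to the outside).
env = {"e0": {(("a", "d"), "e1")}}
agent = {"u0": {("a", "u1")}}
D1 = lambda a, c: {c[1]} if c[0] == a else set()
D2 = lambda c: set()
```

Here the environment in state e0 offers to synchronize on a and to emit the residual d, so e0[u0] has exactly one transition, labeled d, into e1[u1].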
To compute transitions for the multiagent environment e[u₁, u₂, …, uₙ], we recursively compute
transitions for e[u₁], then for e[u₁, u₂] = (e[u₁])[u₂], and eventually for e[u₁, u₂, …, uₙ] =
(e[u₁, u₂, …, u_{n-1}])[uₙ].
Important special cases of one-step insertion functions are parallel and sequential insertion. An
insertion function is called a parallel insertion if

e[u, v] = e[u ∥ v]

This means that the subsequent insertion of two agents can be replaced by the insertion of their parallel
composition. The simplest example of a parallel insertion is defined as e[u] = e ∥ u. This special case
holds when the sets of actions of environment and agents are the same (A = C), b ∈ D₁(a, a × b), and
D₂(a) = A. In the case when Δ ∈ E, this environment is a set of all other agents interacting with a given
agent in parallel, and insertion equivalence coincides with bisimilarity. Sequential insertion is introduced
in a similar way:

e[u, v] = e[u; v]

This situation holds, for example, when D₁(a, c) = ∅, D₂(c) = C, and Δ[u] = u.
6.2.3.2 Example: Agents over a Shared and Distributed Store
As an example, consider a store, which generalizes the notions of memory, databases, and other information
environments used by programs and agents to hold data. An abstract store environment E is an
environment over an action algebra C, which contains the set of actions A used by agents inserted into this
environment. We shall distinguish between local and shared store environments. The former can interact
with an agent inserted into it while this agent is not in a final state; if another agent is inserted into this
environment, the activity of the latter is suspended until the former completes its work. A shared store
admits interleaving of the activity of the agents inserted into it, and they can interact concurrently through
this shared store.

6.2.3.2.1 Local and Shared Store
The residual functions for a local store are defined as

D₁(a, c) = {d | c = a × d}, where a × d ≠ ∅ for d ∈ C\A and a × d = ∅ otherwise, and D₂(c) = C

and for a shared store as

D₁(a, c) = {d | c = a × d}, where a × d ≠ ∅ for d ∈ C, and D₂(c) = C.
It can be proved that the one-step insertion function for a local store is a sequential insertion and that
one-step insertion for a shared store is a parallel insertion. In other words,

e[u₁, u₂, …] = e[u₁; u₂; …]

for a local store, and

e[u₁, u₂, …] = e[u₁ ∥ u₂ ∥ …]

for a shared store. The interaction move for the local store is defined as

u -a-> u', e -(a × d)-> e'
---------------------------
e[u] -d-> e'[u']
When the store moves according to this rule, an agent inserted into it plays the role of control for this
store. A store in a state e[u] can only perform actions which are allowed by the agent u. The action a can be
combined only with an action d which is not from the action set A and therefore cannot be used by another agent
in a transition. The actions returned by the residual function are external actions and can be observed and
used only from outside the store environment.
Unlike in a local store, in a shared store environment several agents can perform their actions in
parallel, according to the rule

u₁ -a₁-> u₁', …, uₙ -aₙ-> uₙ', e -(a₁ × ⋯ × aₙ × d)-> e'
---------------------------------------------------------
e[u₁ ∥ ⋯ ∥ uₙ ∥ v] -d-> e'[u₁' ∥ ⋯ ∥ uₙ' ∥ v]
An important special case of the store environment E is a memory over a set of names R and a data
domain D. The memory can be represented by an attributed transition system with attributes R and states
e : R → D. Agent actions are assignments and conditions, and their combinations are possible if they can
be performed simultaneously. If a is a set of assignments, then in a transition e -a-> e' the state e' results
from applying a to e. A conjunction of conditions c enables a transition e -(c × a)-> e' if c is valid on e and
e -a-> e'.
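A memory state is simply a function from names to values, and a combined action is a guard plus a set of simultaneous assignments. A minimal sketch (Python; the dictionary and lambda encoding is our own, not the handbook's):

```python
# Memory environment step: a state is a dict from names to values; an
# action is a list of condition predicates combined with a dict of
# simultaneous assignments (name -> function of the old state).
def step(e, conditions, assignments):
    """Apply the combined action c x a to memory state e; returns the
    new state, or None when some condition is not valid on e."""
    if not all(cond(e) for cond in conditions):
        return None
    # every right-hand side is evaluated on the old state e, so the
    # assignments are performed simultaneously
    return {**e, **{name: rhs(e) for name, rhs in assignments.items()}}

state = {"x": 1, "y": 2}
swap = {"x": lambda e: e["y"], "y": lambda e: e["x"]}
```

Because the right-hand sides are all evaluated on the old state, the swap assignment really exchanges the two values instead of copying one of them twice.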
6.2.3.2.2 Multilevel Store
For a shared memory store, the residual action d in the transition

e[u₁ ∥ ⋯ ∥ uₙ ∥ v] -d-> e'[u₁' ∥ ⋯ ∥ uₙ' ∥ v]

is intended to be used by external agents inserted later, but in a multilevel store it is convenient to restrict
the interaction with the environment to a given set of agents which have already been inserted. For this
purpose, a shared memory can be inserted into a higher level closure environment with an insertion
function defined by the equation

g[e[u]][v] = g[e[u ∥ v]]

where g is a state of this environment, e is a shared memory environment, and only the following two
rules are used for transitions in the closure environment:

e[u] -c-> e'[u'], c ∈ C_ext, φ_ext(c, e) ≠ ∅
---------------------------------------------
g[e[u]] -φ_ext(c, e)-> g[e'[u']]

e[u] → e'[u']
----------------------
g[e[u]] → g[e'[u']]
Here C_ext is a distinguished set of external actions. Some external actions can contain occurrences
of names from e. The function φ_ext substitutes the values of these names in c and performs other
transformations to make an action observable for the external environment.
Two-level insertion can be described in the following way. Let R = R₁ ∪ R₂ be divided into two
nonintersecting parts: the external and the internal memory. Let A₁ be the set of actions which change only
the values of R₁ but can use the values of R₂ (external output actions), let A₂ be the set of actions which
change only the values of R₂ but can use the values of R₁ (external input actions), and let A₃ be the set
of actions which change and use only the values of R₂ (internal actions). These sets are assumed to be
defined on the syntactical level. Redefine the residual function D₁ and the transitions of E: let a ∈ A and split
a into a combination of actions α₁(a) × α₂(a) × α₃(a) so that α₁(a) ∈ A₁, α₂(a) ∈ A₂, and α₃(a) ∈ A₃
(some of these actions may be absent). Define the interaction rule in the following way:

u -a-> u', e -(θ(α₂(a)) × α₃(a))-> e'
--------------------------------------
e[u] -(c_θ × α₁(a))-> e'[u']

where θ is an arbitrary substitution of names used in conditions and in the right-hand sides of assignments
of α₂(a) by their values, θb is the application of the substitution θ to b, and c_θ is the substitution
written in the form of the condition r₁ = θ(r₁) ∧ r₂ = θ(r₂) ∧ ⋯. Define φ_ext(b, e) = b·e, that is,
the substitution of the values of R₂ into b.
Consider a two-level structure of a store state

t[g[e₁[u₁]] ∥ g[e₂[u₂]] ∥ ⋯]

where t ∈ D^{R₁} is a shared store and e₁, e₂, … ∈ D^{R₂} represent the distributed store (memory). When a
component g[eᵢ[uᵢ]] performs internal actions, these are hidden and do not affect the shared memory.
Performing external output actions changes the values of names of the shared memory, and external input actions
receive values from the shared memory to change components of the distributed memory. This construction is
easily iterated, as the components of a distributed memory can themselves have a multilevel structure.
6.2.3.2.3 Message Passing
Distributed components can interact via shared memory. We now introduce direct interaction via message
passing. Synchronous communication can be organized by extending the set of actions with a combination
of actions in parallel composition, independently of the insertion function. To describe synchronous data
exchange in the most general abstract schema, let

u = Σ_{d ∈ D} a(d).F(d),   u' = Σ_{d' ∈ D} a'(d').F'(d')

be two agents which use a data domain D for the exchange of information, where the functions a and a'
represent parameterized actions. Then

u ∥ u' = Σ_{a(d) × a'(d') ≠ ∅} (a(d) × a'(d')).(F(d) ∥ F'(d')) + Σ_{d ∈ D} a(d).(F(d) ∥ u') + Σ_{d' ∈ D} a'(d').(F'(d') ∥ u)

(note that ε_u = ε_{u'} = 0, i.e., this is a special case of parallel composition where there are no termination
constants). The first summand corresponds to the interaction of the two agents. The other two summands
reflect the possibility of interleaving. The interaction can be deterministic even if u and u' are
nondeterministic, if a(d) × a'(d') ≠ ∅ for only one pair of data values. The interleaving summands can
be eliminated when the composition is embedded into another parallel composition. They can also be hidden by a closure
environment (similar to restriction in the Calculus of Communicating Systems, CCS).
The exchange of information through combination is bidirectional. An important special case
of information exchange is the use of send/receive pairs. For example, consider the following
combination rule:

send(addr, d) × receive(addr', d') = exch(addr), if addr = addr' and d = d', and ∅ otherwise

In the latter case, if

u = send(addr, d).v
and

u' = Σ_{d' ∈ D} receive(addr, d').F(d')

then the interaction summand of the parallel composition will be exch(addr).(v ∥ F(d)).
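The combination rule itself is a small partial function on actions. A sketch in Python (the tuple encoding of send, receive, and exch actions is our own; None stands for the impossible action ∅):

```python
# send(addr, d) x receive(addr', d') = exch(addr) when the addresses and
# the data match; otherwise the combination is impossible (None).
def comb(act1, act2):
    if act1[0] == "receive" and act2[0] == "send":
        act1, act2 = act2, act1          # the combination is commutative
    if act1[0] == "send" and act2[0] == "receive":
        _, addr, d = act1
        _, addr2, d2 = act2
        if addr == addr2 and d == d2:
            return ("exch", addr)
    return None
```

Any pair of actions that is not a matching send/receive pair, including two sends or mismatched data, combines to the impossible action and therefore contributes no synchronization summand.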
Asynchronous message passing via channels can be described by introducing a special communication
environment. The attributes of this environment are channels, and their values are sequences (queues)
of stored messages. It is organized similarly to the memory environment, but queue operations are used
instead of storing, and send and receive actions are separated in time. This environment is a
special case of a store environment and can be combined with a store environment, keeping separate the
different types of attributes and actions.
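Such a communication environment keeps one FIFO queue per channel: send enqueues and returns immediately, while receive dequeues the oldest message. A sketch (Python; the class and method names are invented for illustration):

```python
from collections import deque

class ChannelEnv:
    """Communication environment: attributes are channels, values are
    queues of messages; send and receive are separated in time."""
    def __init__(self):
        self.channels = {}      # channel name -> queue of messages

    def send(self, channel, message):
        """send enqueues; it never blocks."""
        self.channels.setdefault(channel, deque()).append(message)

    def receive(self, channel):
        """receive dequeues the oldest message, or returns None when
        the receiver must wait on an empty queue."""
        q = self.channels.get(channel)
        return q.popleft() if q else None

env = ChannelEnv()
env.send("c", "m1")
env.send("c", "m2")
```

The queue preserves message order, which is exactly the asynchronous-passing constraint discussed for Figure 6.2: a reception can only follow the corresponding transmission.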
6.2.4 Classical Theories of Concurrency
The theory of interaction of agents and environments [9-11] focuses on the description of multiagent
systems comprised of agents cooperatively working within a distributed information environment.
Other mathematical models for the specification of dynamic and real-time systems interacting with
environments have been developed based on process algebras (CSP, CCS, ACP, etc.), automata models (timed
Büchi and Muller automata, abstract state machines [ASM]), and temporal logics (LPTL, LTL, CTL, CTL*).
New models are being developed to support different peculiarities of application areas, such as Milner's
π-calculus [12] for mobility and its recent extension to object-oriented descriptions.
The environment may change the predefined behavior of an agent. For example, it may contain
some other agents designed independently and intended to interact and communicate with the agent
during its execution. The classical theories of communication consider this interaction as part of the
parallel composition of agents. The influence of the environment can be expressed as an explicit language
operation such as restriction (CCS) or hiding (CSP).

In contrast to the classical theories of interaction, which are based on an implicit and hence not
formalized notion of an environment, the theory of interaction of agents and environments studies them
as objects of different types. In our approach the environment is considered as a semantic notion and is
not explicitly included in the agent. Instead, the meaning of an agent is defined as a transformation of an
environment which corresponds to inserting the agent into its environment. When the agent is inserted
into the environment, the environment changes, and this change is considered to be a property of the agent
described.
6.2.4.1 Process Algebras
An algebraic theory of concurrency and communication that deals with the occurrence of events rather
than with updates of stored values is called a process algebra. The main variants of process algebra are
generally known by their acronyms: CCS [8], the Calculus of Communicating Systems developed by Milner;
CSP [13], Hoare's Communicating Sequential Processes; and ACP, the Algebra of Communicating
Processes of Bergstra and Klop [14]. These theories are based on transition systems and bisimulation
and consider the interaction of composed agents. They employ nondeterministic choice as well as parallel
and sequential composition as primitive constructs. The influence of the environment on the system
may be expressed as an explicit language operation, such as restriction in CCS or hiding in CSP. These
theories consider communicating agents as objects of the same type (this type may be parameterized by
the alphabets of events or actions) and define operations on these types.

The CCS model specifies sets of states of systems (processes) and transitions between these states.
The states of a process are terms, and the transitions are defined by the operational semantics of the
computation, which indicates how and under which conditions a term transforms itself into another
term. Processes are represented by the synchronization tree (or process graph). Two processes are identified
through bisimulation.
CCS introduces a special action τ, called the silent action, which represents an internal and invisible
transition within a process. Other actions are split into two classes: output actions, which are indicated by
an overbar, and input actions, which are not decorated. Synchronization only takes place between a single
input and a single output, and the result is always the silent action τ. Thus, a × ā = τ for all actions a.
Consequently, communication serves only as synchronization; its result is not visible.
The π-calculus [12] is an enhancement of CCS and models concurrent computation by processes
that exchange messages over named channels. A distributed interpretation of the π-calculus provides for
synchronous message passing and nondeterministic choice. The π-calculus focuses on the specification
of the behavior of mobile concurrent processes, where mobility refers to variable communication via
named channels, which are the main entities in the π-calculus. Synchronization takes place only between
two channel agents when they are available for interchange (a named output channel is indicated by an
overbar, while an input channel with the same name is not decorated). The influence of the environment
in the π-calculus is expressed as an explicit operation of the language (hiding). As a result of this operation,
a channel is declared inaccessible to the environment.
CSP explicitly differentiates the set of atomic actions that are allowed in each of the parallel processes.
The parallel combinator is indexed by these sets: in the composition (P _{A}∥_{B} Q), P engages only in events from the set
A, and Q only in events from the set B. Each event in the intersection of A and B requires the synchronous
participation of both processes, whereas other events require only the participation of the relevant single
process. As a result, a × a = a for all actions a. The associative and commutative binary combination operator
describes how the output data supplied by two processes are combined before transmission to their common
environment.
In CSP, a process is considered to run in an environment which can veto the performance of certain
atomic actions. If, at some moment during the execution, no action in which the process is prepared to
engage is allowed by the environment, then a deadlock occurs, which is considered to be observable.
Since in CSP a process is fully determined by the observations obtainable from all possible finite
interactions, a process is represented by its failure set. To define the meaning of a CSP program, we determine
the set of states corresponding to normal termination of the program and the set of states corresponding
to its failures. Thus, the CSP semantics is presented in model-theoretic terms: two CSP processes are
identified if they have the same failure set (failure equivalence).
The main operations of ACP are prefixing and nondeterministic choice. This algebra allows an event
to occur with the participation of only a subset of the concurrently active processes, perhaps omitting
any that are not ready. As a result, the parallel composition of processes is a mixture of synchronization
and interleaving, where each of the processes either proceeds independently or is combined by × with a
corresponding event of another process. The merge operator is defined as

Merge(a, b) = (a × b) + (a; b) + (b; a)

ACP defines its semantics algebraically; processes are identified through bisimulation.

Most differences between CCS, ACP, and CSP can be attributed to differences in the chosen style of
presentation of the semantics: the CSP theory provides a model, illustrated with algebraic laws. CCS is
a calculus, but the rules and axioms in this calculus are presented as laws valid in a given model. ACP is a
calculus that forms the core of a family of axiomatic systems, each describing some features of concurrency.
6.2.4.2 Temporal Logic
Temporal logic is a formal specification language for the description of various properties of systems.
A temporal logic is a logic augmented with temporal modalities to allow the specification of the order of
events in time, without introducing time explicitly as a concept. Whereas traditional logics can specify
properties relating to the initial and final states of terminating systems, a temporal logic is better suited to
describe the ongoing behavior of nonterminating and interacting (reactive) systems.

As an example, Lamport's TLA (Temporal Logic of Actions) [5,15] is based on Pnueli's temporal logic
[16] with assignment and an enriched signature. It supports syntactic elements taken from programming
languages to ease the maintenance of large specifications. TLA uses formulae on behaviors, where a behavior is
considered as a sequence of states. States in TLA are assignments of values to variables. A system satisfies
a formula iff that formula is true in all behaviors of this system. Formulae whose arguments are only
the old and the new states are called actions.
Here, we distinguish between linear and branching temporal logics. In a linear temporal logic, each
moment of time has a unique possible future, while in a branching temporal logic, each moment of time
may have several possible futures. On the one hand, linear temporal logic formulae are interpreted over linear
sequences of points in time and specify the behavior of a single computation of a system. Formulae of
a branching temporal logic, on the other hand, are interpreted over tree-like structures, each describing
the behavior of the possible computations of a nondeterministic system.
Many temporal logics are decidable, and corresponding decision procedures exist for linear and branching
time logics [17], propositional modal logic [18], and some variants of CTL*.

A rewriting rule t → t' has two readings. Computationally, it means that a fragment of a system state that is an instance of
the pattern t can change to the corresponding instance of t'.
Statements which do not explicitly mention state or time are considered as referring to an arbitrary
current state or an arbitrary current moment of time. Statements with Kleene operations refer to discrete
time and are reduced to logical statements as follows:

(P₁ · P₂ · ⋯ · Pₙ) time t = (P₁ time (t − n + 1)) ∧ (P₂ time (t − n + 2)) ∧ ⋯ ∧ (Pₙ time t)

It(P) time t = ∃(s ≤ t) ∀(s')((s ≤ s' ∧ s' ≤ t) → P time s')
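Over an explicit discrete timeline, these reductions can be evaluated directly. A sketch (Python; modeling a predicate as a function from a time point to a truth value is our own encoding):

```python
# Discrete-time evaluation of the Kleene-operation reductions above.
def kleene_seq(preds, t):
    """(P1 . P2 . ... . Pn) time t: each Pi holds at time t - n + i."""
    n = len(preds)
    return all(p(t - n + 1 + i) for i, p in enumerate(preds))

def it(pred, t, start=0):
    """It(P) time t: P holds over some interval [s, t] with s <= t."""
    return any(all(pred(s2) for s2 in range(s, t + 1))
               for s in range(start, t + 1))

after2 = lambda t: t >= 2     # an example predicate: true from time 2 on
```

For instance, a sequence of three copies of after2 holds at time 4 (it covers times 2, 3, 4) but not at time 3 (it would need time 1).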
6.4.4 Example: Railroad Crossing Problem
The railroad crossing problem is a well-known benchmark to assess the expressiveness of development
techniques for interactive systems. We illustrate the description of a synchronous system (in discrete time)
relying on duration functionals. The problem statement is to develop a control device for a railroad
crossing so that safety and liveness conditions are satisfied. This system has three components, as shown
in Figure 6.4.
The n-track railroad has the following observable attributes: InCr is a Boolean variable equal to 1 if
a train is at the crossing; Cmg(i) is a Boolean variable equal to 1 if a train is coming on track number i.
At the moment this attribute becomes equal to 1, the time left until the train reaches the crossing is not
less than d_min, and it remains 1 until the train reaches the crossing. Cmg(i) is an input signal to the
controller, which has a single output signal DirOp. When DirOp equals 1, the gate starts opening, and
when it becomes 0, the gate starts closing. The attribute gate shows the position of the gate. It is equal
to opened when the gate is completely open and closed when it is completely closed. The time taken for
the gate to open is d_open; the time taken to close is d_close. The requirements text below omits the
straightforward static requirements. The dynamic properties of the system are safety and liveness. Safety
means that when the train is at the crossing, the gate is closed. Liveness means that the gate will open
when the train is at a safe distance (Code 6.1).
FIGURE 6.4 Railroad crossing problem: an n-track railroad (signals InCr, Cmg), a controller (output DirOp), and a gate (attribute Gate).
Code 6.1
parameters(
d_min,
d_close,
d_open,
WT );
attributes(n:int)(
InCr:bool,
Cmg(n):bool,
DirOp:bool,
gate );
let C1:(d_min>d_close);
let C2:(d_close>0);
let Duration Theorem: Forall(x,d)(
always(dur Cmg(x) > d -> (¬DirOp))->
always(dur (¬DirOp) > dur Cmg(x)+(-1)*(d+1)) );
/* ------------- Environment spec ------------------------ */
let CrCm: always(InCr->Exist x (dur Cmg(x) > d_min));
let OpnOpnd:always( dur DirOp >d_open ->(gate=opened));
let ClsClsd:always( dur (¬DirOp)>d_close->(gate=closed));
/* ------------ Controller spec ------------------------ */
let Contr1: always(Exist x (dur Cmg(x) > WT ) -> (¬DirOp));
let Contr2: always(Forall x (WT >= dur Cmg(x)) -> DirOp );
/* ------------- Safety and Liveness --------------------- */
let(WT=d_min+(-1)*d_close);
prove Safety: always(InCr->(gate=closed));
prove Liveness: always(
Forall x ((WT > dur Cmg(x)) -> (gate=opened)));
System Validation 6-33
Note the assumption of the Duration Theorem in the requirements to shorten the proofs of safety and
liveness.
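Under the reading of Code 6.1 above (DirOp = 0 commands closing, WT = d_min − d_close), the safety property can be exercised by a minimal discrete-time simulation. The step model, the constants, and the single-track, one-train-per-run setup below are illustrative assumptions, not the handbook's semantics:

```python
# Sketch (not the handbook's semantics): a discrete-time run of the
# crossing with one track and one train; constants are illustrative.
d_min, d_close, d_open = 10, 3, 4
WT = d_min - d_close                      # let(WT = d_min + (-1)*d_close)

def run(arrival):
    """The train starts coming at t = 0 and is in the crossing at
    t = arrival; by CrCm, dur Cmg > d_min there, so arrival > d_min.
    Returns True iff the gate is closed at arrival (Safety)."""
    dur_cmg = dur_not_dirop = 0
    for t in range(arrival + 1):
        dir_op = not (dur_cmg > WT)       # controller: Contr1/Contr2
        dur_not_dirop = 0 if dir_op else dur_not_dirop + 1
        gate_closed = dur_not_dirop > d_close   # gate: ClsClsd
        if t == arrival:
            return gate_closed
        dur_cmg += 1                      # the train is still coming

print(all(run(a) for a in range(d_min + 1, 3 * d_min)))  # Safety holds
```

A train that arrived earlier than d_min would reach the crossing before the gate had been closing for d_close steps, which is exactly the situation the environment assumption CrCm rules out.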
6.4.5 Requirement Specifications
The example in Section 6.4.4 is rather simple in the number of requirements. Requirement specifications
used in practice to describe embedded systems are typically much more complex. Requirement specifications
may consist of hundreds or thousands of static requirements and a large domain description
through attributes and parameters. Each requirement is usually simple, but taken together the resultant
behavior may be complex and contain inconsistencies or be incomplete.
We use attributed transition systems to describe the requirements for embedded systems. The formal
specification of requirements consists of the environment description, the description of common system
properties in the form of axioms, the insertion function defined by static requirements, and intended
properties of the system as a whole defined as dynamic requirements.
A typed list of system parameters and a typed list of system attributes are used to describe the structure
of the environment. The parameters of the system are variables which have influence on the behavior of
the environment; they can change their values from one configuration of the system to another, but they
never change their value during the execution of the system. Examples of system parameters are the set of
tasks for an embedded operating system, the bus threshold for a device controller, etc. System attributes
are variables that differ between the observable states of the environment. Attributes may change their
values during runtime. Examples of attributes are the queue of tasks which are ready to be executed by
the operating system, or the current data packet for a device controller.
As an example, we consider (in simplified form) several fragments of the formalized requirements for
an embedded operating system for automotive electronics, OSEK [103]. A typed list of system parameters
and a typed list of system attributes describe the structure of the environment (Code 6.2).
Code 6.2
parameters (
tasks: Set of name,
resources: Set of name
);
attributes (
suspended: Set of name,
ready: Set of name,
running: name
);
Parameters of the system are variables which have influence on the behavior of the environment and
can change their values from one configuration of the system to another, but never change their value
during the execution of the system.
The operating system (environment) and executing tasks (agents) interact via service calls. The list of
actions contains the names of the services provided by the system, including service parameters,
if any (Code 6.3).
Common system properties are defined as propositions in first-order logic extended with temporal
modalities. For example, consider the following requirement: the length of the queue of suspended
tasks can never be greater than the number of defined tasks. We formalize this requirement as follows
(Code 6.4).
To define the transitions of the system when processing a request for a service, we use Hoare-style
triple notation, as defined above (Code 6.5).
Code 6.3
actions (a: name) (
Activate a,
Terminate,
Schedule );
Code 6.4
Let SuspendedLengthReq:
Always ((length(suspended)<|tasks|) \/ (length(suspended) = |tasks|));
Code 6.5
req Activate1: Forall (a:name, s: Set of name, r: Set of name) (
( (suspended = s) & (ready = r) & (a in s) )
-> after (Activate a)
( (suspended = (s setminus a)) & (ready = (r union a)) ));
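Read operationally, the Activate1 triple is a guarded state transformer: if the precondition holds, the postcondition fixes the new attribute values. A minimal sketch of that reading; the dictionary encoding of the environment state is an assumption, not the tool's semantics:

```python
# Sketch: the Activate1 triple read as a guarded state transformer.
# The dictionary encoding of the environment state is an assumption.

def activate(state, a):
    """Move task a from suspended to ready (precondition: a in suspended)."""
    s, r = state["suspended"], state["ready"]
    if a not in s:                        # precondition fails: rule not applicable
        raise ValueError("precondition of Activate1 not satisfied")
    return {"suspended": s - {a},         # suspended = s setminus a
            "ready": r | {a},             # ready = r union a
            "running": state["running"]}

st = {"suspended": {"t1", "t2"}, "ready": set(), "running": None}
print(activate(st, "t1")["ready"])        # {'t1'}
```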
The insertion function expressed by this rule is sequential, in that only one running task can be
performed at a time; all others are in the suspended or ready state. A task becomes running as a result of
performing a schedule action. It is selected from a queue of ready tasks ordered by priorities. Agents can
change the behavior of the environment by service requests. The interaction between the environment
and the agents is defined by an insertion function, which computes the new behavior of the environment
with inserted agents.
The part of the description of requirements specific to sequential environments is the definition of the
interaction of agents and environments, where this interaction is described by the insertion function. The
most straightforward way to define this function is through interactive requirements: an action is allowed
to be processed if and only if the current state of the environment matches one of the preconditions for
service requests. This is denoted as E -(act)-> E′, intuitively meaning that the environment E allows
the action act and, if it is processed, the environment will be equal to E′.
The agent (the composition of all agents) interacting with the environment requests the service act if and
only if it transits from its current state u into a state u′ by performing act; preconditions of different rules
for the same action cannot be true simultaneously.
Static requirements for synchronous systems can use Kleene expressions over conditions and duration
functions with numeric inequalities in preconditions. These requirements are converted into standard
form with logic statements relating to adjacent time intervals.
FIGURE 6.6 Sample wave diagram: signals S1, S2, S3 over time points t1, t2, t3.
6.4.6 Reasoning about Embedded Systems
The theory of agents and environments has been implemented in the system 3CR [104]. The kernel of
our system [105] consists of a simulator for a generic Action Language (AL) [10,11] for the description of
system behaviors, of services for automatic exploration of the behavior tree of a system, and of a theorem
prover for first-order predicate logic, enriched with a theory of linear equations and inequalities. It provides
the following technologies supporting the development, verification, and validation of requirements for
embedded systems:
Prove the internal consistency and completeness of static requirements of a system.
Prove dynamic properties of the system defined by static requirements, including safety, liveness,
and integrity conditions.
Translate systems described in standard engineering languages (e.g., MSC, SDL, or wave
diagrams) into the first-order format described earlier and simulate these models in user-defined
environments.
Generate test suites for a system defined by verified requirements specifications and validate the
implementations of the system against these test cases.
These facilities can be used in automated as well as in interactive mode. To determine consistency and
completeness of requirements for interactive systems we rely on the theory of interaction of agents and
environments as the underlying formal machinery.
6.4.6.1 Algebraic Programming
The mathematical models described in Section 6.2 can be made more concrete by imposing structure on
the state space of transition systems. A universal approach is to consider an algebraic structure on the set
of states of a system. Then states are represented by algebraic expressions, and transitions can conveniently
be defined by (conditional) rewriting rules. A combination of conditional rewriting rules with congruence
on the set of algebraic expressions can be defined in terms of rewriting logic [32].
Most modern rewriting techniques are considered primarily in the context of equational theories
but could also be applied to first-order or higher-order clausal or nonclausal theorem proving. The
main disadvantage of computations with such systems is their relatively weak performance. For instance,
rewriting modulo associativity and commutativity (AC-matching) is NP-complete. Consequently, these
systems are usually not powerful enough when real-life problems are considered.
Our environment [105] supports reasoning in noncanonical rewriting systems. It is possible to combine
arbitrary systems of rewriting rules with different rewrite strategies. The equivalence relation (basic
congruence) on a set of algebraic expressions is introduced by means of interpreters for operations
which define a canonical form. The primary strategy of rewriting is a one-step syntactic rewriting with
postcanonization by means of reducing the rewritten node to this canonical form. All other strategies are
combinations of the primary strategy with different traversals of the tree representing a term structure.
Rewrite strategies can be chosen from the library of strategies or written as procedures or functions.
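The primary strategy can be sketched concretely: one syntactic rewrite step, followed by reduction of the rewritten node to canonical form. In the sketch below the term representation, the single rule, and the constant-folding canonizer are all illustrative assumptions:

```python
# Sketch of the primary strategy: one syntactic rewrite step followed by
# postcanonization of the rewritten node. Terms are nested tuples; the
# single rule and the constant-folding canonizer are illustrative.

def canon(t):
    """Canonical form: fold additions of integer constants."""
    if isinstance(t, tuple) and t[0] == "+":
        a, b = canon(t[1]), canon(t[2])
        if isinstance(a, int) and isinstance(b, int):
            return a + b
        return ("+", a, b)
    return t

def rewrite_once(t, rules):
    """Apply the first matching rule (root first, then left-to-right),
    canonizing the rewritten node."""
    for matches, rhs in rules:
        if matches(t):
            return canon(rhs(t))
    if isinstance(t, tuple):
        for i in range(1, len(t)):
            r = rewrite_once(t[i], rules)
            if r != t[i]:
                return canon(t[:i] + (r,) + t[i + 1:])
    return t

# One rewriting rule: double(x) -> x + x.
rules = [(lambda t: isinstance(t, tuple) and t[0] == "double",
          lambda t: ("+", t[1], t[1]))]

print(rewrite_once(("double", 3), rules))            # 6
print(rewrite_once(("+", ("double", 2), 5), rules))  # 9
```

Other strategies would vary only the traversal order around the same one-step core.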
The generic AL [11,106] is used for the syntactical representation of agents as programs and is based
on the behavior algebra defined in Section 6.2. The main syntactic constructs of AL are prefixing,
nondeterministic choice, sequential composition, and parallel composition. Actions and procedure calls are
primitive statements. It provides the standard termination constants (successful termination, divergence,
deadlock). The semantics of this language is parameterized by an intensional semantics, defined through
an unfolding function for procedure calls, and an interaction semantics, defined by the insertion function
of an environment into which the program will be inserted. The intensional semantics and the interaction
semantics are defined as systems of rewriting rules.
The intensional semantics of an AL program is an agent which is obtained by unfolding procedure calls
in the program and defining transitions on a set of program states. It is defined independently of the
environment by means of rewriting rules for the unfolding function (unfolding rules), up to bisimulation.
The left-hand side of an unfolding rule is an expression representing a procedure call. The right-hand side
of an unfolding rule is an AL program which may be unfolded further, generating more and more exact
approximations of the behavior under recursive computation.
The only built-in compositions of AL are prefixing and nondeterministic choice. The unfoldings of
parallel and sequential composition are flexible and can be adjusted by the user. Alternatives for parallel
composition are defined by the choice of the combination operator. For example, when the combination
of arbitrary actions is the impossible action, parallel composition is reduced to interleaving. On the other
hand, exclusion of interleaving from the unfolding rules defines parallel composition as synchronization
at each step (similar to handshaking in Milner's π-calculus).
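The role of the combination operator can be shown on finite action sequences. In this sketch the agent encoding and the two combination operators are assumptions chosen for illustration:

```python
# Sketch: unfolding parallel composition with a pluggable combination
# operator. Agents are finite action sequences; names are illustrative.

def par_steps(p, q, combine):
    """First-step alternatives of p || q as (action, (p', q')) pairs."""
    steps = []
    if p:
        steps.append((p[0], (p[1:], q)))     # interleaving: p moves alone
    if q:
        steps.append((q[0], (p, q[1:])))     # interleaving: q moves alone
    if p and q:
        c = combine(p[0], q[0])
        if c is not None:                    # combined (synchronous) step
            steps.append((c, (p[1:], q[1:])))
    return steps

interleave = lambda a, b: None               # combination is impossible
sync = lambda a, b: (a, b)                   # actions always combine

print([a for a, _ in par_steps(["a1"], ["b1"], interleave)])  # ['a1', 'b1']
print([a for a, _ in par_steps(["a1"], ["b1"], sync)])
```

Dropping the two interleaving alternatives while keeping only the combined step corresponds to synchronization at each step, as described above.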
The interaction semantics of AL programs is defined through the insertion function. Programs are
considered up to insertion equivalence. Rewriting rules which define the insertion function (insertion
rules) have the following structure: the left-hand side of an insertion rule is the state or behavior of the
environment with a sequence of agents inserted into this environment (represented as AL programs). The
right-hand side is a program in AL augmented by calls to the insertion function, denoted as env(E, u),
where E is an environment state expression and u is an AL program. To compute the interaction semantics
of an AL program, one uses both the unfolding rules for procedure calls and the insertion rules to unfold calls
to the insertion function.
In this approach, the environment is considered as a semantic notion and is not explicitly included
in the agent. Instead, the meaning of an agent is defined as a transformation of an environment which
corresponds to inserting the agent into its environment. When the agent is inserted into the environment,
the environment changes, and this change is considered to be a property of the agent described.
6.4.6.2 Simulating Transition Systems
The AL has been implemented by means of a simulator [10,106,107], an interactive program which
generates all histories of an environment with inserted agents and which can explore the behavior of
this environment step-by-step, starting from any possible initial state, with branching at nondeterministic
points and backtracking to previous states. The simulator permits forward and backward moves
along histories; in automatic mode it can search for states satisfying predefined properties (deadlock,
successful termination, etc.) or properties defined by the user. The generation of histories may be user
guided and thus permits examination of different histories. The user can retrieve information about the
current state of a system and change this state by means of inserting new agents using different insertion
functions.
Arbitrary data structures can be used for the representation of the states of an environment and the
environment actions. The set of states of an environment is closed under the insertion function e[u],
which is denoted in the simulator as env(e, u). The agent u is represented by an AL expression. Arbitrary
algebraic data structures can be used for the representation of agent actions and procedure calls.
The core of the simulator is specified as a nondeterministic transition system that functions as an
environment for the system model. Actions of the simulating environment are expressed by means of calls
for services of the simulator. Local services define one-step transitions of the simulated system. Global
services permit the user to compute different properties of the behavior of a simulated system. The user
can formulate a property of a state by means of a rewriting rule system or some other predicate function,
and the simulator will search for the existence of a state satisfying the property among the states reachable
from the current state. Examples of such properties are deadlock, successful termination, undefined states,
and so on.
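The global search service can be sketched as a breadth-first exploration of reachable states. The toy step function and deadlock predicate below are illustrative stand-ins for a real environment:

```python
# Sketch of the simulator's global search service: breadth-first
# exploration of reachable states for one satisfying a user predicate.
from collections import deque

def search(initial, step, prop):
    """Return the first reachable state satisfying prop, else None."""
    seen, frontier = {initial}, deque([initial])
    while frontier:
        state = frontier.popleft()
        if prop(state):
            return state
        for nxt in step(state):               # one-step transitions
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return None

# Toy environment: a counter that may do +1 or *2; no moves at >= 20.
step = lambda n: [] if n >= 20 else [n + 1, 2 * n]
deadlock = lambda n: step(n) == []            # no outgoing transitions

print(search(1, step, deadlock))
```

Backtracking exploration of individual histories, as the simulator offers interactively, would replace the queue with a stack and record the path.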
6.4.6.3 Theorem Proving
The proof system [108] is based on the interactive evidence algorithm [109–111], a Gentzen-style
calculus with unification used for first-order reasoning.
The Interactive Evidence Algorithm is a sequent calculus and relies on the construction of an auxiliary
goal as the main inference step which allows easy control of the direction of the search for proofs at each
step through the choice of auxiliary goals. This algorithm can be represented as a combination of two
calculi: inference in the calculus of auxiliary goals is used as a single-step inference in the calculus of
conditional sequents. In a sense, the interactive evidence algorithm generalizes logic programming in that
for the latter, auxiliary goals are extracted from Horn disjuncts while in the interactive evidence algorithm
they are extracted from arbitrary formulae with quantiers (which need not be skolemized).
The interactive evidence algorithm is implemented as a nondeterministic algebraic program extracted
from the calculus based on the simulator for AL. This program is inserted as an agent into a control
environment which searches for a proof, organizes interaction with the user and the knowledge bases,
and implements strategies and heuristics to speed up the proof search. The control environment contains
the assumptions of a conditional sequent, and so the local information can be combined with other
information taken from knowledge base agents and used in search strategies.
The prover is invoked by the function prove, implemented as a simple recursive procedure with
backtracking, which takes an initial conditional sequent as argument and searches for a path from the
initial statement to axioms; this path is then converted to a proof. The inference search is nondeterministic
owing to disjunction rules.
Predicates are considered up to the equivalence defined by means of all Boolean equations except
distributivity. A function Can, defined by means of a system of rewriting rules, defines the reduction of
predicate formulae as well as propositional formulae to a normal form. Predicate formulae are considered
up to renaming of bound variables and the equations ¬(∀x)p = (∃x)¬p and ¬(∃x)p = (∀x)¬p. Associativity,
commutativity, and idempotence of conjunction and disjunction, as well as the laws of contradiction,
excluded middle, and the laws for propositional constants, are used implicitly in these equations.
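A Can-like canonizer for the propositional part can be sketched with flattening, sets, and sorting standing in for associativity, idempotence, and commutativity. The tuple encoding of formulas is an assumption made for illustration:

```python
# Sketch: a Can-like canonizer for propositional formulas, using the
# implicit equations above (associativity, commutativity, idempotence
# of "and"/"or", laws for constants, double negation). Formulas are
# ("and", ...), ("or", ...), ("not", f), atoms, or True/False.

def can(f):
    if not isinstance(f, tuple):
        return f
    if f[0] == "not":
        g = can(f[1])
        if isinstance(g, tuple) and g[0] == "not":
            return g[1]                      # double negation
        if g in (True, False):
            return not g                     # constant laws
        return ("not", g)
    op = f[0]                                # "and" or "or"
    unit = (op == "and")                     # True is the unit of "and"
    args = set()
    for g in map(can, f[1:]):
        if isinstance(g, tuple) and g[0] == op:
            args.update(g[1:])               # associativity: flatten
        elif g == (not unit):
            return not unit                  # absorbing constant
        elif g != unit:
            args.add(g)                      # idempotence via set
    if not args:
        return unit
    if len(args) == 1:
        return args.pop()
    return (op, *sorted(args, key=repr))     # commutativity: sort

print(can(("or", "p", ("or", "q", "p"))))    # ('or', 'p', 'q')
print(can(("and", "p", True)))               # p
```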
6.4.7 Consistency and Completeness
The notion of consistency of requirements in general is equivalent to the existence of an implementation
or model of a system that satisfies these requirements. Completeness means that this model is unique
up to some predefined equivalence. The traditional way of proving consistency is to develop a model
coded in some programming or simulation language and to prove that this code is correct with respect to
the requirements. However, direct proving of correctness is difficult because it demands computing the
necessary invariant conditions for the states of a program. Another method is generating the space of all
possible states of a system reachable from the initial states and checking whether the dynamic requirements
are satisfied in each state. This approach is known as model checking, and many systems which support
model checking have been developed. Unfortunately, model checking is realistic only if the state space is
finite and all reachable states can be generated in a reasonable amount of time.
Our approach proves consistency and completeness of requirements directly, without developing a
model or implementation of the system. We prove that the static requirements define the system
completely and that dynamic properties of consistent requirements are all the logical consequences of static
requirements. Based on this assumption, one can define an executable specification using only static
requirements and then execute it using a simulator.
We distinguish between the consistency and completeness of static requirements and dynamic
consistency. The first is defined in terms of static requirements only and reflects the property of a system
to respond deterministically to actions of the environment. For example, a query from a client to a server, as the
action of an inserted agent, can be selected nondeterministically, but the response must be defined by static
requirements selected in a deterministic manner. When all dynamic requirements are consequences
of static requirements, we say the system is dynamically consistent.
Sufficient conditions for the consistency of static requirements depend on subject domains and implicit
assumptions about the change of observable attributes. For example, for the classes of asynchronous
systems considered previously, the condition for internal consistency is simply that the conjunction of two
preconditions corresponding to different rules with the same action is not satisfiable. Completeness means
that the disjunction of all preconditions for all rules corresponding to the same action is generally valid.
For synchronous systems, on the other hand, it is the nonsatisfiability of two preconditions corresponding
to rules which define conflicting changes to the same (usually binary) attribute. The incompleteness of
static requirements is usually not harmful; it merely postpones design decisions to the implementation
stage. However, it is harmful if there exists an implementation which meets the static requirements but
does not meet the dynamic requirements.
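For Boolean attributes, both sufficient conditions can be checked by brute force over valuations. The following sketch, with hypothetical attribute names and rules, stands in for the prover:

```python
# Sketch: brute-force versions of the asynchronous-system conditions
# above, over Boolean attributes (a tiny stand-in for the prover).
from itertools import product, combinations

ATTRS = ("ready", "busy")

def valuations():
    return (dict(zip(ATTRS, bits))
            for bits in product([False, True], repeat=len(ATTRS)))

def consistent(pres):
    """No two preconditions of the same action hold simultaneously."""
    return all(not (p(v) and q(v))
               for p, q in combinations(pres, 2) for v in valuations())

def complete(pres):
    """The disjunction of all preconditions is generally valid."""
    return all(any(p(v) for p in pres) for v in valuations())

# Two rules for one action, guarded by "busy" and "not busy".
pres = [lambda v: v["busy"], lambda v: not v["busy"]]
print(consistent(pres), complete(pres))  # True True
```

In the tool this check is a proof obligation submitted to the prover rather than an enumeration, which is what makes non-finite attribute domains tractable.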
Dynamic consistency of requirements (the invariance of dynamic conditions expressed using the temporal
modality always) can be proven inductively using the structure of static requirements. Consistency
checking proceeds by formulating and proving consistency conditions for every pair of static requirements
with the same starting event. Every such pair of requirements must satisfy the condition that, for arbitrary
values of attributes, at least one of the two requirements has a false precondition or
the postconditions are equivalent.
Completeness of requirements means that there exists exactly one model for the requirements up to
some equivalence. We distinguish two main cases depending on the focus of the requirements specification.
If the specification defines the environment, the equivalence of environments needs to be considered.
Otherwise, if an agent is defined by the requirements, the equivalence of agents needs to be examined.
Let e and e′ be two environment states (of the same or different environments). We say that e and e′
are equivalent if, for arbitrary bisimilar agents u and u′, e[u] and e′[u′] are also
bisimilar. If there are restrictions on possible behaviors of the agents, we consider admissible agents
rather than arbitrary agents.
Let E and E′ be two environments (each being a set of environment states and an insertion function).
These environments are equivalent if each state of one of the environments is equivalent to some state of
the other.
If the set of requirements defines an agent for a given environment E, logical completeness (with
respect to the agent definition) means that all agents satisfying these requirements are insertion equivalent
with respect to the environment E; that is, if u and u′ satisfy the requirements, then e[u] is bisimilar to e[u′].
We check completeness for the set of all static requirements that refer to the same starting event. Every
such set of requirements must satisfy the condition that for arbitrary values of attributes there must be at
least one among the requirements that is applicable with a true precondition.
6.5 Examples and Results
Figure 6.7 exhibits a design process using the 3CR [104] tool set. The requirements for a system are
represented as input text written in the formal requirements language or translated from engineering
notations, such as SDL or MSC. Static requirements are sent to the checker which establishes their
consistency and completeness. The checker analyzes a requirement statement and generates a logical
FIGURE 6.7 Design process: static and dynamic requirements pass through the checker and prover; structure, behavior, and environment models feed the simulator, which is used to generate an executable spec, generate tests, and validate.
statement expressing the consistency of the given requirement with other requirements already accepted,
as well as a statement expressing completeness after all static requirements have been accepted. Then this
statement is submitted to the prover in order to search for a proof. The prover may return one of three
answers: proved, not proved, or unknown. In the case where consistency could not be proven, one of the
following types of inconsistencies is considered.
Inconsistent formalization. This type of inconsistency can be eliminated through improved
formalization, if the postconditions are consistent for the states where all preconditions are true. Splitting
the requirements can help.
Inconsistency resulting from incompleteness. This is the case when two requirements are consistent,
but the nonintersection of preconditions cannot be proved because complete knowledge of the
subject domain is not available. A discussion with experts or the authors of the requirements is
recommended.
Inconsistency. Preconditions intersect, but postconditions are inconsistent after performing
an action. This is a sign of a possible error, which can be corrected only by changing the requirements.
If the intersection is not reachable, the inconsistency will not actually arise; in this case,
a dynamic property can be formulated and proven.
Dynamic properties are checked after accepting all static requirements. These are logical statements
expressing properties of a system in terms of first-order predicate calculus, extended by temporal
modalities as well as higher-order functions and types. If an inductive proof is needed, all static requirements
are used for generating lemmas to prove the inductive step.
After checking the consistency and completeness of static requirements, the requirements are used for
the automatic generation of an executable specification of a system satisfying the static requirements. At
this point, the dynamic requirements have already been proven to be consequences of static requirements,
so the system also satisfies the dynamic requirements. The next step of system design would be the use of
the obtained information in the next stages of development. For example, executable specifications can
be used for generating complete test cases for system test.
6.5.1 Example: Embedded Operating System
In this section, we shall describe a general model which could be used for developing formal requirements
for embedded operating systems such as OSEK [103].
The requirements for the OSEK operating system can serve as an example of the application of the
general methodology of checking consistency. These requirements comprise two documents: OSEK Concept
and OSEK API. The first document contains an informal description of conformance classes (BCC1,
BCC2, ECC1, ECC2, ECC3) and requirements on the main services of the system. The second document
refines the requirements in terms of C function headers and types of service calls.
Two kinds of requirements can be distinguished in these documents. Static requirements define
permanent properties of the operating system, which must be true for arbitrary states and any single-step
transition. These requirements refer to the structure of operating system states and their changes in
response to the performance of services. Dynamic requirements state global system properties, such as the
absence of deadlocks or priority inversions.
Using the theory of interaction of agents and environments as the formalism for the description
of OSEK, an environment consists of a processor (or processor network), an operating system, and
the external world, which interacts with the environment via some kind of communication network;
agents are tasks interacting with the operating system and communication network via services. We use
nondeterministic agents over a set of actions representing operating system services as models of tasks. The
states of the environment are characterized by a set of observable attributes, with actions corresponding
to the actions of task agents.
Each attribute defines a partial function from the set E of environment states to the set of values D.
E is considered as an algebra with a set of (internal or external) operations defined on it.
The domain D should be defined as abstractly as possible, for example, by means of set-theoretic
constructions (functions, relations, powersets) over abstract data types represented as initial algebras, in order to
be as independent as possible of the details of implementation when formulating the requirements
specifications.
In monoprocessor systems only one agent is in the active state, that is, capturing the processor resource.
If e is a state of the environment containing no active agents, then in the representation e[u] of the environment
the state u is the state of the active agent. All other agents are in nonactive states (suspended and ready states
for OSEK) and are included in the state e as parts of the values of attributes.
The properties of an environment can be divided into static and dynamic properties. Static properties
define one-step transitions of a system; dynamic properties define the properties of the total system. The
general form of a rule for transitions is:

e --c--> e′,  u --a--> u′
⟹  e[u] --d--> e′[u′]

In this rule d, e′, and u′ are determined by c, a, e, and u. The transitions of the environment state are
defined attribute-wise: let the transitions p_i --c--> v_i be defined for all i ∈ I ⊆ [1 : n], where I is the set of all
indices for which such transitions are defined. Then e′.p_i = v_i for i ∈ I and e′.p_i = e.p_i for i ∉ I. From
this definition it follows that if I = ∅ and e --c--> e′, then e′.p_i = e.p_i for all i ∈ [1 : n].
In the case when two states of the environment with equal values of all attributes are bisimilar, this rule
is sufficient to define the transitions of the environment. Otherwise we can introduce a hidden part of the
environment state and consider transitions of attributes jointly with this hidden component.
For space considerations, in Section 6.5.1.1 we show only the example of a simple scheduler applicable
to this class of operating systems.
6.5.1.1 Requirements Specication for a Simple Scheduler
This example of a simplified operating system providing initial loading and scheduling for tasks and
interrupt processing is used as a benchmark to demonstrate the approach for formalizing and checking
the consistency of requirements. We use the terminology of OSEK [103].
The attributes of the scheduler are:
Active, a name
Priority, a partial function from names to natural numbers
Ready, a list of name/agent pairs
Call, a partial function from names to agents
The empty list and the everywhere-undefined function are denoted as Nil. These attributes are defined
only for nonterminal and deterministic states. The actions of task agents are calls for services:
new_task (a, i), a is a name of an agent, i is an integer
activate a, a is a name
terminate
schedule
In the following requirements we assume that the current state of the environment is e[u] and that
u --c--> u′ for a given service c. The values of attributes are their values in the state e. We define the
transitions e[u] --d--> e′[u′].
The actions of the environment include all task actions and, in addition, the following actions, which are
specific to the environment and are addressed to an external observer of scheduler activity:
loaded a, a is a name
activated a, a is a name
activate_error
schedule_error
terminated a, a is a name
schedule u, u is an agent
scheduled a, a is a name
wait
start_interrupt
end_interrupt
6.5.1.1.1 Requirements for new_task
This action replaces the old task with the same name if it was previously defined in the scheduler, or
otherwise adds the task to the environment as a new task. Transitions for the attributes:

priority : f  --new_task(a:v,i)-->  priority : f[a := i]
We use the following notation for the redefinition of functions: if f : X → Y and x ∈ X, then f[x := y] is
a new function g such that g(x) = y and g(x′) = f(x′) for x′ ≠ x. The rule for the environment is:

u --new_task(a:v,i)--> u′
⟹  e[u] --loaded a--> e′[u′]
6.5.1.1.2 Requirements for Activate
We use the following notation: if p is an attribute, its value is a function, and x is in the domain of this
function, then p(x) denotes the current value of this function on x.

call a = v
ready : r  --activate a-->  ready : ord(a : v, r)

The function ord is defined on the set of lists of pairs (a : u), where a is a name and u is an agent, and this
function must satisfy the following system of axioms, where all parameters are assumed to be universally
quantified:

ord(a : , r) = r
priority b ≥ priority a  ⟹  ord(a : u, (b : v, r)) = (b : v, ord(a : u, r))

Hence ready is a queue of task agents ordered by priorities, and adding a pair (a : u) puts this pair last
among all pairs of the same priority as a. The rules are:
e --activate a--> e′,  u --activate a--> u′,  a ∈ Dom(call)
⟹  e[u] --activated a--> e′[u′]

u --activate a--> u′,  a ∉ Dom(call)
⟹  e[u] --activate_error-->
An undefined state of the environment only means that a decision about the behavior of the environment
in this case is left for the implementation stage. For instance, the definition can be extended so that the
environment sends an error message and calls error-processing programs, or continues functioning,
ignoring the incorrect action.
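The ord axioms above amount to a priority-ordered insertion that skips past every pair of lower-or-equal priority, so a new task lands behind its priority peers. A direct transcription in Python; the function name `ord_insert`, the numeric-priority encoding, and the use of `None` for the terminated agent Δ are our own choices:

```python
def ord_insert(pair, queue, priority):
    """Insert a (name, agent) pair into queue following the ord axioms:
    skip heads while priority(head) <= priority(pair), then place the pair."""
    name, agent = pair
    if agent is None:                 # ord(a : Δ, r) = r — a terminated agent is dropped
        return list(queue)
    out, rest = [], list(queue)
    while rest and priority(rest[0][0]) <= priority(name):
        out.append(rest.pop(0))       # (b : v) stays ahead of (a : u)
    return out + [pair] + rest

prio = {"a": 1, "b": 1, "c": 2}.__getitem__
ready = [("a", "ua"), ("c", "uc")]
ready = ord_insert(("b", "ub"), ready, prio)   # lands after its peer "a", before "c"
```

Since scheduling takes the head of ready, the axioms alone fix only the relative order; in this numeric reading a smaller value is scheduled first, and equal-priority tasks keep FIFO order.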
6.5.1.1.3 Requirements for Terminate

    u --terminate--> u'
    ---------------------------------------------
    e[u] --terminated (e.active)--> e[schedule.Δ]
6.5.1.1.4 Requirements for Schedule
Let P(u, b, v, s) = P1 ∧ P2, where

    P1 = (e.active ≠ Nil  →  ord(e.active : u, e.ready) = (b : v, s))
    P2 = (e.active = Nil  →  u = Δ ∧ e.ready = (b : v, s))
Let r = e.ready and a = e.active; then the rules for attributes are:

    P(u, b, v, s)
    -----------------------------------
    ready : r --schedule u--> ready : s

    P(u, b, v, s)
    -------------------------------------
    active : a --schedule u--> active : b
Note that the transitions for attributes, and therefore for the environment, are highly nondeterministic,
because the parameter u is an arbitrary agent behavior. But this nondeterminism disappears in the rule for
scheduling, which restricts the possible values of u to at most one. The rules are:

    P(u', b, v, s),  e --schedule u'--> e',  u --schedule--> u'
    -----------------------------------------------------------
    e[u] --scheduled b--> e'[v]
    u --schedule--> u',  e.active = Nil,  u' ≠ Δ
    --------------------------------------------
    e[u] --schedule_error--> ⊥

    u --schedule--> Δ,  e.ready = Nil
    ---------------------------------
    e[u] --wait--> e[Δ]
Therefore, if a task has no name (which can happen if a task is initially inserted into an environment), it
can use scheduling only as its last action; otherwise it is an error. And if there is nothing to schedule, the
scheduling action is ignored.
6.5.1.1.5 Interrupts
The simplest way to introduce interrupts into our model is to hide the occurrence of interrupts and the
choice of the start of interrupt processing. Only the actions which show the start and the end of interrupt
processing are observable. The rules are:

    e --start_interrupt--> e'[v]
    -------------------------------------------------
    e[u] --start_interrupt--> e'[v; end_interrupt; u]

We have no transitions for attributes labeled by the interrupt action, so in this transition e and e' have the
same values for all attributes. The program v is an interrupt-processing routine.
    u --end_interrupt--> u'
    -----------------------------
    e[u] --end_interrupt--> e[u']
Nesting of interrupts can be of arbitrary depth. The action end_interrupt is an environment action,
but it is used by the inserted agent after an interrupt has started, to show the end of interrupt processing.
Therefore, the set of actions for an inserted agent is extended, but end_interrupt is still not an action
of an agent before its insertion into the environment.
6.5.1.1.6 Termination
When all tasks are successfully terminated, the scheduler reaches the waiting state:

    active : a --wait--> active : Nil

    e.ready = Nil,  e --wait--> e'
    ------------------------------
    e[Δ] --wait--> e'[Δ]
6.5.1.1.7 Dynamic Requirements
A state e of an environment is called initial if e.ready = e.active = Nil and the domains of the
functions e.priority and e.call are empty. Let E_0 be the set of all states reachable from the initial
states. Define E_{n+1}, n = 0, 1, ..., as the set of all states reachable from the states e[u], where e ∈ E_n and
u is an arbitrary task agent. The set E of admissible states is defined as the union E = E_0 ∪ E_1 ∪ .... Multiple
insertion rules show that the insertion function is sequential. Dynamic requirements for environment
states are as follows:
E does not contain the deadlock state 0.
There are no undefined states in E except for those which result from error actions.
Tasks of the same priority are scheduled in FIFO order, tasks of a higher priority are scheduled
first, and interrupt actions are nested like brackets.
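The scheduling discipline stated in the last requirement can be exercised on a toy model. The sketch below is our own simplification (not the semantics of the tool described in the text): it drives a ready queue with the ord-style insert and shows same-priority tasks leaving in FIFO order while higher-priority ones overtake:

```python
def insert(pair, queue, prio):
    # place pair after all entries of lower-or-equal priority (the ord axioms)
    i = 0
    while i < len(queue) and prio[queue[i][0]] <= prio[pair[0]]:
        i += 1
    return queue[:i] + [pair] + queue[i:]

def run(tasks):
    """tasks: list of (name, priority) in activation order; returns scheduling order."""
    prio = dict(tasks)
    ready, order = [], []
    for name, _ in tasks:                 # activate a: enqueue via the ord insert
        ready = insert((name, None), ready, prio)
    while ready:                          # schedule: take the head, then terminate
        (name, _), ready = ready[0], ready[1:]
        order.append(name)
    return order

# same priority -> FIFO; in this reading of the axioms a lower value is scheduled first
print(run([("t1", 2), ("t2", 1), ("t3", 2)]))  # ['t2', 't1', 't3']
```

Interrupt nesting is omitted here; in the model it would correspond to splicing v; end_interrupt; u in front of the running agent, which unwinds like matched brackets.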
6.5.1.1.8 Consistency
The only nonconstructive transition in the requirements specification of the simple scheduler is the inser-
tion of an arbitrary agent as an interrupt-processing routine. If we restrict the corresponding transitions
to a selection from some finite set (even a nondeterministic one), the requirements become executable.
To prove dynamic properties, some invariant properties of E (always statements) must first be proved.
Then, after their formalization, dynamic properties are inferred from these invariants:

    Dom(e.priority) = Dom(e.call)
    (a : u) ∈ e.ready  →  a ∈ Dom(e.priority)
    e.active ≠ Nil  →  e.active ∈ Dom(e.priority)
    e.ready is ordered by priority

In the invariants formulated above, e is assumed to be nonterminal.
6.5.1.2 Input Text to the Consistency Checker
The consistency checker accepts static requirements represented in the form of Hoare-style triples and
dynamic requirements in the form of logical formulae. Requirements include the description of typed
attributes and actions. The following input text is obtained from the description of the simple scheduler
considered above. It is statically consistent and can be used for proving dynamic properties of the scheduler.
Each requirement describes the change of a state of the environment with the inserted agent represented
as the value of the attribute active_task. The value of this attribute is the behavior of a previously
inserted agent which is currently active. The predicate active_task --> a.u is used to represent the
transition active_task --a--> u. The action axiom is needed to prove consistency for the action wait
(Code 6.7).
Code 6.7
attributes(
active: name,
priority: name -> Nat,
ready: list of (name:agent),
call: name -> agent,
active_task: agent
);
actions(a:name,u:agent,i:int)(
new_task(a:u,i),
activate a,
terminate,
schedule,
loaded a,
activated a,
activate_error,
schedule_error,
terminated a,
schedule u,
scheduled a,
wait,
start_interrupt,
end_interrupt
);
Let action axiom: Forall x((x.Delta = Delta));
Let ord Delta: Forall(a,r)(ord(a:Delta,r) = r);
Let ord: Forall(a,b,u,v,r)(
(priority b <= priority a) & (a = Delta)
-> (ord(a:u,b:v,r) = (b:v,ord(a:u,r))));
/* ------------ new_task ------------------------------ */
req new_task: Forall(a:name, (u,v):agent, i:int)(
(active_task --> new_task(a:v,i).u)
-> after(loaded a)
((active_task = u) & (priority a = i) & (call a = v)));
/* ------------ activate ------------------------------ */
req activate success: Forall(a:name,(u,v):agent, r:list of(name:agent))(
((active_task --> activate a.u) & (ready = r) &(call a = v)
& (v = Nil))
-> after(activated a)
(active_task = u & ready = ord(a:v,r)));
req activate error: Forall(a:name,u:agent)(
((active_task --> activate a.u) & (call a = Nil))
-> after activate_error
bot);
/* ------------ terminate ----------------------------- */
req terminate: Forall(a:name, u:agent)(
((active_task --> terminate.u) & (active = a))
-> after(terminated a)
(active_task = schedule));
/* ------------ schedule ------------------------------ */
req schedule success active:
Forall((u,v):agent, a:name,s:list of(name:agent))(
((active_task --> schedule.u) & (active = Nil) &
(ord(active:u,ready) = (a:v,s)))
-> after(scheduled a)
((active_task =u) & (active = a) & (ready = s)));
req schedule success not active:
Forall(v:agent, a:name,s:list of(name:agent))(
( (active_task = schedule) & (active = Nil) & (ready = (a:v,s)))
-> after(scheduled a)
((active_task =v) & (active = a) & (ready = s)));
req schedule error: Forall(u:agent)(
((active_task --> schedule.u) & (active = Nil) & (u = Delta))
-> after schedule_error
bot);
req schedule final: Forall(v:agent, b:name,s:list of(name:agent))(
((active_task --> schedule.Delta) & (ready = Nil))
-> after wait
(active_task = Delta));
/* ------------ interrupt ------------------------------ */
req start interrupt: Forall((u,v):agent)(
((active_task = u) & (interrup_process = v))
-> after start_interrupt
(active_task = (v;end_interrupt;u)));
req end interrupt: Forall(u:agent)(
(active_task --> end_interrupt.u)
-> after end_interrupt
(active_task = u));
/* ------------ termination --------------------------- */
req termination: Forall(u:agent)(
(active_task = Delta) & (ready = Nil)
-> after wait
(active_task = Delta))
/* ------------ dynamic properties -------------------- */
prove always Forall(a:name)(a in_set Dom(priority) <=> a in_set Dom(call));
prove always Forall(a:name,u:agent)(
(a:u)in_list(ready)-> a in_set Dom(priority));
prove always (active = Nil)-> active in_set Dom(priority);
prove always is_ord ready
6.5.2 Experimental Results in Various Domains
We have developed specializations for the following subject domains: sequential asynchronous environ-
ments, parallel asynchronous environments, and sequential synchronous agents. We have conducted a
number of projects in each domain to determine the effectiveness of formal requirements verification.
Figure 6.8 exhibits the performance of our provers. We show the measurements in terms of MSC
diagrams, a familiar engineering notation often used to describe embedded systems. The chart on the left
shows performance in terms of arrows, that is, communications between instances on an MSC diagram.
We can see that the performance is roughly linear in the number of arrows, up to roughly 800 arrows per
diagram. Note that a typical diagram has far fewer arrows, no more than a hundred in most cases. The
chart on the right shows that performance is linear in the number of MSC diagrams (of typical size).
Jointly, these charts indicate that the system is scalable to realistically sized applications.
6.5.2.1 OSEK
OSEK [103] is a representative example of an asynchronous sequential environment. The OSEK standard
defines an open embedded operating system for automotive electronics.
The OSEK formal model has been described as an environment for application tasks of different types,
considered as agents inserted into this environment. The actions common for agents and environment
are the services of the operating system. The system is multitasking but has only one processor and only
one task is running at any given moment and, therefore, the system is considered to be sequential. The
FIGURE 6.8 Performance of the prover in terms of MSC diagrams. (Left: proving time per arrow, in sec, versus the number of arrows, 0 to 1200; right: total proving time, in sec, versus the number of MSC diagrams, 25 to 150.)
system is asynchronous because all actions performed by tasks independently of the operating system are
not observable and so the time between two services cannot be taken into account. Static requirements
are represented by transition rules with preconditions and postconditions. The reachable states for OSEK
can be characterized by integrity conditions.
After developing the formal requirements for OSEK, the proof system was used to prove static consistency
and completeness of the requirements. Several interesting dynamic properties of the requirements
were also proven. The formalization of OSEK requirements led to the discovery of 12 errors in the
nonformal OSEK standard. For example, Section 6.7.5 of the OSEK/VDX specification [103] defines a
transition related to the current priority of a task in the case when it has a priority less than the ceiling
priority of the resource; however, no transition is defined in the case when the current priority of the task
is equal to the ceiling priority.
All these errors were documented, and the corrections have been integrated into the OSEK standard. In
the formal specification, we have covered 10 services defined by the OSEK standard and have proven the
consistency and completeness of this specification. This covers approximately 40% of the complete OSEK
standard. Moreover, we have found a number of mistakes in the other parts of the OSEK standard, which
prevented formalization of the rest of the standard's document.
Consistency and completeness of the covered parts of the standard (49 requirements) were proven
after correcting the above-mentioned defects. The proof of consistency took approximately
7 min on a Pentium III computer with 256 MB of RAM running the Red Hat Linux operating system.
6.5.2.2 RIO
The RapidIO Interconnect Protocol [112] is an example of a parallel asynchronous environment. It is
a protocol that lets a set of processor elements communicate with one another. Three layers of
abstraction are developed: the logical, transport, and physical layers.
The static requirements for RIO are standard (pre- and postconditions referring to adjacent
moments of time). But while in OSEK an action is uniquely defined by the running task, in RIO it is
generated by a nondeterministic choice of one of the processor elements, which produces an observable
action.
The formal requirements description of RIO for the logical layer (14 requirements) and the transport layer
(6 requirements) was obtained from the documentation and proved to be consistent and complete (46 sec);
the 46 requirements for the physical layer have been proven consistent in 8.5 min.
6.5.2.3 Vger
The formal requirements for the protocol used by the SC-Vger processor [113] for communicating with
other processor elements of a system via the MAXbus bus device were extracted from the documentation
of the MAXbus and from discussions with experts. Vger is a representative example of a synchronous
sequential agent inserted into a parallel environment. Vger is a deterministic automaton with binary
input/output signals and shared data available from the bus. The attributes of the system are its
input/output signals and its shared data. Originally there are no actions, and we can consider the clock signal
synchronizing the system as the only observable action. Static requirements are written using asser-
tion/deassertion conditions for output signals. Each requirement is a rule for setting a signal to a given
value (0 or 1). The precondition is a history of conditions represented in a Kleene-like algebra with time.
Several rules can be applied at the same moment. For the static consistency conditions, the preconditions
of two rules which set the same attribute to different values must never be true in the same clock interval.
There are no static completeness conditions, because we define the semantics of the requirements text so
that if there are no rules to change an output value, it remains in the same state as at the previous moment
of time. We use binary attribute symbols as predicates, and as long as there are no other predicate symbols
the system represents a propositional calculus.
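The persistence semantics just described (an output keeps its value unless some rule sets it) and the pairwise static consistency check can be sketched in a few lines. The rule set, signal names, and encoding below are our own illustration, with plain state predicates standing in for the Kleene-algebra histories:

```python
# Each rule: (signal, target value, precondition on the current state dict).
rules = [
    ("ack", 1, lambda s: s["req"] == 1),
    ("ack", 0, lambda s: s["req"] == 0),
]

def step(state, inputs):
    """One clock tick: outputs keep their value unless some rule fires."""
    s = {**state, **inputs}
    nxt = dict(state)                 # default: everything persists
    for sig, val, pre in rules:
        if pre(s):
            nxt[sig] = val
    nxt.update(inputs)
    return nxt

def statically_consistent(states):
    """Two rules setting the same signal to different values must never fire together."""
    for s in states:
        for i, (sig1, v1, p1) in enumerate(rules):
            for sig2, v2, p2 in rules[i + 1:]:
                if sig1 == sig2 and v1 != v2 and p1(s) and p2(s):
                    return False
    return True
```

Here `statically_consistent` enumerates states explicitly; the prover described in the text instead establishes the same mutual-exclusion condition symbolically, without enumeration.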
To prove statements with Kleene algebra expressions, these must first be reduced to first-order logic, that
is, to requirements with preconditions referring to one moment of time (without histories). A converter
has been developed for the automatic translation of subject domains relying on Kleene algebra and the
interval calculus notation.
The set of reachable states of Vger is not definable in first-order logic, and the proof of the consistency
condition is only a sufficient condition for consistency. A more powerful yet still sufficient condition is
the provability of consistency conditions by standard induction from static requirements. There exists a
sequence of increasingly powerful conditions which converge to the results obtained by model checking.
All 26 Vger requirements have been proven consistent (192 sec).
6.6 Conclusions and Perspectives
In this chapter, we reviewed tools and methods to ensure that the right system is developed, by which
we mean a system that matches what the customer really wants. Systems that do not match customer
requirements result in cost overruns owing to late changes of the system at best and, in the worst case,
may never be deployed. Based on the mathematical model of the theory of agents and interactions, we
developed a set of tools capable of establishing the consistency and completeness of system requirements.
Roughly speaking, if the requirements are consistent, an implementation which meets the requirements is
possible; if the requirements are complete, this implementation is defined uniquely by the requirements.
We discussed how to represent requirements specifications for formal validation and exhibited experimental
results of deploying these tools to establish the correctness of embedded software systems. This chapter
also reviewed other models of system behavior and other tools for system validation and verification.
Our experience has shown that dramatic quality improvements are possible through formal valida-
tion and verification of systems under development. In practice, deployment of these techniques
requires increased upstream development effort: thorough analysis of requirements and their capture in
specification languages results in a longer design phase. In addition, significant training and experience
are needed before substantial benefits can be achieved. Nevertheless, the improvements in quality and
the reduction of effort in later development phases warrant this investment, as application of these methods
in pilot projects has demonstrated.
References
[1] D. Harel and A. Pnueli. On the development of reactive systems. In K. Apt, Ed., Logics and Models of Concurrent Systems. NATO ASI Series, vol. 13. Springer-Verlag, pp. 477–498.
[2] Z. Manna and A. Pnueli. The Temporal Logic of Reactive and Concurrent Systems. Springer-Verlag, Heidelberg, 1992.
[3] Z. Manna and A. Pnueli. Temporal Verification of Reactive Systems: Safety. Springer-Verlag, Heidelberg, 1995.
[4] F.P. Brooks. The Mythical Man-Month: Essays on Software Engineering. Addison-Wesley, Reading,
MA, 1995.
[5] L. Lamport. Introduction to TLA. SRC Technical note 1994-001, 1994.
[6] R.J. van Glabbeek. Notes on the methodology of CCS and CSP. Theoretical Computer Science,
177: 329349, 1997.
[7] D.M.R. Park. Concurrency and automata on innite sequences. In Proceedings of the 5th GI
Conference. Lecture Notes in Computer Science, vol. 104. Springer-Verlag, Heidelberg, 1981.
[8] R. Milner. Communication and Concurrency. Prentice Hall, New York, 1989.
[9] J.V. Kapitonova and A.A. Letichevsky. On constructive mathematical descriptions of subject
domains. Cybernetics, 4: 408418, 1988.
[10] A.A. Letichevsky and D.R. Gilbert. Towards an implementation theory of nondeterministic con-
current languages. Second Workshop of the INTAS-93-1702 Project: Efcient Symbolic Computing,
St Petersburg, October 1996.
[11] A.A. Letichevsky and D.R. Gilbert. A general theory of action languages. Cybernetics and System
Analysis, 1: 1231, 1998.
[12] R. Milner. The polyadic -calculus: a tutorial. In F.L. Bauer, W. Brauer, and H. Schwichtenberg,
Eds., Logic and Algebra of Specication. Springer-Verlag, Heidelberg, 1993, pp. 203246.
[13] C.A.R. Hoare. Communicating Sequential Processes. Prentice Hall, New York, 1985.
[14] J.A. Bergstra and J.W. Klop. Process algebra for synchronous communication. Information and
Control, 60: 109137, 1984.
[15] L. Lamport. The temporal logic of actions. ACM Transactions on Programming Languages and
Systems, 16(3): 872923, 1994.
[16] A. Pnueli. The temporal logic of programs. In Proceedings of the 18th Annual Symposium on the
Foundations of Computer Science, November 1977, pp. 4652.
[17] E. Emerson and J. Halpern. Decision procedures and expressiveness in the temporal logic of
branching time. Journal of Computer and System Science, 30: 124, 1985.
[18] M.J. Fisher and R.E. Ladner. Propositional modal logic of programs. In Proceedings of the 9th
ACM Annual Symposium on Theory of Computing, pp. 286294.
[19] E. Emerson. Temporal and modal logic. InJ. vanLeeuwen, Ed., Handbook of Theoretical Computer
Science. MIT Press, Cambridge, MA, 1991, pp. 9971072.
[20] R. Bryant. Graph-based algorithms for Boolean function manipulation. IEEE Transactions on
Computers, 35: 677691.
[21] J. Burch, E. Clarke, K. McMillan, D. Dill, and L. Hwang. Symbolic model checking: 1020 states
and beyond. Information and Computation, 98: 142170, 1992.
[22] E. Clarke and E. Emerson. Synthesis of synchronization skeletons for branching time temporal
logic. InThe Workshop on Logic of Programs. Lecture Notes in Computer Science, vol. 131. Springer-
Verlag, Heidelberg, 1981, pp. 128143.
[23] J. Quielle and J. Sifakis. Specication and verication of concurrent systems in CESAR.
In Proceedings of the 5th International Symposium on Programming, pp. 142158.
[24] L. Lamport. What goodis temporal logic? InR. Mason, Ed., InformationProcessing-83: Proceedings
of the 9th IFIP World Computer Congress, Elsevier, 1983, pp. 657668.
[25] M. Abadi and L. Lamport. Composing specications. ACM Transactions on Programming
Languages and Systems, 15: 73132, 1993.
[26] W. Thomas. Automata on innite objects. In J. van Leeuwen, Ed., Handbook of Theoretical
Computer Science. MIT Press, Cambridge, MA, 1991, pp. 131191.
[27] A.P. Sistla, M. Vardi, and P. Wolper. The complementation problem for Bchi automata with
application to temporal logic. Theoretical Computer Science, 49: 217237, 1987.
[28] M. Vardi and P. Wolper. An automata-theoretic approach to automatic program verication.
In Proceedings of the 1st IEEE Symposium on Logic in Computer Science, pp. 332344.
[29] H. Rodgers. Theory of Recursive Functions and Effective Computability. McGraw-Hill, New York,
1967.
[30] Y. Gurevich. Evolving algebras: an attempt to discover semantics. In G. Rozenberg and A. Salomaa, Eds., Current Trends in Theoretical Computer Science. World Scientific, River Edge, NJ, 1993, pp. 266–292.
[31] Y. Gurevich. Evolving algebras 1993: Lipari guide. In E. Börger, Ed., Specification and Validation Methods. Oxford University Press, 1995, pp. 9–36.
[32] J. Meseguer. Conditional rewriting logic as a unified model of concurrency. Theoretical Computer Science, 96: 73–155, 1992.
[33] P. Lincoln, N. Martí-Oliet, and J. Meseguer. Specification, transformation and programming of concurrent systems in rewriting logic. In G. Blelloch et al., Eds., Proceedings of the DIMACS Workshop on Specification of Parallel Algorithms. American Mathematical Society, Providence, 1994.
[34] M. Clavel. Reflection in General Logics and Rewriting Logic with Application to the Maude Language. Ph.D. thesis, University of Navarra, 1998.
[35] M. Clavel and J. Meseguer. Axiomatizing reflective logics and languages. In G. Kiczales, Ed., Reflection '96, 1996, pp. 263–288.
[36] M. Clavel, F. Durán, S. Eker, P. Lincoln, N. Martí-Oliet, J. Meseguer, and J. Quesada. Towards Maude 2.0. In K. Futatsugi, Ed., Proceedings of the 3rd International Workshop on Rewriting Logic and its Applications. Electronic Notes in Theoretical Computer Science, vol. 36. Elsevier, 2000.
[37] J. Meseguer and P. Lincoln. Introduction in Maude. Technical report, SRI International, 1998.
[38] J. Brackett. Software Requirements. Technical report SEI-CM-19-1.2, Software Engineering Institute, 1990.
[39] B. Boehm. Industrial software metrics top 10 list. IEEE Software, 4: 84–85, 1987.
[40] B. Boehm. Software Engineering Economics. Prentice Hall, New York, 1981.
[41] J.C. Kelly, S.S. Joseph, and H. Jonathan. An analysis of defect densities found during software inspections. Journal of Systems Software, 17: 111–117, 1992.
[42] R. Lutz. Analyzing requirements errors in safety-critical embedded systems. In IEEE International Symposium on Requirements Engineering, San Diego, 1993, pp. 126–133.
[43] T. DeMarco. Structured Analysis and System Specification. Yourdon Press, New York, 1979.
[44] C.V. Ramamoorthy, A. Prakash, W. Tsai, and Y. Usuda. Software engineering: problems and perspectives. Computer, 17: 191–209, 1984.
[45] M.E. Fagan. Design and code inspections to reduce errors in program development. IBM Systems Journal, 15: 182–211, 1976.
[46] M.E. Fagan. Advances in software inspection. IEEE Transactions on Software Engineering, 12: 744–751, 1986.
[47] J. Rushby. Formal Methods and their Role in the Certification of Critical Systems. Technical report CSL-95-1, March 1995.
[48] C.B. Jones. Systematic Software Development Using VDM. Prentice Hall, New York, 1990.
[49] J.M. Spivey. Understanding Z: A Specification Language and its Formal Semantics. Cambridge University Press, London, 1988.
[50] J.-R. Abrial. The B-Book: Assigning Programs to Meanings. Cambridge University Press, London, 1996.
[51] International Organization for Standardization, Information Processing Systems, Open Systems Interconnection. LOTOS: A Formal Description Technique Based on the Temporal Ordering of Observational Behavior. ISO Standard 8807. Geneva, 1988.
[52] R.S. Boyer and J.S. Moore. A Computational Logic Handbook. Academic Press, New York, 1988.
[53] M.J.C. Gordon and T.F. Melham, Eds., Introduction to HOL. Cambridge University Press, London, 1993.
[54] D. Craigen, S. Kromodimoeljo, I. Meisels, B. Pase, and M. Saaltink. EVES: an overview. In VDM '91: Formal Software Development Methods. Lecture Notes in Computer Science, vol. 551. Springer-Verlag, Heidelberg, 1991, pp. 389–405.
[55] M. Saaltink, S. Kromodimoeljo, B. Pase, D. Craigen, and I. Meisels. Data abstraction in EVES. In Formal Methods Europe '93, Odense, April 1993.
[56] S. Owre, N. Shankar, and J.M. Rushby. User Guide for the PVS Specification and Verification System. Technical report, SRI International, 1996.
[57] E. Clarke, O. Grumberg, and D. Peled. Model Checking. MIT Press, Cambridge, MA, 2000.
[58] P. Godefroid. VeriSoft: a tool for the automatic analysis of concurrent reactive software. In Proceedings of the 9th Conference on Computer Aided Verification. Lecture Notes in Computer Science, vol. 1254. Springer-Verlag, Heidelberg, 1997, pp. 476–479.
[59] J. Burch, E. Clarke, D. Long, K. McMillan, and D. Dill. Symbolic model checking for sequential circuit verification. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 13(4): 401–424, 1994.
[60] G. Holzmann. The SPIN Model Checker, Primer and Reference Manual. Addison-Wesley, Reading, MA, 2004.
[61] S.J. Garland and J.V. Guttag. A Guide to LP, the Larch Prover. Technical report, DEC Systems Research Center Report 82, 1991.
[62] J. Crow, S. Owre, J. Rushby, N. Shankar, and M. Srivas. A tutorial introduction to PVS. In WIFT '95: Workshop on Industrial-Strength Formal Specification Techniques, Boca Raton, FL, April 1995.
[63] S. Rajan, N. Shankar, and M. Srivas. An integration of model checking with automated proof checking. In Proceedings of the 7th International Conference on Computer Aided Verification, CAV '95. Lecture Notes in Computer Science, vol. 939. Springer-Verlag, Heidelberg, 1995, pp. 84–97.
[64] B. Berard, Ed., Systems and Software Verification: Model-Checking Techniques and Tools. Springer-Verlag, Heidelberg, 2001.
[65] International Telecommunications Union. Recommendation Z.120: Message Sequence Charts. Geneva, 2000.
[66] Object Management Group. Unified Modeling Language Specification, 2.0. 2003.
[67] J. Hooman. Towards formal support for UML-based development of embedded systems. In Proceedings of the 3rd PROGRESS Workshop on Embedded Systems, 2002, pp. 71–76.
[68] M. Bozga, J. Fernandez, L. Ghirvu, S. Graf, J.P. Krimm, L. Mounier, and J. Sifakis. IF: an intermediate representation for SDL and its applications. In Proceedings of the 9th SDL Forum, Montreal, June 1999.
[69] F. Regensburger and A. Barnard. Formal verification of SDL systems at the Siemens mobile phone department. In Tools and Algorithms for the Construction and Analysis of Systems, TACAS '98. Lecture Notes in Computer Science, vol. 1384. Springer-Verlag, Heidelberg, 1998, pp. 439–455.
[70] O. Shumsky and L.J. Henschen. Developing a framework for verification, simulation and testing of SDL specifications. In M. Kaufmann and J.S. Moore, Eds., Proceedings of the ACL2 Workshop 2000, Austin, 2000.
[71] P. Baker, P. Bristow, C. Jervis, D. King, and B. Mitchell. Automatic generation of conformance tests from message sequence charts. In Proceedings of the 3rd SAM (SDL And MSC) Workshop, Telecommunication and Beyond, Aberystwyth. Lecture Notes in Computer Science, vol. 2599, 2003.
[72] B. Mitchell, R. Thomson, and C. Jervis. Phase automaton for requirements scenarios. In Proceedings of Feature Interactions in Telecommunications and Software Systems, vol. VII, 2003, pp. 77–87.
[73] L. Philipson and L. Hogskola. Survey compares formal verification tools. EETimes, 2001. http://www.eetimes.com/story/OEG20011128S0037
[74] S. Yovine. Kronos: a verification tool for real-time systems. International Journal of Software Tools for Technology Transfer, 1: 123–133, 1997.
[75] P. Pettersson and K. Larsen. UPPAAL2k. Bulletin of the European Association for Theoretical Computer Science, 70: 40–44, 2000.
[76] D. Bjorner and C.B. Jones, Eds., The Vienna development method: the meta-language. Lecture Notes in Computer Science, vol. 61. Springer-Verlag, Heidelberg, 1978.
[77] Y. Ledru and P.-Y. Schobbens. Applying VDM to large developments. ACM SIGSOFT Software Engineering Notes, 15: 55–58, 1990.
[78] A. Puccetti and J.Y. Tixadou. Application of VDM-SL to the development of the SPOT4 programming messages generator. FM '99: World Congress on Formal Methods, VDM Workshop, Toulouse, 1999.
[79] J.C. Bicarregui and B. Ritchie. Reasoning about VDM developments using the VDM support tool in Mural. In VDM '91: Formal Software Development Methods. Lecture Notes in Computer Science, vol. 551. Springer-Verlag, Heidelberg, 1991, pp. 371–388.
[80] A. Diller. Z: An Introduction to Formal Methods. John Wiley & Sons, New York, 1990.
[81] W. Grieskamp, M. Heisel, and H. Dorr. Specifying embedded systems with statecharts and Z: an agenda for cyclic software components. In Proceedings of Formal Aspects of Software Engineering, FASE '98. Lecture Notes in Computer Science, vol. 1382. Springer-Verlag, Heidelberg, 1998.
[82] D. Bert, S. Boulmé, M.-L. Potet, A. Requet, and L. Voisin. Adaptable translator of B specifications to embedded C programs. In Formal Methods 2003. Lecture Notes in Computer Science, vol. 2805. Springer-Verlag, Heidelberg, 2003, pp. 94–113.
[83] R. Milne. The Semantic Foundations of the RAISE Specification Language. RAISE report REM/11, STC Technology, 1990.
[84] M. Nielsen, K. Havelund, K. Wagner, and C. George. The RAISE language, methods, and tools. Formal Aspects of Computing, 1: 85–114, 1989.
[85] T. Mossakowski, Kolyang, and B. Krieg-Brückner. Static semantic analysis and theorem proving for CASL. In F. Parisi Presicce, Ed., Proceedings of the 12th Workshop on Algebraic Development Techniques. Lecture Notes in Computer Science, vol. 1376. Springer-Verlag, Heidelberg, 1998, pp. 333–348.
[86] P.D. Mosses. CoFI: the common framework initiative for algebraic specification and development. In TAPSOFT '97: Theory and Practice of Software Development. Lecture Notes in Computer Science, vol. 1214. Springer-Verlag, Heidelberg, 1997, pp. 115–137.
[87] B. Krieg-Brückner, J. Peleska, E. Olderog, and A. Baer. The UniForM workbench, a universal development environment for formal methods. In J. Wing, J. Woodcock, and J. Davies, Eds., FM '99: Formal Methods. Lecture Notes in Computer Science, vol. 1709. Springer-Verlag, Heidelberg, 1999, pp. 1186–1205.
[88] C.L. Heitmeyer, J. Kirby, and B. Labaw. Tools for formal specification, verification and validation of requirements. In Proceedings of the 12th Annual Conference on Computer Assurance, Gaithersburg, June 1997.
[89] S. Easterbrook, R. Lutz, R. Covington, Y. Ampo, and D. Hamilton. Experiences using lightweight formal methods for requirements modeling. IEEE Transactions on Software Engineering, 24: 4–14, 1998.
[90] L.C. Paulson. Isabelle: A Generic Theorem Prover. Lecture Notes in Computer Science, vol. 828. Springer-Verlag, Heidelberg, 1994.
[91] B.J. Krämer and N. Völker. A highly dependable computer architecture for safety-critical control applications. Real-Time Systems Journal, 13: 237–251, 1997.
[92] D. Muthiayen. Real-Time Reactive System Development: A Formal Approach Based on UML and PVS. Technical report, Concordia University, 2000.
[93] P.B. Jackson. The Nuprl Proof Development System, Reference Manual and User Guide. Cornell University, Ithaca, NY, 1994.
[94] L. Cortes, P. Eles, and Z. Peng. Formal coverification of embedded systems using model checking. In Proceedings of the 26th EUROMICRO Conference, Maastricht, September 2000, pp. 106–113.
[95] G. Holzmann. Design and Validation of Computer Protocols. Prentice Hall, New York, 1991.
[96] G. Holzmann. The model checker SPIN. IEEE Transactions on Software Engineering, 23: 279–295, 1997.
[97] R. Kurshan. Automata-Theoretic Verification of Coordinating Processes. Princeton University Press, Princeton, NJ, 1993.
[98] R. de Simone and M. Lara de Souza. Using partial-order methods for the verification of behavioural equivalences. In G. von Bochmann, R. Dssouli, and O. Rafiq, Eds., Formal Description Techniques VIII, 1995.
2006 by Taylor & Francis Group, LLC
System Validation 6-55
[99] J. Fernandez, H. Garavel, A. Kerbrat, R. Mateescu, L. Mounier, and M. Sighireanu. CADP:
a protocol validation and verication toolbox. In Proceedings of the 8th Conference on Computer-
Aided Verication. New Brunswick, August 1996, pp. 437440.
[100] D. Dill, A. Drexler, A. Hu, and C. Yang. Protocol verication as a hardware design aid. In IEEE
International Conference on Computer Design: VLSI in Computers and Processors. October 1992,
pp. 522525.
[101] E. Astegiano and G. Reggio. Formalism and method. Theoretical Computer Science, 236:
334, 2000.
[102] Z. Chaochen, C.A.R. Hoare, and A.P. Ravn. Acalculus of durations. Information Processing Letter,
40: 269276, 1991.
[103] OSEK Group. OSEK/VDX. Operating System.Version 2.1. May 2000.
[104] S.N. Baranov, V. Kotlyarov, J. Kapitonova, A. Letichevsky, and V. Volkov. Requirement capturing
and 3CR approach. In Proceedings of the 26th International Computer Software and Applications
Conference, Oxford, 2002, pp. 279283.
[105] J.V. Kapitonova, A.A. Letichevsky, and S.V. Konozenko. Computations in APS. Theoretical
Computer Science, 119: 145171, 1993.
[106] D.R. Gilbert and A.A. Letichevsky. A universal interpreter for nondeterministic concurrent pro-
gramming languages. In M. Gabbrielli, Ed., Fifth Compulog Network Area Meeting on Language
Design and Semantic Analysis Methods, September 1996.
[107] T. Valkevych, D.R. Gilbert, and A.A. Letichevsky. A generic workbench for modelling the
behaviour of concurrent and probabilistic systems. In Workshop on Tool Support for System
Specication, Development and Verication, TOOLS98, Malente, June 1998.
[108] A.A. Letichevsky, J.V. Kapitonova, and V.A. Volkov. Deductive tools in algebraic programming
system. Cybernetics and System Analysis, 1: 1227, 2000.
[109] A. Degtyarev, A. Lyaletski, andM. Morokhovets. Evidence algorithmandsequent logical inference
search. In H. Ganzinger, D. McAllester, and A. Voronkov, Eds., Logic for Programming and
Automated Reasoning (LPAR99). Lecture Notes in Computer Science, vol. 1705. Springer-Verlag,
1999, pp. 4461.
[110] V.M. Glushkov, J.V. Kapitonova, A.A. Letichevsky, K.P. Vershinin, and N.P. Malevanyi. Con-
struction of a practical formal language for mathematical theories. Cybernetics, 5: 730739,
1972.
[111] V.M. Glushkov. On problems of automata theory and articial intelligence. Cybernetics, 5: 313,
1970.
[112] Motorola. RIO Interconnect Globally Shared Memory Logical Specication. Motorola, 1999.
[113] Motorola. SC-Vger Microprocessor Implementation Denition. Motorola, 1997.
[114] S. Abramsky. A domain equation for bisimulation. Information and Computation, 92:
161218, 1991.
[115] R. Alur and D. Dill. A theory of timed automata. Theoretical Computer Science, 126:
183235, 1994.
[116] S.N. Baranov, C. Jervis, V. Kotlyarov, A. Letichevsky, and T. Weigert. Leveraging UML to deliver
correct telecom applications. In L. Lavagno, G. Martin, and B. Selic, Eds., UML for Real: Design
of Embedded Real-Time Systems. Kluwer Academic Publishers, Amsterdam, 2003.
[117] J. Bicarregui, T. Dimitrakos, B. Matthews, T. Maibaum, K. Lano, and B. Ritchie. The VDM+B
project: objectives and progress. In World Congress on Formal Methods in the Development of
Computing Systems. Toulouse, September 1999.
[118] G. Booch, J. Rumbaugh, and I. Jacobson. Unied Modeling Language User Guide. Addison-Wesley,
Reading, MA, 1997.
[119] S. Chandra, P. Godefroid, and C. Palm. Software model checking in practice: an industrial
case study. In Proceedings of the International Conference on Software Engineering, Orlando,
May 2002.
2006 by Taylor & Francis Group, LLC
6-56 Embedded Systems Handbook
[120] E. Clarke, I. Draghicescu, and R. Kurshan. A Unied Approach for Showing Language Con-
tainment and Equivalence between Various Types of Omega-Automata. Technical report,
Carnegie-Mellon University, 1989.
[121] F. VanDewerker and S. Booth. Requirements Consistency ABasis for DesignQuality. Technical
report, Ascent Logic, 1998.
[122] E. Felt, G. York, R. Brayton, and A. Vincentelli. Dynamic variable reordering for BDD
minimization. In Proceedings of the EuroDAC, 1993, pp. 130135.
[123] M. Fitting. A Kripke-Kleene semantics for logic programs. Journal of Logic Programming,
2: 295312, 1985.
[124] I. Graham. Migrating to Object Technology. Addison-Wesley, Reading, MA, 1995.
[125] Green Mountain Computing Systems. Green Mountain VHDL Tutorial, 1995.
[126] International Telecommunications Union. RecommendationZ.100 SpecicationandDescrip-
tion Language. Geneva, 1999.
[127] B. Jacobs. Objects and classes, coalgebraically. In B. Freitag, C.B. Jones, C. Lengauer,
and H.-J. Schek, Eds., Object-Orientation with Parallelism and Persistence. Kluwer Academic
Publishers, 1996, pp. 83101.
[128] I. Jacobson. Object-Oriented Software Engineering, A Use Case Driven Approach. Addison-Wesley,
Reading, MA, 1992.
[129] N.D. Jones, C. Gomard, and P. Sestoft. Partial Evaluation and Automatic Program Generation.
Prentice Hall, New York, 1993.
[130] J.V. Kapitonova, T.P. Marianovich, and A.A. Mishchenko. Automated design and simulation of
computer systems components. Cybernetics and System Analysis, 6: 828840, 1997.
[131] M. Kaufmann and J.S. Moore. ACL2: an industrial strength version of NQTHM. In Proceedings
of the 11th Annual Conference on Computer Assurance (COMPASS96), June 1996, pp. 2334.
[132] S. Kripke. Semantical considerations on modal logic. Acta Philosophica Fennica, 16: 8394, 1963.
[133] J. van Leeuwen, Ed., Handbook of Theoretical Computer Science. MIT Press, Cambridge,
MA, 1991.
[134] A.A. Letichevsky, and J.V. Kapitonova. Mathematical information environment. In Proceedings
of the 2nd International THEOREMA Workshop, Linz, June 1998, pp. 151157.
[135] A.A. Letichevsky andD.R. Gilbert. Agents andenvironments. InProceedings of the 1st International
Scientic and Practical Conference on Programming, Kiev, 1998.
[136] A.A. Letichevsky and D.R. Gilbert. A model for interaction of agents and environments. In
Selected Papers from the 14th International Workshop on Recent Trends in Algebraic Development
Techniques. Lecture Notes in Computer Science. vol. 1827, 2004, pp. 311328.
[137] P. Lindsay. On transferring VDM verication techniques to Z. In Proceedings of Formal Methods
Europe FME94, Barcelona, October 1994.
[138] W. McCune. Otter 3.0 Reference Manual and Guide. Technical report, Argonne National
Laboratory Report ANL-94, 1994.
[139] K. McMillan. Symbolic Model Checking. Kluwer Academic Publishers, Dordrecht, 1993.
[140] M. Morockovets and A. Luzhnykh. Representing mathematical texts in a formalized natural
like language. In Proceedings of the 2nd International THEOREMA Workshop, Linz, June 1998,
pp. 157160.
[141] T. Nipkow, L. Paulson, and Markus Wenzel. Isabelle/HOL A Proof Assistant for Higher-Order
Logic. Lecture Notes in Computer Science, vol. 2283. Springer-Verlag, Heidelberg, 2002.
[142] S. Owre, J.M. Rushby, and N. Shankar. A prototype verication system. In D. Kapur, Ed., Pro-
ceedings of the 11th International Conference on Automated Deduction (CADE). Lecture Notes in
Articial Intelligence, vol. 601. Springer-Verlag, Heidelberg, 1992, pp. 748752.
[143] G. Plotkin. A Structured Approach to Operational Semantics. Technical report, DAIMI FN-19,
Aarhus University, 1981.
[144] K.S. Rubin and A. Goldberg. Object behavior analysis. Communications of the ACM, 35:
4862, 1992.
2006 by Taylor & Francis Group, LLC
System Validation 6-57
[145] R. Rudell. Dynamic variable reordering for ordered binary decision diagrams. In Proceedings of
the IEEE/ACM ICCAD93, 1993, pp. 4247.
[146] J. Rushby. Mechanized formal methods: where next? In J. Wing and J. Woodcock, Eds., FM99: The
World Congress in Formal Methods. Lecture Notes in Computer Science, vol. 1708. Springer-Verlag,
Heiderberg, 1999, pp. 4851.
[147] J. Rushby, S. Owre, and N. Shankar. Subtypes for specications: predicate subtypes in PVS. IEEE
Transactions on Software Engineering, 24: 709720, 1998.
[148] M. Saeki, H. Horai, and H. Enomoto. Software development process from natural language
specication. In International Conference on Software Engineering. Pittsburgh, March 1989,
pp. 6473.
[149] J. Tsai and T. Weigert. Knowledge-Based Software Development for Real-Time Distributed Systems.
World Scientic Publishers, Singapore, 1993.
[150] M. Vardi. Verication of concurrent programs the automata-theoretic framework. In Proceed-
ings of the 2nd IEEE Symposium on Logic in Computer Science, pp. 167176.
[151] T. Weigert and J. Tsai. A logic-based requirements language for the specication and analysis of
real-time systems. In Proceedings of the 2nd Conference on Object-Oriented Real-Time Dependable
Systems, Laguna Beach, 1996, pp. 816.
2006 by Taylor & Francis Group, LLC
Design and Verification
Languages
7 Languages for Embedded Systems
Stephen A. Edwards
8 The Synchronous Hypothesis and Synchronous Languages
Dumitru Potop-Butucaru, Robert de Simone, and Jean-Pierre Talpin
9 Introduction to UML and the Modeling of Embedded Systems
Øystein Haugen, Birger Møller-Pedersen, and Thomas Weigert
10 Verification Languages
Aarti Gupta, Ali Alphan Bayazit, and Yogesh Mahajan
7
Languages for
Embedded Systems
Stephen A. Edwards
Columbia University
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-1
7.2 Software Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-2
Assembly Languages • The C Language • C++ • Java •
Real-Time Operating Systems
7.3 Hardware Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-8
Verilog • VHDL
7.4 Dataflow Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-12
Kahn Process Networks • Synchronous Dataflow
7.5 Hybrid Languages . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-14
Esterel • SDL • SystemC
7.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-18
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7-18
7.1 Introduction
An embedded system is a computer masquerading as a non-computer that must perform a small set of
tasks cheaply and efficiently. A typical system might have communication, signal processing, and user
interface tasks to perform.
Because the tasks must solve diverse problems, a language general-purpose enough to solve them all
would be difficult to write, analyze, and compile. Instead, a variety of languages have evolved, each best
suited to a particular problem domain. The most obvious divide is between languages for software and
hardware, but there are others. For example, a language for signal processing is often more convenient for
a particular problem than, say, assembly, but might be poor for control-dominated behavior.
This chapter describes popular hardware, software, dataflow, and hybrid languages, each of which excels
at certain problems. Dataflow languages are good for signal processing, and hybrid languages combine ideas
from the other three classes.
Due to space limitations, this chapter describes only the main features of each language. The author's book on the
subject [1] provides many more details on all of these languages.
Some of this chapter originally appeared in the Online Symposium for Electrical Engineers (OSEE).
7.2 Software Languages
Software languages describe sequences of instructions for a processor to execute (Table 7.1). As such, most
consist of sequences of imperative instructions that communicate through memory: an array of numbers
that hold their values until changed.
Each machine instruction typically does little more than, say, add two numbers, so high-level languages
aim to specify many instructions concisely and intuitively. Arithmetic expressions are typical: coding an
expression such as ax^2 + bx + c in machine code is straightforward, tedious, and best done by a compiler.
The C language provides such expressions, control-flow constructs such as loops and conditionals, and
recursive functions. The C++ language adds classes as a way to build new data types, templates for
polymorphic code, exceptions for error handling, and a standard library of common data structures.
Java is a still higher-level language that provides automatic garbage collection, threads, and monitors to
synchronize them.
7.2.1 Assembly Languages
An assembly language program (Figure 7.1) is a list of processor instructions written in a symbolic, human-
readable form. Each instruction consists of an operation such as addition along with some operands. For
example, add r5,r2,r4 might add the contents of registers r2 and r4 and write the result to r5. Such
arithmetic instructions are executed in order, but branch instructions can perform conditionals and loops
by changing the processor's program counter, the address of the instruction being executed.
A processor's assembly language is defined by its opcodes, addressing modes, registers, and memories.
The opcode distinguishes, say, addition from conditional branch, and an addressing mode defines how and
where data is gathered and stored (e.g., from a register or from a particular memory location). Registers
can be thought of as small, fast, easy-to-access pieces of memory.
There are roughly four categories of modern assembly languages (Table 7.2). The oldest are those for the
so-called complex instruction set computers, or CISC. These are characterized by a rich set of instructions
and addressing modes. For example, a single instruction in Intel's x86 family, a typical CISC processor,
can add the contents of a register to a memory location whose address is the sum of two other registers
and a constant offset. Such instruction sets are usually convenient for human programmers, who are
generally fairly skilled at using a heterogeneous set of tools, and the code itself is usually quite compact.
Figure 7.1(a) illustrates a small program in x86 assembly.
By contrast, reduced instruction set computers (RISC) tend to have fewer instructions and much
simpler addressing modes. The philosophy is that while you generally need more RISC instructions to
accomplish something, it is easier for a processor to execute them because it does not need to deal with
the complex cases, and easier for a compiler to produce them because they are simpler and more uniform.
Figure 7.1(b) illustrates a small program in SPARC assembly.
TABLE 7.1 Software Language Features Compared
                          C     C++   Java
Expressions               •     •     •
Control-flow              •     •     •
Recursive functions       •     •     •
Exceptions                      •     •
Classes and inheritance         •     •
Templates                       •
Namespaces                      •     •
Multiple inheritance            •     ◦
Threads and locks                     •
Garbage collection                    •
Note: •, full support; ◦, partial support.
jmp L2
L1:
movl %ebx, %eax
movl %ecx, %ebx
L2:
xorl %edx, %edx
divl %ebx
movl %edx, %ecx
testl %ecx, %ecx
jne L1
(a) mov %i0, %o1
b .LL3
mov %i1, %i0
.LL5:
mov %o0, %i0
.LL3:
mov %o1, %o0
call .rem, 0
mov %i0, %o1
cmp %o0, 0
bne .LL5
mov %i0, %o1
(b)
FIGURE 7.1 Euclid's algorithm in (a) i386 assembly (CISC) and (b) SPARC assembly (RISC). SPARC has more registers
and must call a routine to compute the remainder (the i386 has a division instruction). The complex addressing modes
of the i386 are not shown in this example.
TABLE 7.2 Typical Modern Processor Architectures
CISC RISC DSP Microcontroller
x86 SPARC TMS320 8051
68000 MIPS DSP56000 PIC
ARM ASDSP-21xx AVR
move #samples, r0
move #coeffs, r4
move #n-1, m0
move m0, m4
movep y:input, x:(r0)
clr a x:(r0)+, x0 y:(r4)+, y0
rep #n-1
mac x0,y0,a x:(r0)+, x0 y:(r4)+, y0
macr x0,y0,a (r0)-
movep a, y:output
(a)
START:
MOV SP, #030H
ACALL INITIALIZE
ORL P1,#0FFH
SETB P3.5
LOOP:
CLR P3.4
SETB P3.3
SETB P3.4
WAIT:
JB P3.5, WAIT
CLR P3.3
MOV A,P1
ACALL SEND
SETB P3.3
AJMP LOOP
(b)
FIGURE 7.2 (a) A finite impulse response filter in DSP56001 assembly. The mac instruction (multiply and accumu-
late) does most of the work, multiplying registers X0 and Y0, adding the result to accumulator A, fetching the next
sample and coefficient from memory, and updating circular buffer pointers R0 and R4. The rep instruction repeats the
mac instruction in a zero-overhead loop. (b) Writing to a parallel port in 8051 microcontroller assembly. This code
takes advantage of the 8051's ability to operate on single bits.
The third category of assembly languages arises from more specialized processor architectures such as
digital signal processors (DSPs) and very long instruction word processors (VLIWs). The operations in
these instruction sets are simple like those in RISC processors (e.g., add two registers), but they tend to
be very irregular (only certain registers may be used with certain operations) and support a much higher
degree of instruction-level parallelism. For example, Motorola's DSP56001 can, in a single instruction,
multiply two registers, add the result to a third, load two registers from memory, and increase two circular
buffer pointers. However, the instruction severely limits which registers (and even which memory) it may
use. Figure 7.2(a) shows a filter implemented in 56001 assembly.
The fourth category includes instruction sets on small (4- and 8-bit) microcontrollers. In some sense,
these combine the worst of all worlds: there are few instructions and each cannot do much, much like
a RISC processor, and there are also significant restrictions on which registers can be used when, much
like a CISC processor. The main advantage of such instruction sets is that they can be implemented very
cheaply. Figure 7.2(b) shows a routine that writes to a parallel port in 8051 assembly.
7.2.2 The C Language
C is currently the most popular language for embedded system programming. C compilers exist for
virtually every general-purpose processor, from the lowliest 4-bit microcontroller to the most powerful
64-bit processor for compute servers.
C was originally designed by Dennis Ritchie [2] as an implementation language for the Unix operating
system being developed at Bell Labs for a 24K DEC PDP-11. Because the language was designed for
systems programming, it provides very direct access to the processor through such constructs as untyped
pointers and bit-manipulation operators, things appreciated today by embedded systems programmers.
Unfortunately, the language also has many awkward aspects, such as the need to define everything before
it is used, that are holdovers from the cramped execution environment in which it was first implemented.
A C program (Figure 7.3) contains functions built from arithmetic expressions structured with loops
and conditionals. Instructions in a C program run sequentially, but control-flow constructs such as loops
and conditionals can affect the order in which instructions execute. When control reaches a function call in
an expression, control is passed to the called function, which runs until it produces a result, and control
returns to continue evaluating the expression that called the function.
C derives its types from those a processor manipulates directly: signed and unsigned integers ranging
from bytes to words, floating-point numbers, and pointers. These can be further aggregated into arrays
and structures, groups of named fields.
C programs use three types of memory. Space for global data is allocated when the program is compiled,
the stack stores automatic variables allocated and released when their function is called and returns, and
the heap supplies arbitrarily-sized regions of memory that can be deallocated in any order.
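As a sketch, the three kinds of memory look like the following C-style fragment (the names here are illustrative, not taken from the text):

```cpp
#include <cstdlib>

int error_count = 0;          // global: space allocated when the program is compiled/loaded

int square(int x) {           // x and y are automatic variables on the stack,
    int y = x * x;            // allocated when square is called
    return y;                 // and released when it returns
}

int *make_buffer(int n) {     // heap: an arbitrarily-sized region that survives
    return (int *)std::malloc(n * sizeof(int));   // until explicitly deallocated
}
```

A caller must eventually release the heap block with std::free; global and stack storage need no explicit management.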
The C language is an ISO standard, but most people consult the book by Kernighan and Ritchie [3].
C succeeds because it can be compiled into very efficient code and because it allows the programmer
almost arbitrarily low-level access to the processor when necessary. As a result, virtually every function can
be written in C (exceptions include those that must manipulate specific processor registers) and can be
expected to be fairly efficient. C's simple execution model also makes it fairly easy to estimate the efficiency
of a piece of code and improve it if necessary.
#include <stdio.h>
#include <string.h>

int main(int argc, char *argv[])
{
    char *c;
    while (++argv, --argc > 0) {
        c = argv[0] + strlen(argv[0]);
        while (--c >= argv[0])
            putchar(*c);
        putchar('\n');
    }
    return 0;
}
FIGURE 7.3 A C program that prints each of its arguments backwards. The outermost while loop iterates through
the arguments (count in argc, array of strings in argv), while the inner loop starts a pointer at the end of the current
argument and walks it backwards, printing each character along the way. The ++ and -- prefixes increment or
decrement the variable they are attached to before returning its value.
While C compilers for workstation-class machines usually conform closely to the ANSI/ISO C standard,
C compilers for microcontrollers are often much less standard. For example, they often omit support for
floating-point arithmetic and certain library functions. Many also provide language extensions that, while
often very convenient for the hardware for which they were designed, can make porting the code to a
different environment very difficult.
7.2.3 C++
C++ (Figure 7.4) [4] extends C with structuring mechanisms for big programs: user-defined data types, a
way to reuse code with different types, namespaces to group objects and avoid accidental name collisions
when program pieces are assembled, and exceptions to handle errors. The C++ standard library includes a
collection of efficient polymorphic data types such as arrays, trees, and strings, for which the compiler generates
custom implementations.
A class defines a new data type by specifying its representation and the operations that may access
and modify it. Classes may be defined by inheritance, which extends and modifies existing classes. For
example, a rectangle class might add length and width fields and an area method to a shape class.
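A minimal sketch of that example (class and member names are illustrative):

```cpp
// Base class: every shape can report its area.
class Shape {
public:
    virtual double area() const = 0;   // dynamic dispatch picks the derived version
    virtual ~Shape() {}
};

// Derived by inheritance: adds length and width fields and an area method.
class Rectangle : public Shape {
    double length, width;
public:
    Rectangle(double l, double w) : length(l), width(w) {}
    double area() const { return length * width; }
};
```

Code written against Shape works with any derived class, rectangles included, without knowing its concrete type.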
A template is a function or class that can work with multiple types. The compiler generates custom
code for each different use of the template. For example, the same min template could be used for both
integers and floating-point numbers.
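Such a template can be sketched as follows (named min_of here to avoid colliding with the standard library's std::min, which works the same way):

```cpp
// The compiler generates custom code for each type T this is used with.
template <typename T>
T min_of(T a, T b) {
    return a < b ? a : b;
}
```

Calling min_of(3, 5) instantiates an int version; min_of(2.5, 1.5) instantiates a separate double version.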
C++ also provides exceptions, a mechanism intended for error recovery. Normally, each method
or function can only return directly to its immediate caller. Throwing an exception, however, allows
control to return to an arbitrary caller, usually an error-handling mechanism in the main function
or similar. Exceptions can be used, for example, to gracefully recover from out-of-memory conditions no
matter where they occur, without the tedium of having to check whether every function encountered an
out-of-memory condition.
Memory consumption is a disadvantage of C++'s exception mechanism. While most C++ compilers
do not generate slower code when exceptions are enabled, they do generate larger executables by including
tables that record the location of the nearest exception handler. For this reason, many compilers, such as
GNU's gcc, have a flag that completely disables exceptions.
#include <cmath>
#include <iostream>
using namespace std;

class Cplx {
    double re, im;
public:
    Cplx(double v) : re(v), im(0) {}
    Cplx(double r, double i)
        : re(r), im(i) {}
    double abs() const {
        return sqrt(re*re + im*im);
    }
    void operator+= (const Cplx& a) {
        re += a.re; im += a.im;
    }
};

int main() {
    Cplx a(5), b(3,4);
    b += a;
    cout << b.abs() << "\n";
    return 0;
}
FIGURE 7.4 A C++ fragment illustrating a partial complex number type and how it can be used (the C++ library
has a complete version). This class defines how to create a new complex number from either a scalar or the real
and imaginary components, how to compute the absolute value of a complex number, and how to add a
complex number to an existing one.
C++ is being used more and more within embedded systems, but it is sometimes a less suitable choice
than C for a number of reasons. First, C++ is a much more complicated language that demands a much
larger compiler, so C++ has been ported to fewer architectures than C. Second, certain language features
such as dynamic dispatch (virtual function calls) and exceptions can be too costly to implement in very
small embedded systems. Third, it is a more difficult language to learn and use properly, meaning there may be
fewer qualified C++ programmers. Finally, it is often more difficult to estimate the cost of a certain construct
in C++ because the object-oriented programming style encourages many more function calls than the
procedural style of C, and the cost of these is harder to estimate.
7.2.4 Java
Sun's Java language [5–7] resembles C++ but is not a superset. Like C++, Java is object-oriented,
providing classes and inheritance. It is a higher-level language than C++ since it uses object references,
arrays, and strings instead of pointers. Java's automatic garbage collection frees the programmer from
memory management.
Java omits a number of C++'s more complicated features. Templates are absent, although there
are plans to include them in a future release of the language because they make it possible to write
type-safe container classes. Java also omits operator overloading, which can be a boon to readability
(e.g., when performing operations on complex numbers) or a powerful obfuscating force. Java also does
not support C++'s complex multiple inheritance mechanism completely, but it does provide the notion
of an interface, a set of methods provided by a class, which is equivalent to one of the most common
uses of multiple inheritance.
Java provides concurrent threads (Figure 7.5). Creating a thread involves extending the Thread class,
creating instances of these objects, and calling their start methods to start a new thread of control that
executes the object's run method.
Synchronizing a method or block uses a per-object lock to resolve contention when two or more threads
attempt to access the same object simultaneously. A thread that attempts to gain a lock owned by another
thread will block until the lock is released, which can be used to grant a thread exclusive access to a
particular object.
For embedded systems, Java holds promise but also many caveats. On the positive side, it is a simple,
powerful language that provides the programmer a convenient set of abstractions. For example, unlike C,
Java provides true strings and variable-sized arrays. On the negative side, Java is a heavyweight language,
even more so than C++. Its runtime system is large, consisting of either a bytecode interpreter, a just-in-
time compiler, or perhaps both, and its libraries are absolutely vast. While work has been done on paring
down these things, Java still requires a much larger footprint than C.
Unpredictable runtimes are a more serious problem for Java. For time-critical embedded systems,
Java's automatic garbage collector, bytecode interpreter, or just-in-time compiler make runtimes both
unpredictable and variable, making it difficult to assess efficiency both beforehand and in simulation.
The real-time Java specification [8] attempts to address many of these concerns. It introduces mech-
anisms for more precise control over the scheduling policy for concurrent threads (the standard Java
specification is deliberately vague on this point to improve portability), memory regions for which auto-
matic garbage collection can be disabled, synchronization mechanisms for avoiding priority inversion,
and various other real-time features such as timers. It remains to be seen, however, whether this specifica-
tion addresses enough real-time concerns and is sufficiently efficient to be practical. For example, a naive
implementation of the memory management policies would be very inefficient.
7.2.5 Real-Time Operating Systems
Many embedded systems use a real-time operating system (RTOS) to simulate concurrency on a single
processor. An RTOS manages multiple running processes, each written in a sequential language such
as C. The processes perform the system's computation and the RTOS schedules them, attempting to
import java.io.*;
class Counter {
int value = 0;
boolean present = false;
public synchronized void count() {
try { while (present) wait(); }
catch (InterruptedException e) {}
value++; present = true; notifyAll();
}
public synchronized int read() {
try { while (!present) wait(); }
catch (InterruptedException e) {}
present = false; notifyAll();
return value;
}
}
class Count extends Thread {
Counter cnt;
public Count(Counter c) { cnt = c; start(); }
public void run() { for (;;) cnt.count(); }
}
class Mod5 {
public static void main(String args[]) {
Counter c = new Counter();
Count count = new Count(c);
int v;
for (;;) if ( (v = c.read()) % 5 == 0 )
System.out.println(v);
}
}
FIGURE 7.5 A contrived Java program that spawns a counting thread to print all numbers divisible by 5. The main
method in the Mod5 class creates a new Counter, then a new Count object. The Count class extends the Thread class
and spawns a new thread in its constructor by executing start. This invokes its run method, which calls the method
count. Both count and read are synchronized, meaning at most one may run on a particular Counter object at once,
here guaranteeing the counter is either counting or waiting for its value to be read.
[Figure timeline: A preempts B; A completes, allowing B to resume; B completes, allowing C to run; C completes, and A takes priority over B; A completes, allowing B to run; B completes.]
FIGURE 7.6 The behavior of an RTOS with fixed-priority preemptive scheduling. Rate-monotonic analysis gives
process A the highest priority since it has the shortest period; C has the lowest.
meet deadlines by deciding which process runs when. Labrosse [9] describes the implementation of a
particular RTOS.
Most RTOSes use fixed-priority preemptive scheduling, in which each process is given a particular
priority (a small integer) when the system is designed (Figure 7.6). At any time, the RTOS runs the
highest-priority runnable process, which is expected to run for a short period of time before suspending
itself to wait for more data. Priorities are usually assigned using rate-monotonic analysis [10] (due to Liu
and Layland [11]), which assigns higher priorities to processes that must meet more frequent deadlines.
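Liu and Layland's result also gives a simple sufficient schedulability test: n periodic processes, each needing C_i time every period T_i, all meet their deadlines under rate-monotonic priorities if the total utilization does not exceed n(2^(1/n) - 1). A sketch (structure and function names are illustrative):

```cpp
#include <cmath>
#include <vector>

struct Task {
    double compute_time;   // Ci: worst-case execution time per period
    double period;         // Ti: a deadline arrives every Ti time units
};

// Sufficient (not necessary) test for rate-monotonic fixed-priority
// scheduling: total utilization <= n * (2^(1/n) - 1).
bool rm_schedulable(const std::vector<Task> &tasks) {
    double utilization = 0.0;
    for (const Task &t : tasks)
        utilization += t.compute_time / t.period;
    double n = static_cast<double>(tasks.size());
    return utilization <= n * (std::pow(2.0, 1.0 / n) - 1.0);
}
```

For two tasks the bound is about 0.828, so tasks {C=1, T=4} and {C=1, T=8} (utilization 0.375) pass, while {C=3, T=4} and {C=2, T=8} (utilization 1.0) do not.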
[Figure timeline: L begins running and acquires a lock on the resource; M preempts L; H preempts M; H blocks waiting for the lock, so M runs; M delays the execution of L; H misses its deadline.]
FIGURE 7.7 Priority inversion illustrated. When low-priority process L acquires a lock on a resource needed by
process H, it effectively blocks process H, but then intermediate-priority process M preempts L, preventing it from
running and releasing the resource needed by H. Priority inheritance, the common solution, temporarily raises the
priority of L to that of H when H requests the resource held by L.
Priority inversion is a fundamental problem in fixed-priority preemptive scheduling that can lead to
missed deadlines by enabling a lower-priority process to delay indefinitely the execution of a higher-
priority one. Figure 7.7 illustrates the typical scenario: a low-priority process L runs and acquires a
resource. Shortly thereafter, a high-priority process H preempts L, attempts to acquire the same resource,
and blocks waiting for L to release it. This can cause H to miss its deadline even though it is at a higher
priority than L. Even worse, if a process M with priority between L and H now starts, it can delay the
execution of H indefinitely. Process M does not allow L to run since M is at a higher priority, so L cannot
execute and release the lock, and H will continue to block.
Priority inversion is usually solved with priority inheritance. When a process L acquires a lock, its
priority is temporarily raised to a level where it will not be preempted by any other process that will also
attempt to acquire the lock. Many RTOSes provide a mechanism for doing this automatically.
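The scenario of Figure 7.7 and its fix can be sketched with a toy tick-based scheduler (plain Python; the task scripts, release times, and run lengths are invented for this example and are not taken from any real RTOS):

```python
# Toy fixed-priority preemptive scheduler with one shared lock, contrasting
# plain locking with priority inheritance. All numbers are invented.
def simulate(inherit):
    # Each script is a list of ("run", ticks), ("lock",), or ("unlock",).
    tasks = {
        "L": {"prio": 1, "release": 0,
              "script": [("run", 1), ("lock",), ("run", 4), ("unlock",), ("run", 1)]},
        "M": {"prio": 2, "release": 2, "script": [("run", 4)]},
        "H": {"prio": 3, "release": 3,
              "script": [("run", 1), ("lock",), ("run", 1), ("unlock",)]},
    }
    for t in tasks.values():
        t.update(pc=0, left=None, done=None, boost=0)
    holder, time = None, 0
    while any(t["done"] is None for t in tasks.values()):
        if inherit and holder:
            # The lock holder inherits the priority of anyone blocked on it.
            blocked = [t["prio"] for n, t in tasks.items()
                       if t["done"] is None and t["release"] <= time
                       and t["script"][t["pc"]][0] == "lock" and n != holder]
            tasks[holder]["boost"] = max(blocked, default=0)
        ready = [n for n, t in tasks.items()
                 if t["done"] is None and t["release"] <= time
                 and not (t["script"][t["pc"]][0] == "lock"
                          and holder not in (None, n))]
        if not ready:
            time += 1
            continue
        # Run the ready task with the highest effective priority.
        name = max(ready, key=lambda n: max(tasks[n]["prio"], tasks[n]["boost"]))
        t = tasks[name]
        op = t["script"][t["pc"]]
        if op[0] == "lock":
            holder, t["pc"] = name, t["pc"] + 1
        elif op[0] == "unlock":
            holder, t["boost"], t["pc"] = None, 0, t["pc"] + 1
        else:                        # ("run", n): consume one tick of CPU
            t["left"] = op[1] if t["left"] is None else t["left"]
            t["left"] -= 1
            time += 1
            if t["left"] == 0:
                t["left"], t["pc"] = None, t["pc"] + 1
        if t["pc"] == len(t["script"]):
            t["done"] = time         # record the finishing time
    return {n: t["done"] for n, t in tasks.items()}

print(simulate(inherit=False)["H"])  # H is delayed by M while L holds the lock
print(simulate(inherit=True)["H"])   # boosting L lets H finish sooner
```

Without inheritance, M runs to completion before L can release the lock, so H finishes only after M; with inheritance, L briefly runs at H's priority, releases the lock, and H finishes well before M.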
7.3 Hardware Languages
Concurrency and the notion of control are the fundamental differences between hardware and software. In
hardware, every part of the program is always running, but in software, exactly one part of the program
is running at any one time. Software languages naturally focus on sequential algorithms while hardware
languages enable concurrent function evaluation and speculation.
Ironically, efficient simulation in software is a main focus of the hardware languages presented here,
so their discrete-event semantics are a compromise between what would be ideal for hardware and what
simulates efficiently.
Verilog [12,13] and VHDL [14–17] are the most popular languages for hardware description and
modeling (Figure 7.8 and Figure 7.9). Both model systems with discrete-event semantics that ignore idle
portions of the design for efficient simulation. Both describe systems with structural hierarchy: a system
consists of blocks that contain instances of primitives, other blocks, or concurrent processes. Connections
are listed explicitly.
Verilog provides more primitives geared specifically toward hardware simulation. VHDL's primitives are
assignments such as a = b + c or procedural code. Verilog adds transistor and logic gate primitives, and
allows new ones to be defined with truth tables.
Both languages allow concurrent processes to be described procedurally. Such processes sleep until
awakened by an event that causes them to run, read and write variables, and suspend. Processes may wait
for a period of time (e.g., #10 in Verilog, wait for 10 ns in VHDL), a value change (@(a or b),
wait on a, b), or an event (@(posedge clk), wait on clk until clk = '1').
Languages for Embedded Systems 7-9
VHDL communication is more disciplined and flexible. Verilog communicates through wires or regs:
shared memory locations that can cause race conditions. VHDL's signals behave like wires but the
resolution function may be user-defined. VHDL's variables are local to a single process unless declared
shared.
Verilog's type system models hardware with four-valued bit vectors and arrays for modeling memory.
VHDL does not include four-valued vectors, but its type system allows them to be added. Furthermore,
composite types such as C structs can be defined.
Overall, Verilog is the leaner language more directly geared toward simulating digital integrated circuits.
VHDL is a much larger, more verbose language capable of handling a wider class of simulation and
modeling tasks.
7.3.1 Verilog
Verilog was first devised in 1984 as an input language for a discrete-event simulator for digital hardware
design. It was one of the first hardware description languages able to specify both the circuit and a test
bench in the same language, which remains one of its strengths.
Verilog has since been pressed into use as both a modeling language and a specication language.
Although Verilog is still simulated frequently, it is also frequently fed to a logic synthesis system that
translates it into an actual circuit. This is a technically challenging process and not all Verilog constructs
can be translated into hardware since Verilog's semantics are nondeterministic and effectively defined by
the behavior of an event-driven simulator.
Verilog provides both structural and behavioral modeling styles, and allows them to be combined at
will. Consider the simple multiplexer circuit shown in Figure 7.8(a). It can be modeled in Verilog as a
schematic composed of logic gates (Figure 7.8[b]), with a continuous assignment statement that represents
logic using an expression (Figure 7.8[c]), with a truth table as a user-defined primitive (Figure 7.8[d]),
or with imperative, event-driven code (Figure 7.8[e]).
The imperative modeling style is particularly useful for creating testbenches: models of an environment
that stimulate a particular circuit and check its behavior. Figure 7.8(f) illustrates such a testbench, which
instantiates a multiplexer (the instance is called dut, for "device under test") and starts a simple process
(the initial block) to apply inputs and monitor outputs. Running Figure 7.8(f) in a Verilog simulator
gives a partial truth table for the multiplexer.
As these examples illustrate, a Verilog program is composed of modules. Each module has an interface
with named input and output ports and contains one or more instances of other modules, continuous
assignments, and imperative code in initial and always blocks. Modules perform the same information-
hiding function as functions in imperative languages: a module's contents are not visible from outside,
and names for instances, wires, and the like inside a module do not have to differ from those in other modules.
Verilog programs manipulate four-valued bit vectors intended to model digital hardware. Each bit
is 0, 1, X (representing an unknown value), or Z (representing an undriven tri-state bus). While such vectors are
very convenient for modeling circuitry, one of Verilog's shortcomings is the lack of a more sophisticated
type system. It does provide arrays of bit vectors but no other aggregate types.
The plumbing within a module comes in two varieties, one for structural modeling, the other for
behavioral. Structural components, such as instances of primitive logic gates and other modules,
communicate through wires, each of which may be connected to drivers such as gates or continuous assignments.
Conceptually, the value of a wire is computed constantly from whatever drives it. Practically, the simulator
evaluates the expression in a continuous assignment whenever any of its inputs changes.
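This evaluation rule can be mimicked in a few lines of plain Python (an illustration of the idea only; the Wire class and the reaction mechanism are invented for this sketch, and real simulators use event queues and delta cycles):

```python
# Sketch of event-driven evaluation of a continuous assignment:
# f = sel ? a : b is re-evaluated only when one of its inputs changes.
class Wire:
    def __init__(self, val=0):
        self.val, self.fanout = val, []

    def set(self, v):
        if v != self.val:            # only an actual change is an event
            self.val = v
            for react in self.fanout:
                react()              # wake everything driven by this wire

a, b, sel, f = Wire(), Wire(), Wire(), Wire()

def assign():                        # the continuous assignment's expression
    f.set(a.val if sel.val else b.val)

for w in (a, b, sel):                # the assignment is sensitive to its inputs
    w.fanout.append(assign)
assign()                             # initial evaluation

a.set(1)
print(f.val)                         # sel is 0, so f still follows b: 0
sel.set(1)
print(f.val)                         # now f follows a: 1
```

The key point the sketch shares with a real simulator is that idle logic costs nothing: setting a when sel selects b recomputes f once, finds no change, and propagates no further events.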
Behavioral components communicate through regs, which behave like memory in traditional programming
languages. The value of a reg is set by an assignment statement executed within an initial or always
block, and that value persists until the next time the reg is assigned. While a reg can be used to model
a state-holding element such as a latch or flip-flop, it is important to remember that regs are really just
memory. Figure 7.8(e) illustrates this: a reg is used to store the output of the mux, even though the mux is not a
state-holding element. This is because imperative code can only change the value of regs, not wires.
[Figure 7.8(a): multiplexer schematic built from AND gates g1 and g2, OR gate g3, and inverter g4, with internal signals nsel, f1, and f2.]
module mux(f,a,b,sel);
output f;
input a, b, sel;
and g1(f1, a, sel),
g2(f2, b, nsel);
or g3(f, f1, f2);
not g4(nsel, sel);
endmodule
module mux(f,a,b,sel);
output f;
input a, b, sel;
assign f = sel ? a : b;
endmodule
primitive
mux(f,a,b,sel);
output f;
input a, b, sel;
table
1?1 : 1;
0?1 : 0;
?10 : 1;
?00 : 0;
11? : 1;
00? : 0;
endtable
endprimitive
module mux(f,a,b,sel);
output f;
input a, b, sel;
reg f;
always @(a or b or sel)
if (sel) f = a;
else f = b;
endmodule
module testbench;
reg a, b, sel;
wire f;
mux dut(f, a, b, sel);
initial begin
$display("a,b,sel -> f");
$monitor($time,,
"%b%b%b -> %b",
a, b, sel, f);
a = 0; b = 0 ; sel = 0;
#10 a = 1;
#10 sel = 1;
#10 b = 1;
#10 sel = 0;
end
endmodule
FIGURE 7.8 Verilog examples. (a) A multiplexer circuit, (b) the multiplexer described as a Verilog structural model,
(c) the multiplexer described using a continuous assignment, (d) a user-defined primitive for the multiplexer, (e) the
multiplexer described with imperative code, (f) a testbench for the multiplexer.
Verilog is a large language that contains many now-little-used features such as switch-level transistor
models, pure event handling, and complicated delay specifications, all remnants of previous design
methodologies. Today, switch-level modeling is rarely used because Verilog's precision is too low for circuits
that take advantage of this behavior (a continuous simulator such as SPICE is preferred). Delays are rarely
used because static timing analysis has replaced event-driven simulation as the timing analysis method
of choice because of its speed and precision. Nevertheless, Verilog remains one of the most commonly used
languages for hardware design.
SystemVerilog, a recently-introduced standard (2002), is an extension to the Verilog language designed
to aid in the creation of large specications. It adds a richer set of datatypes, including C-like structures,
unions, and multidimensional arrays, a richer set of processes (e.g., an always_comb block has an
implied sensitivity to all variables it references), the concept of an interface to encapsulate communication
and function between blocks, and many other features. Whether SystemVerilog supplants Verilog as a
standard language for hardware specification remains to be seen, but it does have the advantage of being
an obvious evolutionary improvement over previous versions of Verilog.
7.3.2 VHDL
The VHDL language (VHDL is a two-level acronym, standing for VHSIC [Very High Speed Integrated
Circuit] Hardware Description Language) was designed to be a flexible modeling language for digital
systems. It lacks built-in features such as Verilog's four-valued bit vectors and gate- and transistor-level
models. Instead, it has very flexible type and package systems that allow such things to be specified in the
language.
Unlike Verilog, VHDL draws a strong distinction between the interface to a hierarchical object and its
implementation. VHDL interfaces are called entities and their implementations are called architectures.
Figure 7.9 illustrates how these are used in a simple model: the entities are essentially named lists of ports
and the architectures consist of named lists of component instances. While this increases the verbosity
of the language, it makes it possible to use different implementations, perhaps at differing levels of
abstraction.
Like Verilog, VHDL supports structural, dataflow, and behavioral modeling styles, illustrated in
Figure 7.9. As in Verilog, they can be mixed. In the three styles, an architecture is specified by listing
components and their connections (structural), as a series of equations (dataflow, like Verilog's assign
statements), or as a sequence of imperative instructions (behavioral, like Verilog's always blocks).
In general, a process runs until it reaches a wait statement. This suspends the process until a particular
event occurs, which may be an event on a signal, a condition on a signal, a timeout, or any combination
of these. By itself, wait suspends a process forever, effectively terminating it. At the other extreme, wait on Clk until Clk =
'1' for 5 ns; waits for the clock to rise or for 5 ns, whichever comes first.
Combinational processes, which always run in response to a change on any of their inputs, are
common enough to warrant a shorthand. Thus, process(A, B, C) effectively executes a wait on
A, B, C statement at the end.
VHDL's type system is much more elaborate than Verilog's. It provides integers, floating-point numbers,
enumerations, and physical quantities. Integers and floating-point numbers include a range specification.
For example, a 16-bit integer might be declared as
type address is range 16#0000# to 16#FFFF#;
Enumerated literals may be single characters or identiers. Identiers are useful for FSM states and
single characters are useful for Boolean wire values. Typical declarations:
type Bit is ('0', '1');
type FourV is ('0', '1', 'X', 'Z');
type State is (Reset, Running, Halted);
Objects in VHDL, such as types, variables, and signals, have attributes such as size, base, and range.
Such information can be useful for, say, iterating over all elements in an array. For example, if type
Index is range 31 downto 0, then Index'LOW is 0. Access to information about signals can
be used for collecting simulation statistics. For example, if Count is a signal, then Count'EVENT is true
when there is an event on the signal.
VHDL has a powerful library and package facility for encapsulating and reusing definitions. For
example, the standard logic library for VHDL includes types for representing wire states and standard
functions such as AND and OR that operate on these types. Verilog has such facilities built in, but is
not powerful enough to allow such functionality to be written as a library.
entity mux2 is
port (a, b, c: in Bit; d: out Bit);
end mux2;
architecture arch1 of mux2 is
signal cc, ai, bi : Bit; -- internal signals
component Inverter -- component interface
port (a:in Bit; y: out Bit);
end component;
component AndGate
port (a1, a2:in Bit; y: out Bit);
end component;
component OrGate
port (a1, a2:in Bit; y: out Bit);
end component;
begin
I1: Inverter port map(a => c, y => cc); -- by name
A1: AndGate port map(a, c, ai); -- by position
A2: AndGate port map(a1 => b, a2 => cc, y => bi);
O1: OrGate port map(a1 => ai, a2 => bi, y => d);
end;
architecture arch2 of mux2 is
signal cc, ai, bi : Bit;
begin
cc <= not c;
ai <= a and c;
bi <= b and cc;
d <= ai or bi;
end;
architecture arch3 of mux2 is
begin
process(a, b, c) -- sensitivity list
begin
if c = '1' then
d <= a;
else
d <= b;
end if;
end process;
end;
FIGURE 7.9 VHDL examples. Compare with Figure 7.8. (a) The entity declaration for the multiplexer, which defines
its interface, (b) a structural description of the multiplexer from Figure 7.8(a), (c) a dataflow description with one
equation per gate, (d) an imperative behavioral description.
7.4 Dataflow Languages
The hardware and software languages described earlier have semantics very close to those of their
implementations (e.g., as instructions on a sequential processor or as digital logic gates), which makes for
efficient realizations, but some problems are better described using different models of computation.
Many embedded systems perform signal processing tasks such as reconstructing a compressed audio
signal. While such tasks can be described and implemented using the hardware and software languages
described earlier, signal processing tasks are more conveniently represented with systems of processes
that communicate through queues. Although clumsy for general applications, dataflow languages are a
perfect fit for signal-processing algorithms, which use vast quantities of arithmetic derived from linear
system theory to decode, compress, or filter data streams that represent periodic samples of continuously
changing values such as sound or video. Dataflow semantics are natural for expressing the block diagrams
typically used to describe signal-processing algorithms, and their regularity makes dataflow implementations
very efficient because otherwise costly run-time scheduling decisions can be made at compile time,
even in systems containing multiple sampling rates.
7.4.1 Kahn Process Networks
Kahn Process Networks [18] form a formal basis for dataflow computation. Kahn's systems consist
of processes that communicate exclusively through unbounded point-to-point first-in, first-out queues
(Figure 7.10). Reading from a port makes a process wait until data is available, but writing to a port always
completes immediately.
Deterministic behavior is the most distinctive aspect of Kahn's networks. Processes' blocking read behavior
guarantees the overall system behavior (specifically, the sequence of data tokens that flow through each
queue) is the same regardless of the relative execution rates of the processes, that is, regardless of the
scheduling policy. This is generally a very desirable property because it provides a guarantee about the
behavior of the system, ensures that simulation and reality will match, and greatly simplifies the design
task since a designer is not obligated to ensure this herself.
Balancing processes' relative execution rates to avoid an unbounded accumulation of tokens is the
challenge in scheduling a Kahn network. One general approach, proposed in Parks' thesis [19], places
process f(in int u, in int v, out int w)
{
int i; bool b = true;
for (;;) {
i = b ? wait(u) : wait(v);
printf("%i\n", i);
send(i, w);
b = !b;
}
}
process g(in int u, out int v, out int w)
{
for (;;) {
send(wait(u), v); send(wait(u), w);
}
}
process h(in int u, out int v, int init)
{
send(init, v);
for(;;)
send(wait(u), v);
}
channel int X, Y, Z, T1, T2;
f(Y, Z, X);
g(X, T1, T2);
h(T1, Y, 0);
h(T2, Z, 1);
FIGURE 7.10 A Kahn Process Network written in a C-like dialect. Here, processes are functions that run continuously,
may be attached to communication channels, and may call wait to wait for data on a particular port and send to write
data to a particular port. The f process alternately copies from its u and v ports to its w port; the g process does the
opposite, copying its u port to alternately v and w; and h simply copies its input to its output.
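The network of Figure 7.10 can be sketched with ordinary threads and unbounded queues (a Python rendering, with queue.Queue standing in for the channels; stopping f after six tokens is an addition so the sketch terminates):

```python
import queue
import threading

def f(u, v, w, out, n):              # alternately copy u and v to w
    b = True
    for _ in range(n):
        i = u.get() if b else v.get()   # blocking read, as Kahn requires
        out.append(i)                   # stand-in for the printf
        w.put(i)                        # writes always complete immediately
        b = not b

def g(u, v, w):                      # copy u alternately to v and w
    while True:
        v.put(u.get())
        w.put(u.get())

def h(u, v, init):                   # emit an initial token, then copy u to v
    v.put(init)
    while True:
        v.put(u.get())

X, Y, Z, T1, T2 = (queue.Queue() for _ in range(5))
out = []
main = threading.Thread(target=f, args=(Y, Z, X, out, 6))
for target, args in ((g, (X, T1, T2)), (h, (T1, Y, 0)), (h, (T2, Z, 1))):
    threading.Thread(target=target, args=args, daemon=True).start()
main.start()
main.join()
print(out)   # the same token sequence regardless of thread scheduling
```

Because every read blocks until a token arrives, the recorded sequence is the same no matter how the operating system interleaves the four threads, which is exactly Kahn's determinism guarantee.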
[Figure 7.11: SDF graph of a modem built from blocks In, Filt, Hil, Eq, Mul, Conj, Fork, sc, Add, Biq, Deci, Deco, and Out; the numbers on each arc give the tokens produced or consumed per firing, with rates such as 1, 2, 4, and 8 mixed in the same graph.]
FIGURE 7.11 A modem in SDF. Each node represents a process. The labels on each arc indicate the number of
tokens sent or received by a process each time it fires.
artificial limits on the size of each buffer. Any process that writes to a full buffer blocks until space is
available, but if the system deadlocks because all buffers are full, the scheduler increases the capacity of
the smallest buffer.
In practice, Kahn networks are rarely used in their pure form since they are fairly costly to schedule
and their completely deterministic behavior is sometimes overly restrictive since they cannot easily
handle sporadic events (e.g., an occasional change of volume level in a digital volume control) or server-
like behavior where the environment may make requests in an unpredictable order. Nevertheless, Kahn's
model still has useful properties and forms a starting point for other dataflow models.
7.4.2 Synchronous Dataow
Lee and Messerschmitt's [20] Synchronous Dataflow (SDF) fixes the communication patterns of the blocks
in a Kahn network (Figure 7.11 is an example after Bhattacharyya et al. [21]). Each time a block runs,
it consumes and produces a fixed number of data tokens on each of its ports. Although more restrictive
than Kahn networks, SDF's predictability allows it to be scheduled completely at compile time, producing
very efficient code.
Scheduling operates in two steps. First, the rate at which each block fires is established by considering
the production and consumption rates of each block at the source and sink of each queue. For example,
the arc between the Hil and Eq nodes in Figure 7.11 implies Hil runs twice as frequently. Once the rates
are established, any algorithm that simulates the execution of the network without buffer underflow will
produce a correct schedule if one exists. However, more sophisticated techniques reduce generated code
and buffer sizes by better ordering the execution of the blocks (see Bhattacharyya et al. [22]).
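The first step, solving the balance equations for the firing rates, fits in a few lines (a Python sketch over an invented three-block graph, not the modem of Figure 7.11):

```python
from fractions import Fraction
from functools import reduce
from math import lcm

# Arcs of a small SDF graph: (src, dst, tokens produced, tokens consumed).
arcs = [("A", "B", 2, 3), ("B", "C", 1, 2)]

# Balance equation for each arc: rate[src] * produced == rate[dst] * consumed.
rates = {"A": Fraction(1)}           # pick an arbitrary reference block
changed = True
while changed:                       # propagate rates along the arcs
    changed = False
    for s, d, p, c in arcs:
        if s in rates and d not in rates:
            rates[d] = rates[s] * p / c
            changed = True
        elif d in rates and s not in rates:
            rates[s] = rates[d] * c / p
            changed = True

# Scale to the smallest integer firing counts for one schedule period.
denom = reduce(lcm, (r.denominator for r in rates.values()))
firings = {name: int(r * denom) for name, r in rates.items()}
print(firings)   # {'A': 3, 'B': 2, 'C': 1}
```

Here A must fire three times for every two firings of B and one of C: three firings of A produce six tokens, exactly what two firings of B consume, and so on down the chain. An inconsistent graph would yield no such integer solution.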
Synchronous dataflow specifications are built by assembling blocks typically written in an imperative
language such as C. The SDF block interface is specific enough to make it easy to create libraries of
general-purpose blocks such as adders, multipliers, and even FIR filters.
While SDF is often used as a simulation language, it is also well-suited to code generation. It enables a
practical technique for generating code for digital signal processors, for which C compilers often cannot
generate efficient code. Assembly code is handcrafted for each block in a library, and code synthesis
consists of assembling these handwritten blocks, sometimes generating extra code that handles the inter-
block buffers. For large, specialized blocks such as fast Fourier transforms, this can be very effective because
most of the generated code was carefully optimized by hand.
7.5 Hybrid Languages
The languages in this section use even more novel models of computation than the hardware, software, or
dataflow languages presented earlier (Table 7.3). While such languages are more restrictive than general-
purpose ones, they are much better suited for certain applications. Esterel excels at discrete control by
blending software-like control flow with the synchrony and concurrency of hardware. Communication
TABLE 7.3 Hybrid Language Features Compared
The table compares Esterel, SDL, and SystemC on the following features, marking each as fully or partially supported: concurrency, hierarchy, preemption, determinism, synchronous communication, buffered communication, FIFO communication, procedural code, finite-state machines, dataflow, multi-rate dataflow, software implementation, and hardware implementation.
protocols are SDL's forte; it uses extended finite-state machines with single input queues. SystemC provides
a very flexible discrete-event simulation environment built on C++.
7.5.1 Esterel
Intended for specifying control-dominated reactive systems, Esterel [23] combines the control constructs
of an imperative software language with concurrency, preemption, and a synchronous model of time
like that used in synchronous digital circuits. In each clock cycle, the program awakens, reads its inputs,
produces outputs, and suspends.
An Esterel program communicates through signals that are either present or absent each cycle. In each
cycle, each signal is absent unless an emit statement for the signal runs and makes the signal present for
that cycle only. Esterel guarantees determinism by requiring each emitter of a signal to run before any
statement that tests the signal.
Esterel is strongest at specifying hierarchical state machines. In addition to sequentially composing
statements (separated by a semicolon), it has the ability to compose arbitrary blocks of code in parallel
(the double vertical bars) and abort or suspend a block of code when a condition is true. For example, the
every-do construct in Figure 7.12 effectively wraps a reset statement around two state machines running
in parallel.
7.5.2 SDL
SDL is a graphical specification language, defined by the ITU [24], for describing telecommunication
protocols (Ellsberger [25] is more readable). A system consists of concurrently running FSMs,
each with a single input queue, connected by channels that define which messages they carry. Each FSM
consumes the message at the head of its queue, reacts to it by changing internal state or sending messages
to other FSMs, changes to its next state, and repeats the process. Each FSM is deterministic, but because
messages from other FSMs may arrive in any order because of varying execution speed and communication
delays, an SDL system may behave nondeterministically.
In addition to a fairly standard textual format, SDL has a formalized graphical notation. There are three
types of diagrams. Flowcharts define the behavior of state machines at the lowest level (Figure 7.13). Block
diagrams illustrating the communication among state machines local to a single processor are at the next
level up. Each communication channel is labeled with the set of messages that it conveys. The top level is
another block diagram that depicts the communication among processors. The communication channels
module Example:
input S, I;
output O;
signal R, A in
every S do
await I;
weak abort
sustain R
when immediate A;
emit O
||
loop
pause; pause;
present R then emit A end;
end
end
end
end module
FIGURE 7.12 An Esterel program modeling a shared resource. This implements two parallel threads (separated
by ||), one that waits for an I signal, then asserts R until it receives an A from the other thread and emits an O.
Meanwhile, the second thread emits an A in response to an R in alternate cycles.
[Figure 7.13: flowchart fragment with states Estab, wait1, and Closed; received signals Close and Packet; actions such as Seqn := Seq, Seq := Seq + 1, Fin := 1, Ackn := Seqn + 1, Ack := Ack + 1; decisions Fin?, Rst?, and Size?; and emitted signals Packet and Len(9).]
FIGURE 7.13 A fragment of an SDL flowchart specification for a TCP protocol. The rounded boxes denote states
(Estab, wait1, and Closed). Immediately below Estab are inward-pointing boxes that receive signals (Close, Packet).
The square and diamond boxes below these are actions and decisions. The outward-pointing boxes (e.g., Packet) emit
signals.
in these diagrams are also labeled with the signals they convey, but are assumed to have significant delay,
unlike the channels among FSMs in a single processor.
The behavior of an SDL state machine is straightforward. At the beginning of each cycle, it gets the next
signal in its input queue and sees if there is a receive block for that signal off the current state. If there is,
the associated code is executed, possibly emitting other signals and moving to a next state. Otherwise, the
signal is simply discarded and the cycle repeats from the same state. By itself, such semantics have a hard
time dealing with signals that arrive out of order, but SDL has an additional construct for handling this
condition. The save construct is like the receive construct, appearing immediately below a state and matching a
signal, but when it matches it stores the signal in a buffer that holds it until another state has a matching rule.
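These semantics, save included, can be sketched in a few lines (plain Python; the states and signal names are invented, and replaying saved signals on every transition is a simplification of SDL's actual queue-scanning rule):

```python
from collections import deque

# Toy SDL-style process: one input queue, discard-by-default, plus save.
class SdlProcess:
    def __init__(self, transitions, saves, start):
        self.q = deque()                 # the single input queue
        self.saved = deque()             # signals deferred by save
        self.transitions = transitions   # (state, signal) -> (action, next state)
        self.saves = saves               # (state, signal) pairs to defer
        self.state = start
        self.log = []

    def send(self, sig):
        self.q.append(sig)

    def step(self):
        if not self.q:
            return False
        sig = self.q.popleft()
        key = (self.state, sig)
        if key in self.transitions:
            action, self.state = self.transitions[key]
            self.log.append(action)
            self.q.extendleft(reversed(self.saved))  # replay saved signals
            self.saved.clear()
        elif key in self.saves:
            self.saved.append(sig)       # hold the signal for a later state
        # otherwise the signal is simply discarded
        return True

p = SdlProcess(
    transitions={("idle", "conn"): ("ack", "open"),
                 ("open", "data"): ("deliver", "open")},
    saves={("idle", "data")},            # data arriving before conn is saved
    start="idle")
for sig in ["data", "conn", "data"]:     # "data" arrives out of order
    p.send(sig)
while p.step():
    pass
print(p.log)   # ['ack', 'deliver', 'deliver']
```

Without the save entry, the early data signal would be silently discarded in the idle state and only one deliver would appear in the log.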
7.5.3 SystemC
The SystemC language (Figure 7.14) models systems in C++. A SystemC specification is
simulated by compiling it with a standard C++ compiler and linking in freely distributed class libraries
from www.systemc.org.
The SystemC language builds systems from Verilog- and VHDL-like modules. Each has a collection of
I/O ports and may contain instances of other modules or processes defined by a block of C++ code.
SystemC uses a discrete-event simulation model. The SystemC scheduler executes the code in a process
in response to an event such as a clock signal, or after a delay. This model resembles that used in Verilog and
VHDL, but has the flexibility of operating within a general-purpose programming language.
SystemC began life aiming to replace Verilog or VHDL as a hardware description language (it did
not offer designers a sufficiently compelling reason to switch), but has since moved beyond that. Very
often in system design, it is desirable to run simulations to estimate such high-level behavior as bus
activity or memory accesses. Historically, designers wrote custom simulators in a general-purpose
language such as C, but this was time-consuming because of the need to write a new simulation kernel
(i.e., something that provided concurrency) for each new simulator.
SystemC is emerging as a standard for writing system-level simulations. While not perfect, it works
well enough and makes it fairly easy to glue large pieces of existing software together. Although Verilog
has a PLI (programming language interface) that allows arbitrary C/C++ code to be linked and run
simultaneously with a simulation, the tighter integration of the SystemC approach is more efficient.
#include "systemc.h"
struct complex_mult : sc_module {
sc_in<int> a, b;
sc_in<int> c, d;
sc_out<int> x, y;
sc_in_clk clock;
void do_mult() {
for (;;) {
x = a * c - b * d;
wait();
y = a * d + b * c;
wait();
}
}
SC_CTOR(complex_mult) {
SC_CTHREAD(do_mult, clock.pos());
}
};
FIGURE 7.14 A SystemC model for a complex multiplier.
SystemC supports transaction-level modeling, in which bus transactions, rather than being modeled
on a per-cycle basis as would be done in a language such as Verilog, are modeled as function calls. For
example, a burst-mode bus transfer would be modeled with a function that marks the bus as in use,
advances simulation time according to the number of bytes to be transferred, actually copies the data in
the simulator, and marks the bus as unused. Nowhere in the simulation would the actual sequence of
signals and bits transferred over the bus appear.
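The burst-transfer example reads roughly like this (plain Python rather than SystemC, with an invented cost of one time unit per byte; the point is only that no per-cycle bus signals appear anywhere):

```python
# Transaction-level sketch of a burst-mode bus write: one function call
# replaces the cycle-by-cycle bus protocol.
class Bus:
    def __init__(self):
        self.now = 0                 # simulated time
        self.busy = False

    def burst_write(self, mem, addr, data):
        assert not self.busy, "bus already in use"
        self.busy = True             # mark the bus as in use
        self.now += len(data)        # advance time for the whole burst at once
        mem[addr:addr + len(data)] = data   # copy the data in one step
        self.busy = False            # mark the bus as free again

mem = bytearray(16)
bus = Bus()
bus.burst_write(mem, 4, b"\x01\x02\x03")
print(bus.now, mem[4:7])   # 3 bytearray(b'\x01\x02\x03')
```

Trading away the signal-level detail is what makes such models orders of magnitude faster to simulate than the equivalent register-transfer-level description.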
7.6 Summary
Currently, most embedded systems are programmed using C for software and Verilog, or possibly VHDL,
for hardware components suchas FPGAs or ASICs, but this will probably change. The increasedcomplexity
of such designs makes a compelling case for different, higher-level languages. Years ago, designers made
the jump fromassembly to C, and the higher-level constructs of Java are growing more attractive despite
its performance loss.
Domain-specific languages, especially for signal-processing problems, already have a significant
beachhead, and will continue to make inroads. Most signal processing algorithms are already prototyped using
a higher-level language (Matlab), but it remains to be seen whether synthesis from Matlab will ever be
practical.
For hardware, the direction is less clear. While modeling languages such as SystemC will continue to
grow in importance, there is currently no clear winner for the successor to VHDL and Verilog. Roughly
a decade ago, a different, high-level subset of VHDL and Verilog was proposed as the new behavioral
synthesis subset, but did not catch on because it was too limiting, largely because of restrictions placed on
it by the synthesis algorithms. Additions such as SystemVerilog are incremental, if helpful, improvements,
but will not provide the quantum leap forward that synthesis from the RTL (register-transfer level)
subsets of Verilog and VHDL provided. Perhaps future hardware languages may contain constructs such
as Esterel's.
References
[1] Stephen A. Edwards. Languages for Digital Embedded Systems. Kluwer, Boston, MA, September
2000.
[2] Dennis M. Ritchie. The Development of the C Language. In History of Programming Languages II.
Thomas J. Bergin, Jr. and Richard G. Gibson, Jr., Eds. ACM Press, New York and Addison-Wesley,
Reading, MA, 1996.
[3] Brian W. Kernighan and Dennis M. Ritchie. The C Programming Language, 2nd ed. Prentice Hall,
Upper Saddle River, NJ, 1988.
[4] Bjarne Stroustrup. The C++ Programming Language, 3rd ed. Addison-Wesley, Reading, MA, 1997.
[5] Ken Arnold, James Gosling, and David Holmes. The Java Programming Language, 3rd ed.
Addison-Wesley, Reading, MA, 2000.
[6] James Gosling, Bill Joy, Guy Steele, and Gilad Bracha. The Java Language Specication, 2nd ed.
Addison-Wesley, Reading, MA, 2000.
[7] Tim Lindholm and Frank Yellin. The Java Virtual Machine Specification. Addison-Wesley, Reading,
MA, 1999.
[8] Greg Bollella, Ben Brosgol, Peter Dibble, Steve Furr, James Gosling, David Hardin, Mark Turnbull,
Rudy Belliardi, Doug Locke, Scott Robbins, Pratik Solanki, and Dionisio de Niz. The Real-Time
Specication for Java. Addison-Wesley, Reading, MA, 2000.
[9] Jean Labrosse. MicroC/OS-II. CMP Books, Lawrence, Kansas, 1998.
[10] Loic P. Briand and Daniel M. Roy. Meeting Deadlines in Hard Real-Time Systems: The Rate
Monotonic Approach. IEEE Computer Society Press, New York, 1999.
[11] C. L. Liu and James W. Layland. Scheduling Algorithms for Multiprogramming in a Hard Real-
Time Environment. Journal of the Association for Computing Machinery, 20: 4661, 1973.
[12] IEEE Computer Society. IEEE Standard Hardware Description Language Based on the Verilog
Hardware Description Language (13641995). IEEE Computer Society Press, NewYork, 1996.
[13] Donald E. Thomas and Philip R. Moorby. The Verilog Hardware Description Language, 4th ed.
Kluwer, Boston, MA, 1998.
[14] IEEE Computer Society. IEEE Standard VHDL Language Reference Manual (1076–1993). IEEE
Computer Society Press, New York, 1994.
[15] Douglas L. Perry. VHDL, 3rd ed. McGraw-Hill, New York, 1998.
[16] Ben Cohen. VHDL Coding Styles and Methodologies, 2nd ed. Kluwer, Boston, MA, 1999.
[17] Peter J. Ashenden. The Designer's Guide to VHDL. Morgan Kaufmann, San Francisco, CA, 1996.
[18] Gilles Kahn. The Semantics of a Simple Language for Parallel Programming. In Information
Processing 74: Proceedings of IFIP Congress 74. North-Holland, Stockholm, Sweden, August 1974,
pp. 471–475.
[19] Thomas M. Parks. Bounded Scheduling of Process Networks. PhD thesis, University of California,
Berkeley, 1995. Available as UCB/ERL M95/105.
[20] Edward A. Lee and David G. Messerschmitt. Synchronous Data Flow. Proceedings of the IEEE,
75: 1235–1245, 1987.
[21] Shuvra S. Bhattacharyya, Praveen K. Murthy, and Edward A. Lee. Synthesis of Embedded Software
from Synchronous Dataflow Specifications. Journal of VLSI Signal Processing Systems, 21: 151–166,
1999.
[22] Shuvra S. Bhattacharyya, Rainer Leupers, and Peter Marwedel. Software Synthesis and Code
Generation for Signal Processing Systems. IEEE Transactions on Circuits and Systems II: Analog
and Digital Signal Processing, 47: 849–875, 2000.
[23] Gérard Berry and Georges Gonthier. The Esterel Synchronous Programming Language: Design,
Semantics, Implementation. Science of Computer Programming, 19: 87–152, 1992.
[24] International Telecommunication Union. ITU-T Recommendation Z.100: Specification and
Description Language. International Telecommunication Union, Geneva, 1999.
[25] Jan Ellsberger, Dieter Hogrefe, and Amardeo Sarma. SDL: Formal Object-Oriented Language for
Communicating Systems, 2nd ed. Prentice Hall, Upper Saddle River, NJ, 1997.
8
The Synchronous
Hypothesis and
Synchronous
Languages
Dumitru Potop-Butucaru
IRISA
Robert de Simone
INRIA
Jean-Pierre Talpin
IRISA
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-1
8.2 The Synchronous Hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-2
    What For? • Basic Notions • Mathematical Models •
    Implementation Issues
8.3 Imperative Style: Esterel and SyncCharts . . . . . . . . . . . . . . 8-5
    Syntax and Structure • Semantics • Compilation and
    Compilers • Analysis/Verification/Test Generation: Benefits
    from Formal Approaches
8.4 The Declarative Style: Lustre and Signal . . . . . . . . . . . . . . . 8-11
    A Synchronous Model of Computation • Declarative Design
    Languages • Compilation of Declarative Formalisms
8.5 Success Stories: A Viable Approach for System Design . . . 8-18
8.6 Into the Future: Perspectives and Extensions . . . . . . . . . . 8-18
    Asynchronous Implementation of Synchronous Specifications
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8-20
8.1 Introduction
Electronic Embedded Systems are not new, but their pervasive introduction in ordinary-life objects (cars,
phones, home appliances) brought a new focus onto design methods for such systems. New development
techniques are needed to meet the challenges of productivity in a competitive environment. This handbook
reports on a number of such innovative approaches to the matter. Here, we shall concentrate on
Synchronous Reactive (S/R) languages [1–4].
S/R languages rely on the synchronous hypothesis, which divides computations and behaviors into a
discrete sequence of computation steps that are equivalently called reactions or execution instants. In itself,
Partly supported by the ARTIST IST European project.
this assumption is rather common in practical embedded system design. But the synchronous hypothesis
adds the fact that, inside each instant, the behavioral propagation is well-behaved (causal), so that the status
of every signal or variable is established and defined prior to being tested or used. This criterion, which
may be seen at first as an isolated technical requirement, is in fact the key point of the approach. It ensures
strong semantic soundness by allowing universally recognized mathematical models, such as Mealy
machines and digital circuits, to be used as supporting foundations. In turn, these models give access to a
large corpus of efficient optimization, compilation, and formal verification techniques. The synchronous
hypothesis also guarantees full equivalence between various levels of representation, thereby avoiding
altogether the pitfalls of nonsynthesizability of other similar formalisms. In that sense, the synchronous
hypothesis is, in our view, a major contribution to the goal of model-based design of embedded systems.
Structured languages have been introduced for the modeling and programming of S/R applications.
They are roughly classied into two families:
• Imperative languages, such as Esterel [5–7] and SyncCharts [8], provide constructs to shape control-
dominated programs as hierarchical synchronous automata, in the wake of the StateCharts
formalism, but with a full-fledged treatment of simultaneity, priority, and absence notification
of signals in a given reaction. Thanks to this, signals assume a consistent status for all parallel
components in the system at any given instant.
• Declarative languages, such as Lustre [9] and Signal [10], shape applications based on intensive data
computation and data-flow organization, with the control flow part operating under the form of
(internally generated) activation clocks. These clocks prescribe which data computation blocks
are to be performed as part of the current reaction. Here again, the semantics of the languages
deal with the issue of behavior consistency, so that every value needed in a computation is indeed
available at that instant.
Here, we shall describe the synchronous hypothesis and its mathematical background, together with
a range of design techniques empowered by the approach and a short comparison with neighboring
formalisms; then, we introduce both classes of S/R languages, with their special features and a couple
of programming examples; finally, we comment on the benefits and shortcomings of S/R modeling,
concluding with a look at future perspectives and extensions.
8.2 The Synchronous Hypothesis
8.2.1 What For?
Program correctness (the process performs as intended) and program efciency (it performs as fast as
possible) are major concerns in computer science, but they are even more stringent in the embedded area,
as no online debugging is feasible, and time budgets are often imperative (for instance in multimedia
applications).
Program correctness is sought by introducing appropriate syntactic constructs and dedicated languages,
making programs more easily understandable by humans, as well as allowing high-level modeling and
associated verification techniques. Provided semantic preservation is ensured down to actual implementation
code, this provides reasonable guarantees on functional correctness. However, while this might
sound obvious for traditional software compilation schemes, the hardware synthesis process is often not
seamless, as it includes manual rewriting.
Program efficiency is traditionally handled in the software world by algorithmic complexity analysis,
expressed in terms of individual operations. But in modern systems, owing to a number of phenomena,
this high-level complexity reflects rather imperfectly the low-level complexity in numbers
of clock cycles spent. In the hardware domain, one considers various levels of modeling, corresponding
to more abstract (or conversely more precise) timing accounts: transaction level, cycle accurate, time
accurate.
One possible way (amongst many) to view synchronous languages is to take up the analogy of
cycle-accurate programming in a more general setting, including (reactive) software as well. This analogy
is supported by the fact that simulation environments in many domains (from scientific engineering
to Hardware Description Language [HDL] simulators) often use lockstep computation paradigms, very
close to the synchronous cycle-based computation. In these settings, cycles represent logical steps, not
physical time. Of course timing analysis is still possible afterwards, and in fact often simplied by the
previous division into cycles.
The focus of synchronous languages is thus to allow modeling and programming of systems where cycle
(computation step) precision is needed. The objective is to provide domain-specific structured languages for
their description, and to study matching techniques for efficient design, including compilation/synthesis,
optimization, and analysis/verification. The strong condition ensuring the feasibility of these design
activities is the synchronous hypothesis, described in Section 8.2.2.
8.2.2 Basic Notions
What has come to be known as the synchronous hypothesis, laying foundations for S/R systems, is really
a collection of assumptions of a common nature, sometimes adapted to the framework considered. We
shall avoid heavy mathematical formalization in this presentation, and refer the interested reader to the
existing literature, such as References 3 and 4. The basics are:
Instants and reactions. Behavioral activities are divided according to (logical, abstract) discrete time. In
other words, computations are divided according to a succession of nonoverlapping execution instants.
In each instant, input signals possibly occur (for instance, by being sampled), internal computations take
place, and control and data are propagated until output values are computed and a new global system
state is reached. This execution cycle is called the reaction of the system to the input signals. Although
we used the word "time" just before, there is no real physical time involved, and instant durations need
not be uniform (or even considered!). All that is required is that reactions converge and computations are
entirely performed before the current execution instant ends and a new one begins. This empowers the
obvious conceptual abstraction that computations are infinitely fast ("instantaneous," "zero-time") and
take place only at discrete points in (physical) time, with no duration. When presented without sufficient
explanations, this strong formulation of the synchronous hypothesis is often discarded by newcomers as
"unrealistic" (while, again, it is only an abstraction, amply used in other domains where "all-or-nothing"
transaction operations take place).
Signals. Broadcast signals are used to propagate information. At each execution instant, a signal can
either be present or absent. If present, it also carries some value of a prescribed type (pure signals exist
as well, which carry only their presence status). The key rule is that a signal must be consistent (same
present/absent status, same data) for all read operations during any given instant. In particular, reads
from parallel components must be consistent, meaning that signals act as controlled shared variables.
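To fix ideas, this signal discipline can be mimicked in a few lines of Python (our own toy sketch, not from the chapter; the signal names are invented): a reaction receives the set of signals present at an instant, every test within the instant sees the same status, and absence is as testable as presence.

```python
def reaction(inputs):
    """One execution instant: `inputs` is the set of signals present now.
    Every presence test within the instant sees the same status, and the
    outputs belong to the same instant (conceptually zero-time)."""
    outputs = set()
    if "I" in inputs:        # presence can be tested...
        outputs.add("O")
    if "I" not in inputs:    # ...and so can absence (synchronous hypothesis)
        outputs.add("IDLE")
    return outputs

# A run is a sequence of instants, each mapping an input valuation to outputs.
trace = [reaction(s) for s in [{"I"}, set(), {"I", "J"}]]
```

Note that the output of an instant depends only on that instant's input valuation; no physical durations appear anywhere.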
Causality. The crucial task of deciding when a signal can be declared absent is of the utmost importance
in the theory of S/R systems, and an important part of the theoretical body behind the synchronous
hypothesis. This is of course especially true for local signals, which are both generated and tested inside the
system. The fundamental rule is that the present status and value of a signal should be defined before they
are read (and tested). This requirement takes various practical forms depending on the actual language
or formalism considered, and we shall come back to this later. Here, note that "before" refers to causal
dependency in the computation of the instant, and not to physical or even logical time between successive
instants [11]. The synchronous hypothesis ensures that all possible schedules of operations amount to the
same result (convergence); it also leads to the definition of correct programs, as opposed to ill-behaved
ones where no causal scheduling can be found.
Activation conditions and clocks. Each signal can be seen as defining (or generating) a new clock, ticking
when it occurs; in hardware design, this is called "gated clocks." Clocks and sub-clocks, either externally or
internally generated, can be used as control entities to activate (or not) component blocks of the system.
We shall also call them activation conditions.
8.2.3 Mathematical Models
If one temporarily forgets about data values, and one accepts the duality of present/absent signals mapped
onto true/false values, then there is a natural interpretation of synchronous formalisms as synchronous
digital circuits at schematic gate level, or netlists (roughly RTL level with only Boolean variables and
registers). In turn, such circuits have a straightforward behavioral expansion into Mealy Finite State
Machines (FSMs).
The two slight restrictions given here are not essential: the adjunction of types and values into digital
circuit models has been successfully attempted in a number of contexts, and S/R systems can also be seen
as contributing to this goal. Meanwhile, the introduction of clocks and present/absent signal status in
S/R languages departs drastically from the prominent notion of sensitivity list generally used to define the
simulation semantics of HDLs.
We now comment on the opportunities made available through the interpretation of S/R systems into
Mealy machines or netlists:
Netlists. Here, we consider netlists in a simple form, as Boolean equation systems defining the values
of wires and Boolean registers as Boolean functions of other wires and previous register values. Some
wires represent input and output signals (with the value true indicating signal presence); others are internal
variables. This type of representation is of special interest because it can provide exact dependency
relations between variables, and thus a good representation level at which to study causality issues with accurate
analysis. Notions of constructive causality have been the subject of much attention here. They attempt
to refine the usual crude criterion for synthesizability, which forbids cyclic dependencies between nonregister
variables (so that a variable seems to depend upon itself in the same instant), but neither takes into
account the Boolean interpretation, nor the potentially reachable configurations. Consider the equation
x = y ∨ z, while it has been established that y is the constant true. Then x does not really depend on z,
since its (constant) value is forced by y's. Constructive causality seeks the best possible faithful notion
of true combinatorial dependency, taking the Boolean interpretation of functions into account. For details,
see Reference 12.
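The distinction can be checked mechanically. The snippet below (our own illustration, not from the text) compares the z-cofactors of the equation x = y ∨ z: structurally x reads z, but once y is established to be the constant true, the two cofactors coincide and the dependency on z vanishes.

```python
# The netlist equation x = y OR z, as a Boolean function.
def x(y, z):
    return y or z

# Crude structural causality: x depends on z if some y makes the
# two z-cofactors of x differ.
structural_dep = any(x(y0, False) != x(y0, True) for y0 in (False, True))

# With the fact that y is constantly true, the z-cofactors agree,
# so x carries no true combinatorial dependency on z.
true_dep_given_y = x(True, False) != x(True, True)
```

The same cofactor test is what a constructive analysis applies, restricted to the configurations that are actually reachable.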
Another equally important aspect of the mathematical model is that a number of combinatorial and
sequential optimization techniques have been developed over the years in the context of hardware
synthesis approaches. The main ones are now embedded in the SIS and MVSIS optimization suites from UC
Berkeley [13, 14]. They come as a great help in allowing programs written in high-level S/R formalisms to
compile into efficient code, either software or hardware targeted [15].
Mealy machines. Mealy machines are finite-state automata corresponding strictly to the synchronous
assumption. In a given state, given a certain input valuation (a subset of present signals), the machine
reacts by immediately producing a set of output signals before entering a new state.
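Such a machine can be sketched as a transition table; the example below is our own (states and signal names are invented): one reaction maps the current state and input valuation to the outputs and the next state, in a single step.

```python
# Transition table: (state, frozenset of present inputs) -> (outputs, next state).
# A tiny two-state machine: emits O on the first A, then ignores A until reset by R.
TRANS = {
    ("idle", frozenset({"A"})): ({"O"}, "done"),
    ("idle", frozenset()):      (set(), "idle"),
    ("done", frozenset({"R"})): (set(), "idle"),
    ("done", frozenset()):      (set(), "done"),
}

def react(state, inputs):
    """One synchronous reaction: outputs are produced in the same instant
    as the inputs, before the new state is entered."""
    return TRANS[(state, frozenset(inputs))]
```

For instance, `react("idle", {"A"})` yields the outputs `{"O"}` together with the next state `"done"`.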
Mealy machines can be generated from netlists (and, by extension, from any S/R system). The Mealy
machine construction can then be seen as a symbolic expansion of all possible behaviors, computing
the space of reachable states (RSS) on the way. But while the precise RSS is gained, the precise causal
dependency relations are lost, which is why both the Mealy FSM and netlist models are useful in the course
of S/R design [16].
When the RSS is extracted, often in symbolic Binary Decision Diagram (BDD) form, it can be used in a
number of ways: we already mentioned that constructive causality only considers dependencies inside the
RSS; similarly, all activities of model-checking formal verification and test coverage analysis are strongly
linked to the RSS construction [17–20].
The modeling style of netlists can be extrapolated to block-diagram networks, often used in multimedia
digital signal processing, by adding more types and arithmetic operators, as well as activation conditions
to introduce some amount of control flow. The declarative synchronous languages can be seen as
attempts to provide structured programming to compose large systems modularly in this class of applications,
as described in Section 8.4. Similarly, imperative languages provide ways to program, in a structured
way, hierarchical systems of interacting Mealy FSMs, as described in Section 8.3.
reaction() {
    decode state; read input;
    compute;
    write output; encode state;
}
FIGURE 8.1 The reaction function is called at each instant to perform the computation of the current step.
8.2.3.1 Synchronous Hypothesis versus Neighboring Models
Many quasi-synchronous formalisms exist in the fields of embedded system (co)simulation: the simulation
semantics of SystemC and regular HDLs at RTL level, the discrete-step Simulink/Stateflow simulation,
or the official StateCharts semantics, for instance. Such formalisms generally employ a notion of physical
time in order to establish when to start the next execution instant. Inside the current execution instant,
however, delta-cycles allow zero-delay activity propagation, and potentially complex behaviors occur inside
a given single reaction. The main difference here is that no causality analysis (based on the synchronous
hypothesis) is performed at compilation time, so that an efficient ordering/scheduling cannot be precomputed
before simulation. Instead, each variable change recursively triggers further recomputations of all
depending variables in the same reaction.
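This delta-cycle style can be sketched as a run-to-fixpoint loop: all equations are re-evaluated until nothing changes, instead of being executed once in a schedule fixed at compile time. The code below is our own illustration with made-up equations; note that the equation list is deliberately out of causal order.

```python
def delta_cycle_reaction(inputs):
    """HDL-style simulation of one instant: re-evaluate all equations
    until a fixpoint is reached, rather than once in a precomputed
    causal order."""
    env = dict(inputs)
    env.setdefault("b", False)
    env.setdefault("c", False)
    equations = [
        ("c", lambda e: e["b"]),   # c = b, deliberately listed before its driver
        ("b", lambda e: e["a"]),   # b = a
    ]
    changed = True
    while changed:                 # each pass is one delta cycle
        changed = False
        for name, f in equations:
            v = f(env)
            if env[name] != v:
                env[name] = v
                changed = True
        # loop again: the change to b may have invalidated c
    return env
```

A compiler applying causal analysis would instead order b before c once and evaluate each equation exactly once per instant.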
8.2.4 Implementation Issues
The problem of implementing a synchronous specification mainly consists in defining the "step" reaction
function that will implement the behavior of an instant, as shown in Figure 8.1. The global behavior
is then computed by iterating this function for successive instants and successive input signal valuations.
Following the basic mathematical interpretations, the compilation of an S/R program may either consist in
the expansion into a flat Mealy FSM, or in the translation into a flat netlist (with more types and arithmetic
operators, but without activation conditions). Here, the runtime implementation consists in the execution
of the resulting Mealy machine or netlist. In the first case, the automaton structure is implemented as a
big top-level switch between states. In the second case, the netlist is totally ordered in a way compatible
with causality, and all the equations in the ordered list are evaluated at each execution instant. These basic
techniques are at the heart of the first compilers, and of some industrial ones.
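The automaton-style scheme can be sketched as follows (our own sketch; the state encoding and signals are invented): the step function is a top-level switch on the current state, and the global behavior is obtained by iterating it over successive instants.

```python
def step(state, inputs):
    """Compiled Mealy automaton: a top-level switch between states."""
    if state == 0:                     # state 0: waiting for I
        if "I" in inputs:
            return {"O"}, 1
        return set(), 0
    else:                              # state 1: waiting for J
        if "J" in inputs:
            return set(), 0
        return set(), 1

def run(input_trace):
    """Iterate the reaction function over successive input valuations."""
    state, out_trace = 0, []
    for inputs in input_trace:
        outs, state = step(state, inputs)
        out_trace.append(outs)
    return out_trace
```

The netlist scheme would replace the switch by a causally ordered list of equations, all evaluated at every instant.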
In the last decade, fancier implementation schemes have been sought, relying on the use of activation
conditions: during each reaction, execution starts by identifying the truly useful program blocks that are
marked as active. Then only the actual execution of the active blocks is scheduled (a bit more dynamically)
and performed, in an order that respects the causality of the program. In the case of declarative
languages, the activation conditions come in the form of a hierarchy of clock undersamplings (the "clock
tree"), obtained through a "clock calculus" computation performed at compile time (see Section 8.4.3).
In the case of imperative formalisms, activation conditions are based on the halting points (where
the control flow can stop between execution instants) and on the signal-generated (sub-)clocks (see
Section 8.3.3).
8.3 Imperative Style: Esterel and SyncCharts
For control-dominated systems, comprising a fair number of (sub-)modes and macro-states with activity
swapping between them, it is natural to employ a description style that is algorithmic and imperative,
describing the changes and progression of control in an explicit flow. In essence, one seeks to represent
hierarchical (Mealy) FSMs, but with some data computation and communication treatment performed
inside states and transitions. Esterel provides this in a textual fashion, while SyncCharts proposes a graphical
counterpart, with visual macro-states. It should be noted that systems here remain finite-state (at least in
their control structure).
8.3.1 Syntax and Structure
Esterel introduces a specific pause construct, used to divide behaviors into successive instants (reactions).
Except at pause statements, control flows through sequential, parallel, and if-then-else constructs,
performing data operations and interprocess signaling. But it stops at pause, memorizing the activity of
that location point for the next execution instant. This provides the needed atomicity mechanism, since
the instant is over when all currently active parallel components reach a pause statement.
The full Esterel language contains a large number of constructs that facilitate modeling, but there exists
a reduced kernel of primitive statements (corresponding to the natural structuring paradigms) from which
all the other constructs can be derived. This is of special interest for model-based approaches, because
only the primitives need to be assigned semantics as transformations in the model space. The semantics of
the primitives are then combined to obtain the semantics of composed statements. Figure 8.2 provides
the list of primitive operators for the data-less subset of Esterel (also called Pure Esterel). A few comments
are in order here:
• In p; q, the reaction where p terminates is the same as the reaction where q starts (control can be
split into reactions only by pause statements inside p or q).
• The loop constructs do not terminate, unless aborted from above. This abortion can be owing to
an external signal received by an abort statement, or to an internal exception raised through the
trap/exit mechanism, or to any of the two (as for the weak abort statement). The body
of a loop statement should not instantly terminate, or else the loop would unroll endlessly in the
same instant, leading to divergence. This is checked by static analysis techniques. Finally, loops are
the only means of defining iterating behaviors (there is no general recursion), so that the system
remains finite-state.
• The present signal-testing primitive allows an else part. This is essential to the expressive power
of the language, and has strong semantic implications pertaining to the synchronous hypothesis.
It is enough to note here that, according to the synchronous hypothesis, signal absence can effectively be
asserted.
• The difference between abort p when S and weak abort p when S is that in the first
case signal S can only come from outside p, and its occurrence prevents p from executing during
the execution instant where S arrives. In the second case, S can also be emitted by p, and the
preemption occurs only after p has completed its execution for the instant.
[p]                              Enforces precedence by parenthesis
pause                            Suspends the execution until next instant
p; q                             Executes p, then q as soon as p terminates
loop p end                       Iterates p forever in sequence
[p || q]                         Executes p and q in parallel, synchronously
signal S in p end                Declares local signal S in p
emit S                           Emits signal S
present S then p else q end      Executes p or q upon S being present or absent
abort p when S                   Executes p until S occurs (exclusive)
weak abort p when S              Executes p until S occurs (inclusive)
suspend p when S                 Executes p unless S occurs
trap T in p end                  Declares/catches exception T in p
exit T                           Raises exception T
FIGURE 8.2 Pure Esterel statements.
Technically speaking, the trap/exit mechanism can emulate the abort statements. But we feel
that the ease of understanding makes the latter worth including in the set of primitives. Similarly,
we shall sometimes use await S as a shorthand for abort loop pause end when S, and
sustain S for loop emit S end.
Most of the data-handling part of the language is deferred to a general-purpose host language (C, C++,
Java, ...). Esterel only declares type names, variable types, and function signatures (which are used as mere
abstract instructions). The actual type specifications and function implementations must be provided and
linked at a later compilation stage.
In addition to the structuring primitives of Figure 8.2, the language contains (and requires) interface
declarations (for signals, most notably), and modular division with submodule invocation. Submodule
instantiation allows signal renaming, that is, transforming virtual name parameters into actual ones (again,
mostly for signals). Rather than providing a full user manual for the language, we shall illustrate most of
these features on an example.
The small example of Figure 8.3 has four input signals and one output signal. Meant to model a cyclic
computation like a communication protocol, the core of our example is the loop that awaits the input I,
emits O, and then awaits J before instantly restarting. The local signal END signals the completion of loop
cycles. When started, the await statement waits for the next clock cycle where its signal is present. The
computation of all the other statements present in our example is performed during a single clock cycle,
so that the await statements are the only places where control can be suspended between reactions (they
preserve the state of the program between cycles). A direct consequence is that the signals I and J must
come in different clock cycles in order not to be discarded.
The loop is preempted by the exception handling statement trap when exit T is executed. In
this case, trap instantly terminates, control is given in sequence, and the program terminates. The
preemption protocol is triggered by the input signal KILL, but the exception T is raised only when END is
emitted. The program is suspended (no computation is performed and the state is kept unchanged) in
clock cycles where the SUSP signal is received. A possible execution trace for our program is given
in Figure 8.4.
module Example:
input I, J, KILL, SUSP;
output O;
suspend
  trap T in                          %exception handler, performs the preemption
    signal END in
      loop                           %basic computation loop
        await I; emit O; await J; emit END
      end
      ||
      %preemption protocol, triggered by KILL
      await KILL; await END; exit T
    end
  end;
when SUSP                            %suspend signal
end module
FIGURE 8.3 A simple Esterel program modeling a cyclic computation (such as a communication protocol) that can
be interrupted between cycles and which can be suspended.
2006 by Taylor & Francis Group, LLC
8-8 Embedded Systems Handbook
Clock   Inputs    Outputs   Comments
0       (any)               All inputs discarded
1       I         O
2       KILL                Preemption protocol triggered
3                           Nothing happens
4       J, SUSP             Suspend, J discarded
5       J                   END emitted, T raised, program terminates
FIGURE 8.4 A possible execution trace for our example.
8.3.2 Semantics
Esterel enjoys a full-fledged formal semantics, in the form of Structural Operational Semantics (SOS)
rules [12]. In fact, there are two main levels of such rules, the coarser describing all potential,
logically consistent behaviors, while the more precise one only selects those that can be obtained in a
constructive way (thereby discarding some programs as unnatural in this respect). This issue can be
introduced with two small examples:
present S then emit S end          present S else emit S end
In the first case, the signal S can logically be assumed to be either present or absent: if assumed present, it will
be emitted, so it will become present; if assumed absent, it will not be emitted. In the second case, following
a similar reasoning, the signal can be neither present nor absent. In both cases, anyhow, the analysis is
done by "guessing" before branching to the potentially validating emissions. While more complex causality
paradoxes can be built using the full language, these two examples already show that the problem stems
from the existence of causality dependencies inside a reaction, prompted by instantaneous sequential
control propagation and signal exchanges. The so-called constructive causality semantics of Esterel checks
precisely that control and signal propagation are well-behaved, so that no guess is required. Programs
that pass this requirement are deemed correct, and they provide deterministic behaviors for whatever
input is presented to the program (which is a desirable feature in embedded system design).
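The "no guessing" discipline can be mimicked with a three-valued status (present, absent, unknown). In the toy checker below (our own, much simplified; `guard` abstracts the condition under which S's only emission fires), a status is accepted only when it is forced by already established facts, and both paradoxical one-liners above are rejected.

```python
UNKNOWN, PRESENT, ABSENT = "unknown", "present", "absent"

def constructive_status(guard, max_rounds=4):
    """Status of a local signal S whose only emission is controlled by
    `guard`, a function from S's current status to True / False / UNKNOWN.
    Returns the constructively established status, or None when the
    program must be rejected (no status can be reached without guessing)."""
    s = UNKNOWN
    for _ in range(max_rounds):
        g = guard(s)
        if g is True:
            new = PRESENT        # the emission must fire: S is present
        elif g is False:
            new = ABSENT         # no emission can fire any more: S is absent
        else:
            return None          # stuck at unknown: not constructive
        if new == s:
            return s
        s = new
    return None                  # no stable status found

# present S then emit S end : S is emitted iff S is already present
paradox1 = constructive_status(
    lambda s: True if s == PRESENT else (False if s == ABSENT else UNKNOWN))
# present S else emit S end : S is emitted iff S is absent
paradox2 = constructive_status(
    lambda s: True if s == ABSENT else (False if s == PRESENT else UNKNOWN))
# emit S : an unconditional emission is perfectly constructive
ok = constructive_status(lambda s: True)
```

Both paradoxes get stuck at the unknown status, while the unconditional emission is resolved to present without any guess.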
8.3.3 Compilation and Compilers
Following the pattern presented in Section 8.2.4, the first compilers for Esterel were based on the
translation of the source into (Mealy) finite automata or into digital synchronous circuits at netlist level.
The generated sequential code was then a compiled automaton or netlist simulator. The automata-based
compilation [7] was used in the first Esterel compilers (known as Esterel V3). Automaton generation
was done here by exhaustive expansion of all reachable states using symbolic execution (all data is kept
uninterpreted). Execution time was then theoretically optimal, but code size could blow up (with the number
of states), and huge code duplication was mandatory for actions that were performed in several different
states. The netlist-based compilation (Esterel V5) is based on a quasi-linear, structural Esterel-to-circuits
translation scheme [21] that ensures the tractability of compilation even for the largest examples. The
drawback of the method is the reaction time (the simulation time for the generated netlist), which increases
linearly with the size of the program.
Apart from these two compilation schemes, which have matured into full industrial-strength compilers,
several attempts have been made to develop a more efficient, basically event-based type of compilation that
follows more readily the naive execution path and control propagation inside each reaction, and in
particular executes as much as possible only the truly active parts of the program.¹ Here, we mention three
¹Recall that this is a real issue in Esterel, since programs may contain reactions to the absence of signals, and determining
this absence may require checking that no emission remains possible in the potential behaviors, whatever feasible test
[Figure: (a) hierarchical state representation, with Loop, Seq, Par, Signal, Trap, and Suspend nodes; (b) concurrent control-flow graph, from program activation and program start.]
FIGURE 8.5 GRC intermediate representation for our Esterel example.
such approaches: the Saxo compiler of Closse and co-workers [22], the EC compiler of Edwards [23],
and the GRC2C compiler of Potop-Butucaru and de Simone [24]. All of them are structured around
flowgraph-based intermediate representations that are easily translated into well-structured sequential
code. The different intermediate representations also account for the differences between approaches, by determining
which Esterel programs can be represented, and what optimization and code generation techniques
can be applied.
We exemplify with the GRC2C compiler [24], which is structured around the GRC intermediate form.
The GRC representation of our example, given in Figure 8.5, uses two graph-based structures, a
hierarchical state representation (HSR) and a concurrent control-flow graph (CCFG), to preserve most of
the structural information of the Esterel program while making the control flow explicit with a few graph-
building primitive nodes. The HSR is an abstraction of the syntax tree of the initial Esterel program.
It can be seen as a structured data memory that preserves state information across reactions. During
each instant, a set of activation conditions (clocks) is computed from this memory state, to drive the
execution toward active instructions. The CCFG represents, in an operational fashion, the computation of
an instant (the transition function). During each reaction, the dynamic CCFG operates on the static HSR
by marking/unmarking component nodes (subtrees) with "active" tags as they are activated or deactivated
by the semantics.
For instance, when we start our small example (Figure 8.3 and Figure 8.5), the program start (1) and program (0) HSR nodes are active, while all the statements of the program (and the associated HSR nodes) are not. Like in any instant, control enters the CCFG by the topmost node and uses the first state decoding node (labeled 0) to read the state of the HSR and branch to the start behavior, which sets the program start (1) indicator to inactive (with exit 1), and activates await I and await KILL (with enter 8 and enter 11).
branches could be taken. To achieve this goal at a reasonable computational price, current compilers require, in fact, additional restrictions, in essence the acyclicity of the dependency/causality graph at some representation level. Acyclicity ensures constructiveness, because any topological order of the operations in the graph gives an execution order which is correct for all instants.
The HSR also serves as a repository for tags, which record redundancies between various activation clocks and are used by the optimization and code generation algorithms. One such tag is #, which tells that at most one child of the tagged node can retain control between reactions at a time (the activation clocks of the branches are exclusive). Other tags (not figured here) are computed through complex static analysis of both the HSR and CCFG. The tags allow efficient optimization and sequential code generation.
The CCFG is obtained by making the control flow of the Esterel program explicit (a structural, quasi-linear translation process).² Usually, it can be highly optimized using classical compiler techniques and some methods derived from circuit optimization, both driven by the HSR tags computed by static analysis. Code generation from a GRC representation is done by encoding the state on sequential variables, and by scheduling the CCFG operators using classical compilation techniques [25].
The Saxo compiler of Closse and co-workers [22] uses a discrete-event interpretation of Esterel to generate a compiled event-driven simulator. The compiler flow is similar to that of VeriSUIF [26], but Esterel's synchronous semantics are used to greatly simplify the approach. An event graph intermediate representation is used here to split the program into a list of guarded procedures. The guards intuitively correspond to events that trigger computation. At each clock cycle, the simulation engine traverses the list once, from the beginning to the end, and executes the procedures with an active guard. The execution of a procedure may modify the guards for the current cycle and for the next cycle. The resulting code is slower than its GRC2C-generated counterpart for two reasons: first, it does not exploit the hierarchy of exclusion relations determined by switching statements like the tests. Second, optimization is less effective because the program hierarchy is lost when the state is (very redundantly) encoded using guards.
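The guarded-procedure scheme can be sketched in a few lines of Python. This is a minimal illustration of the traversal described above, not actual Saxo output: the procedure names and the two-procedure toy program are invented for the example.

```python
# A minimal sketch of a Saxo-style compiled event-driven simulator: the
# program is split into a list of guarded procedures traversed once per
# cycle; a body runs only when its guard is active, and it may activate
# guards for the current cycle or for the next one.

class GuardedProc:
    def __init__(self, name, guard, body):
        self.name = name
        self.guard = guard          # guard that must be active to run the body
        self.body = body            # body(now, nxt): may activate guards in the
                                    # current cycle (now) or the next one (nxt)

def reaction(procs, now):
    """One clock cycle: traverse the list once, from beginning to end."""
    nxt = set()
    for p in procs:
        if p.guard in now:          # execute only procedures with an active guard
            p.body(now, nxt)
    return nxt                      # guards activated for the next cycle

# Toy program: 'start' arms a one-cycle delay, which then emits 'O'.
procs = [
    GuardedProc("p_start", "start", lambda now, nxt: nxt.add("delay_done")),
    GuardedProc("p_emit", "delay_done", lambda now, nxt: now.add("O")),
]

cycle1 = {"start"}
cycle2 = reaction(procs, cycle1)    # cycle 1 arms 'delay_done' for cycle 2
reaction(procs, cycle2)             # cycle 2 adds the emitted 'O' to cycle2
```

Note how the fixed start-to-end traversal order stands in for a static schedule of the guarded procedures, which is exactly what makes the state encoding redundant compared with GRC.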
The EC compiler of Edwards [23] treats Esterel as having control-flow semantics (in the spirit of [25,27]) in order to take advantage of the initial program hierarchy and produce efficient, well-structured C code. The Esterel program is first translated into a CCFG representing the computation of a reaction. The translation makes the control flow explicit and encodes the state access operations using tests and assignments of integer variables. Its static scheduling algorithm takes advantage of the mutual exclusions between parts of the program and generates code that uses program counter variables instead of simple Boolean guards. The result is therefore faster than its Saxo-generated counterpart. However, it is usually slower than the GRC2C-generated code because the GRC representation preserves the state structure of the initial Esterel program and uses static analysis techniques to determine redundancies in the activation pattern. Thus, it is able to better simplify the final state representation and the CCFG.
8.3.4 Analysis/Verification/Test Generation: Benefits from Formal Approaches
We claimed that the introduction of well-chosen structuring primitives, endowed with formal mathematical semantics and interpretations as well-defined transformations in the realms of Mealy machines and synchronous circuits, was instrumental in allowing powerful analysis and synthesis techniques as part of the design of synchronous programs. What are they, and how do they appear in practice to enhance the confidence in the correctness of safety-critical embedded applications?
Maybe the most obvious is that synchronous formalisms can fully benefit from the model-checking and automatic verification usually associated with the netlist and Mealy machine representations, and now widely popular in the hardware design community with the PSL/SuGaR and assertion-based design approaches. Symbolic BDD- and SAT-based model-checking techniques are thus available on all S/R systems. Moreover, the structured syntax allows in many cases the introduction of modular approaches, or can guide abstraction techniques with the goal of reducing the complexity of analysis.
Formal methods akin to model-checking can also be used to automatically produce test sequences that seek to reach the best possible coverage in terms of visited states or exercised transitions. Here again, specific techniques were developed to match the S/R models.
² Such a process is necessary, because most Esterel statements pack together two distinct, and often disjoint, behaviors: one for the execution instants where they are started, and one for instants where control is resumed from inside.
Also, symbolic representations of the reachable state spaces (or abstracted over-approximations), which can effectively be produced and certified correct thanks to the formal semantics, can be used in the course of compilation and optimization. In particular for Esterel, the RSS computation allows more programs to be accepted as correct with respect to constructiveness: indeed, causal dependencies may vary in direction depending on the state. If all dependencies are put together regardless of the states, then a causality cycle may appear, while not all components of the cycle may be active at the same instant, so that no real cycle exists (but it takes a dynamic analysis to establish this). Similarly, the RSS may exhibit combinatorial relations between registers encoding the local states, so that register elimination is possible to further simplify the state space structure.
Finally, the domain-specific structuring primitives empowering dedicated programming can also be seen as an important criterion. Readable, easily understandable programs are a big step toward correct programs. And when issues of correctness are not so plain and easy, as for instance when regarding the proper scheduling of behaviors inside a reaction to respect causal effects, then powerful abstract hypotheses are defined in the S/R domain that define admissible orderings (and build them for correct programs).
A graphical version of Esterel, named SyncCharts for synchronous StateCharts, has been defined to provide a visual formalism with a truly synchronous semantics.
8.4 The Declarative Style: Lustre and Signal
Declarative formalisms implementing the synchronous hypothesis as defined in Section 8.2 can be cast into a model of computation (proposed in Reference 28) consisting of a domain of traces/behaviors and of a semilattice structure that renders the synchronous hypothesis using a timing equivalence relation: clock equivalence. Asynchrony can be superimposed on this model by considering a flow equivalence relation. Heterogeneous systems [29] can also be modeled by parameterizing the composition operator using arbitrary timing relations.
8.4.1 A Synchronous Model of Computation
We consider a partially ordered set of tags t to denote instants (which are seen, in the sense of Section 8.2.2, as symbolic periods in time during which one reaction takes place). The relation t1 ≤ t2 says that t1 occurs before t2. A minimum tag exists, denoted by 0. A totally ordered set of tags C is called a chain and denotes the sampling of a possibly continuous or dense signal over a countable series of causally related tags.
Events, signals, behaviors, and processes are defined as follows:
- An event e is a pair consisting of a value v and a tag t.
- A signal s is a function from a chain of tags to a set of values.
- A behavior b is a function from a set of names x to signals.
- A process p is a set of behaviors that have the same domain.
In the remainder, we write tags(s) for the tags of a signal s, vars(b) for the domain of b, b|X for the projection of a behavior b on a set of names X, and b/X for its complementary. Figure 8.6 depicts a behavior b over three signals named x, y, and z. Two frames depict timing domains formalized by chains of tags. Signals x and y belong to the same timing domain: x is a down-sampling of y. Its events are synchronous to odd occurrences of events along y and share the same tags, for example, t1. Even tags of y, for example, t2, are ordered along its chain, for example, t1 < t2, but absent from x. Signal z belongs to a different timing domain. Its tags, for example, t3, are not ordered with respect to the chain of y, for example, t1 ≰ t3 and t3 ≰ t1.
The synchronous composition of the processes p and q is denoted by p || q. It is defined by the union b ∪ c of all behaviors b (from p) and c (from q) that hold the same values at the same tags, b|I = c|I, for each signal x ∈ I = vars(b) ∩ vars(c) they share. Figure 8.7 depicts the synchronous composition, right, of the behavior b, left, and the behavior c, middle. The signal y, shared by b and c, carries the same tags and the same values in both b and c. Hence, b ∪ c defines the synchronous composition of b and c.
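The composability check above can be sketched concretely. In this minimal Python model (an illustration of the tagged model, not an implementation from the chapter), a signal is a dict from tags to values and a behavior a dict from names to signals; the concrete tags and values are invented for the example.

```python
# A minimal sketch of synchronous composition in the tagged model: b || c
# is defined only when b and c coincide (same tags, same values) on their
# shared signal names, in which case it is the union of the two behaviors.

def compose(b, c):
    """Return the union behavior, or None when b|I != c|I on shared names."""
    shared = set(b) & set(c)
    for x in shared:
        if b[x] != c[x]:            # same tags and same values required
            return None
    return {**b, **c}

b = {"x": {1: 0}, "y": {1: 0, 2: 1}}           # x down-samples y
c = {"y": {1: 0, 2: 1}, "z": {1: 5, 2: 6}}     # shares y with b
bc = compose(b, c)                              # union of b and c
bad = compose(b, {"y": {1: 9}})                 # disagrees on y: undefined
```

The `None` result plays the role of an empty composition: behaviors that disagree on a shared signal contribute nothing to p || q.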
FIGURE 8.6 A behavior (named b) over three signals (x, y, and z) belonging to two clock domains.
FIGURE 8.7 Synchronous composition of b ∈ p and c ∈ q.
FIGURE 8.8 Scheduling relations between simultaneous events.
FIGURE 8.9 Relating synchronous behaviors by stretching.
A scheduling structure is defined to schedule the occurrence of events along signals during an instant t. A scheduling is a preorder relation between dates x_t, where t represents the time and x the location of the event. Figure 8.8 depicts such a relation, superimposed on the signals x and y of Figure 8.6. The relation y_t1 → x_t1, for instance, requires y to be calculated before x at the instant t1. Naturally, scheduling is contained in time: if t < t′ then x_t →_b x_t′ for any x and b, and if x_t →_b x_t′ then t′ ≮ t.
A synchronous structure is defined by a semilattice structure to denote behaviors that have the same timing structure. The intuition behind this relation (depicted in Figure 8.9) is to consider a signal as an elastic with ordered marks on it (tags). If the elastic is stretched, marks remain in the same relative and partial order but have more space (time) between each other. The same holds for a set of elastics: a behavior. If elastics are equally stretched, the order between marks is unchanged. In Figure 8.9, the timescale of x and y changes but the partial timing and scheduling relations are preserved. Stretching is a partial-order relation which defines clock equivalence. Formally, a behavior c is a stretching of b of the same domain, written b ≤ c, if there exists an increasing bijection on tags f that preserves the timing and scheduling relations. If so, c is the image of b by f. Last, the behaviors b and c are called clock-equivalent, written b ~ c, iff there exists a behavior d such that d ≤ b and d ≤ c.
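The stretching relation can be made concrete under a simplifying assumption. The Python sketch below assumes numeric, totally ordered tags (a restriction of the general model, made for illustration): sorting all tags of b and of c and pairing them up yields the only candidate increasing bijection f, which must map every signal of b, value for value, onto the same signal of c.

```python
# A sketch of the stretching check b <= c, assuming numeric tags: build the
# candidate increasing bijection f from sorted tags of b onto sorted tags
# of c, then verify it carries each signal of b onto the same signal of c.

def is_stretching(b, c):
    if set(b) != set(c):                          # same domain of names
        return False
    tags_b = sorted({t for s in b.values() for t in s})
    tags_c = sorted({t for s in c.values() for t in s})
    if len(tags_b) != len(tags_c):
        return False
    f = dict(zip(tags_b, tags_c))                 # increasing bijection on tags
    return all({f[t]: v for t, v in b[x].items()} == c[x] for x in b)

b = {"x": {1: 0, 3: 1}, "y": {1: 0, 2: 1, 3: 2}}
c = {"x": {10: 0, 30: 1}, "y": {10: 0, 25: 1, 30: 2}}   # b, "stretched"
```

Here c keeps the marks of b in the same relative order, only with more space between them, so b ≤ c holds; permuting any values breaks the relation.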
8.4.2 Declarative Design Languages
The declarative design languages Lustre [9] and Signal [10] share the core syntax of Figure 8.10 and
can both be expressed within the synchronous model of computation of Section 8.4.1. In both languages,
FIGURE 8.10 A common syntactic core for Lustre and Signal.
FIGURE 8.11 The if-then-else condition in Lustre.
node counter (tick, reset: bool) returns (count: int);
let
  count = if true -> reset
          then 0
          else if tick then pre count + 1
          else pre count;
tel
FIGURE 8.12 A resettable counter in Lustre.
a process P is an infinite loop that consists of the synchronous composition P || Q of simultaneous equations x = y f z over signals named x, y, and z. Both Lustre and Signal support the restriction of a signal name x to a process P, noted P/x. The analogy stops here, as Lustre and Signal differ in fundamental ways. Lustre is a single-clocked programming language, while Signal is a multi-clocked (polychronous) specification formalism. This difference originates in the choice of different primitive combinators (named f in Figure 8.10) and results in orthogonal system design methodologies.
8.4.2.1 Combinators for Lustre
In a Lustre process, each equation processes the nth event of each input signal during the nth reaction (to possibly produce an output event). As it synchronizes upon availability of all inputs, the timing structure of a Lustre program is easily captured within a single clock domain: all input events are related to a master clock and the clock of the output signals is defined by sampling the master. There are three fundamental combinators in Lustre:
Delay. x = pre y initially leaves x undefined and then defines it by the previous value of y.
Followed-by. x = v -> z initially defines x by the value v, and then by z. The pre and -> operators are usually used together, as in x = v -> pre(y), to define a signal x initialized to v and defined by the previous value of y. Scade, the commercial version of Lustre, uses a one-bit analysis to check that each signal defined by a pre is effectively initialized by an ->.
Conditional. x = if b then y else z defines x by y if b is true and by z if b is false. It can be used without an alternative, x = if b then y, to sample y at the clock b, as shown in Figure 8.11.
Lustre programs are structured as data-flow functions, also called nodes. A node takes a number of input signals and defines a number of output signals upon the presence of an activation condition. If that condition matches an edge of the input signal clock, then the node is activated and possibly produces output. Otherwise, outputs are undetermined or defaulted. As an example, Figure 8.12 defines a resettable counter. It takes an input signal tick and returns the count of its occurrences. A boolean reset signal can be triggered to reset the count to 0. We observe that the boolean input signals tick and reset are synchronous to the output signal count and define a data-flow function.
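The single-clocked semantics of the counter can be illustrated by a stream-level simulation. The following Python sketch (an illustration of the semantics, not generated Lustre code) assumes boolean input streams of equal length: "true -> reset" is true at the first instant and equals reset afterward, and "pre count" is the value of count at the previous instant.

```python
# Stream-level sketch of the resettable counter of Figure 8.12: one output
# value is computed per reaction, from the current inputs and "pre count".

def counter(ticks, resets):
    out, pre_count = [], None              # "pre count" is undefined initially
    for n, (tick, reset) in enumerate(zip(ticks, resets)):
        if n == 0 or reset:                # true -> reset
            count = 0
        elif tick:
            count = pre_count + 1
        else:
            count = pre_count
        out.append(count)
        pre_count = count
    return out

trace = counter([True, True, False, True, True],
                [False, False, False, True, False])
# the count restarts at the fourth instant, where reset holds
```

Because every equation consumes one event of every input per reaction, the whole node runs on a single clock, exactly as described above.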
FIGURE 8.13 The delay operator in Signal.
FIGURE 8.14 The merge operator in Signal.
process counter = (? event tick, reset ! integer value)
(| value := (0 when reset)
default ((value$ init 0 + 1) when tick)
default (value$ init 0)
|);
FIGURE 8.15 A resettable counter in Signal.
8.4.2.2 Combinators for Signal
As opposed to nodes in Lustre, equations x := y f z in Signal more generally denote processes that define timing relations between input and output signals. There are three primitive combinators in Signal:
Delay. x := y$1 init v initially defines the signal x by the value v and then by the previous value of the signal y. The signal y and its delayed copy x := y$1 init v are synchronous: they share the same set of tags t1, t2, .... Initially (at t1), the signal x takes the declared value v. At tag tn, x takes the value of y at tag tn−1. This is displayed in Figure 8.13.
Sampling. x := y when z defines x by y when z is true (and both y and z are present); x is present with the value v2 at t2 only if y is present with v2 at t2 and if z is present at t2 with the value true. When this is the case, one needs to schedule the calculation of y and z before x, as depicted by y_t2 → x_t2 ← z_t2.
Merge. x := y default z defines x by y when y is present and by z otherwise. If y is absent and z present with v1 at t1 then x holds (t1, v1). If y is present (at t2 or t3) then x holds its value whether z is present (at t2) or not (at t3). This is depicted in Figure 8.14.
The structuring element of a Signal specification is a process. A process accepts input signals originating from possibly different clock domains to produce output signals when needed. Recalling the example of the resettable counter (Figure 8.12), this allows, for instance, to specify a counter (pictured in Figure 8.15) where the inputs tick and reset and the output value have independent clocks. The body of counter consists of one equation that defines the output signal value. Upon the event reset, it sets the count to 0. Otherwise, upon a tick event, it increments the count by referring to the previous value of value and adding 1 to it. Otherwise, if the count is solicited in the context of the counter process (meaning that its clock is active), the counter just returns the previous count without having to obtain a value from the tick and reset signals.
A Signal process is a structuring element akin to a hierarchical block diagram. A process may structurally contain sub-processes. A process is a generic structuring element that can be specialized to the timing context of its call. For instance, a definition of the Lustre counter (Figure 8.12) starting from the specification of Figure 8.15 consists of the refinement depicted in Figure 8.16. The input tick and reset clocks expected by the process counter are sampled from the boolean input signals tick and reset
process synccounter = (? boolean tick, reset ! integer value)
  (| value := counter (when tick, when reset)
   | reset ^= tick ^= value
   |);
FIGURE 8.16 Synchronization of the counter interface.
FIGURE 8.17 The syntax of clock expressions and clock relations (equations).
FIGURE 8.18 The clock inference system of Signal.
by using the when tick and when reset expressions. The count is then synchronized to the inputs by the equation reset ^= tick ^= count.
8.4.3 Compilation of Declarative Formalisms
The analysis and code generation techniques of Lustre and Signal are necessarily different, tailored to handle the specific challenges posed by their different models of computation and programming paradigms.
8.4.3.1 Compilation of Signal
Sequential code generation starting from a Signal specification begins with an analysis of its implicit synchronization and scheduling relations. This analysis yields the control- and data-flow graphs that define the class of sequentially executable specifications and allow code to be generated.
8.4.3.1.1 Synchronization and Scheduling Analysis
In Signal, the clock ^x of a signal x denotes the set of instants at which the signal x is present. It is represented by a signal that is true when x is present and that is absent otherwise. Clock expressions (see Figure 8.17) represent control. The clock when x (respectively when not x) represents the time tags at which a boolean signal x is present and true (respectively false). The empty clock is denoted by 0. Clock expressions are obtained using conjunction, disjunction, and symmetric difference over other clocks. Clock equations (also called clock relations) are Signal processes: the equation e ^= e′ synchronizes the clock expressions e and e′. An ω-regular expression of the form U V^ω denotes an infinite repetition of the expression V after a finite prefix U. These are used for specifying properties of systems that do not terminate.
10.3 Languages for Hardware Verification
In this section, we focus on verification languages for hardware RTL designs. (Languages for system-level hardware designs are covered in Section 10.5.)
10.3.1 HDLs and Interfaces to Programming Languages
Register Transfer Level (RTL) designs are typically implemented in standard HDLs, such as Verilog and VHDL. However, it is not practical to implement simulation testbenches in HDLs alone. Indeed, the testbench can be purely behavioral, that is, it need not be synthesizable into hardware. Furthermore, it can be implemented at higher levels than RTL. Historically, a popular testbench development approach has been to implement some of its parts in a software programming language, such as C/C++ or Perl. These parts are integrated with HDL simulators through standard programming language interfaces. Unfortunately, this approach not only slows down the simulator, but it also requires significant development effort, such as defining new data types (e.g., a 128-bit bus), handling concurrency, dynamic memory objects, etc. [7].
For property specification, VHDL has some support for static assertions, with different severity levels. Although Verilog lacks such explicit constructs, it is straightforward to use the if and $display constructs to implement a similar effect.
Example 10.1 Suppose we need to check that two signals A and B cannot be high at the same time. The fragment below shows an assertion template and an instance in VHDL. The keyword assert specifies a property that must hold during simulation.

[label] assert expression
  [report message]
  [severity level]

assert (A nand B)
  report "error: A & B cannot both be 1"
  severity error;

A similar effect can be achieved by using Verilog's if/$display combination, which specifies the undesired situation, as shown below:

always @(A or B) begin
  if (A & B) begin
    $display("error: A = B = 1");
    $finish; // end simulation
  end
end
Verification Languages 10-7
10.3.2 Open Verification Library
Though simple assertions are quite useful, they do not provide a practical way for specifying properties. Temporal properties can be specified using checkers or monitors in HDLs, but this involves a significant development effort. Therefore, there has been a great deal of interest in developing a library of reusable monitors. An example is the Open Verification Library (OVL), available from Accellera [3]. It is not a standalone language, but a set of modules that can be used to check common temporal specifications within Verilog or VHDL design descriptions.
Example 10.2 As an example [17], consider the following PCI Local Bus Specification requirement: to prevent AD, C/BE#, and PAR signals from floating during reset, the central resource may drive these lines during reset (bus parking) but only to a logic low level; they may not be driven high.
Suppose ad, cbe_, par, rst_ are the Verilog signal names corresponding to AD, C/BE#, PAR, and RST#, respectively.²
The given property can be specified within a Verilog implementation of OVL as follows:

assert_always master_reset (clk, !rst_, !(|{ad, cbe_, par}));

Here, assert_always is the name of the library monitor, and master_reset is an assertion instance. On every rising edge of the clk signal, whenever !rst_ is high, the monitor asserts that the last parameter should evaluate to true. Here the last parameter is the negation (!) of the bitwise-or (|) over the given bits. Note that this is a simple safety property, which is expected to always hold during simulation.
Example 10.3 Consider another example [17] from the PCI Local Bus Specification: the assertion of IRDY# and deassertion of FRAME# should occur as soon as possible after STOP# is asserted, preferably within one to three cycles.
For simplicity, consider the FRAME# and STOP# signals only, that is, check whether frame_ will be de-asserted within one to three clock cycles after stop_ is asserted, as shown below:

assert_frame #(1, 1, 3) check_frame_da (clk, true, !stop_, frame_);

Again, check_frame_da is an instance of the assert_frame module defined in the library. Three optional parameters #(1, 1, 3) are used, corresponding to the severity level, minimum number of clocks, and maximum number of clocks, respectively. The monitor will check whether frame_ goes high within one to three clock cycles after !stop_. Here, the severity level is set to 1 to continue the simulation even if the assertion is violated, and the reset parameter is set to true, as an example of where it is not needed.
The Open Verification Library has many advantages. First, it can be used with any Verilog, VHDL, or mixed simulator, with no need for additional verification tools. Second, it is open, that is, the library can be modified easily, for example, for assessing functional coverage [17]. Another useful feature of OVL is that it does not slow down simulation, primarily because it is hard to specify very complex assertions. Unfortunately, OVL is used mainly for checking safety properties during simulation, and is not very useful for checking liveness or for formal verification. In some sense, OVL provides a transition from a traditional simulation-based methodology to an assertion-based methodology.
10.3.3 Temporal e
Verisity's e language is an advanced verification language that is intended to cover many verification aspects. It has recently been chosen by the IEEE DASC [4] for standardization as a verification language. Like many high-level languages, it has constructs for Object-Oriented Programming, such as class definitions, inheritance, and polymorphism [7]. It also provides elements of Aspect-Oriented Programming (AOP). AOP allows modifying the functionality of the environment without duplicating or modifying the original code, in a manner more advanced than simple inheritance (see References 7 and 24 for more details).
² In the PCI Local Bus Specification, signal names ending with # indicate that the signal is active low. In the examples, we use _ for the same purpose.
As a testbench language, e provides many constructs related to stimuli generation, such as specification of input constraints and facilities for data packing, as well as for assessing simulation coverage. It also provides support for property specification, and has been used widely both in simulation-based and formal verification.
Example 10.4 As an example of stimuli generation, suppose we have a struct type³ frame for modeling an Ethernet frame, with one of the data fields defined as %payload. The type of payload can be defined in e as follows [24]:

struct payload {
  %id : byte;
  %data : list of byte;
  keep soft data.size() in [45..1499];
}
In this example, the % character in front of the field name means that the corresponding field is physical and represents data to be sent to the DUV. The keep soft keywords are used for bounding the values of the variable, the size of the data field in this case. It also allows specification of weighted ranges or constraints. In the example, the size will be varied automatically within the given range. (Using the ! character along with % would have indicated that the field would not be generated automatically.)
Typically, a user-defined function is used for driving stimuli to the DUV. For example, suppose my_frame is an instance of the struct frame. The following e code can be used to input the frame data serially into the DUV:
Example 10.5

var bitList: list of bit;
bitList = pack(packing.low, my_frame);
for each (b) in bitList {
  testbench.duv.transmit_stream0 = b;
  wait cycle;
};

In this example, the keyword pack provides the mechanism to pack all data fields into a single list of bits, which is then fed serially to the Verilog signal testbench.duv.transmit_stream0. After each bit transfer, the function waits for one clock, denoted by the wait cycle keywords.
Support for the specification of temporal properties is provided in e through the use of Temporal Expressions (TEs). A TE is defined as a combination of events and temporal operators. The language also supports the keyword sync, which is used as a point of synchronization for TEs.
Example 10.6 Returning to the PCI specifications, consider the following requirement, and its corresponding specification in e: Once a master has asserted IRDY#, it cannot change IRDY# or FRAME# until the current data phase completes regardless of the state of TRDY#.
expect @new_irdy => {
  [..]*((not @irdy_rise) and (not change(frame_)));
  @data_phase_complete} @sys.pci_clk;
else
  dut_error("Error, IRDY# or FRAME# changed",
            "before current data phase completed.");
Here, suppose that the events (shown as @event) have been defined already. The shown expression specifies that whenever IRDY# is asserted (@new_irdy), de-assertion of IRDY# (@irdy_rise) or a change in FRAME# should not occur, until the data phase completes (@data_phase_complete). The use of
³ A struct type basically corresponds to a class type in C++, that is, it allows method definitions along with data definitions. Since it is conceptually similar to other object-oriented languages, we omit the actual syntax.
@sys.pci_clk denotes that the event pci_clk is used for sampling signals in evaluating the given TE. This
feature is also useful for verifying multiple clocked designs.
10.3.4 OpenVera and OVA
OpenVera from Synopsys is another testbench language, similar to e in terms of functionality and similar to C++ in terms of syntax. Since conceptually OpenVera is very similar to e, we do not include testbench examples for OpenVera here. It has similar constructs for coverage, random stimuli generation, data packing, etc.
OpenVera Assertions (OVA) is a standalone language, which is also part of the OpenVera suite [25]. OpenVera comes with a checker library (OVA IP), which is similar to OVL. OVA and OpenVera also have event definitions, repetition operators (
∑_{i=1}^{n} C_i / T_i ≤ n(2^{1/n} − 1)        (12.1)

where C_i and T_i represent the worst-case computation time and the period of task i, respectively.
The quantity

U = ∑_{i=1}^{n} C_i / T_i

represents the processor utilization factor and denotes the fraction of time used by the processor to execute the entire task set. Table 12.1 shows the values of n(2^{1/n} − 1) for n from 1 to 10. As can be seen, the factor decreases with n and, for large n, it tends to the following limit value:

lim_{n→∞} n(2^{1/n} − 1) = ln 2 ≈ 0.69
TABLE 12.1 Maximum Processor Utilization for the RM Algorithm

 n    U_lub
 1    1.000
 2    0.828
 3    0.780
 4    0.757
 5    0.743
 6    0.735
 7    0.729
 8    0.724
 9    0.721
10    0.718
Real-Time Operating Systems 12-5
We note that the test by Liu and Layland only gives a sufficient condition for guaranteeing a feasible
schedule under the RM algorithm. Hence, a task set can be schedulable by RM even though the utilization
condition is not satisfied. Nevertheless, we can certainly state that a periodic task set cannot be feasibly
scheduled by any algorithm if U > 1. A statistical study carried out by Lehoczky et al. [6] on randomly
generated task sets showed that the utilization bound of the RM algorithm has an average value of 0.88,
and becomes 1 for periodic tasks with harmonic period relations. Necessary and sufficient schedulability
tests for RM have been proposed [6,10,11,29], but they have pseudo-polynomial complexity. Recently,
Bini and Buttazzo derived a sufficient polynomial-time test, the Hyperbolic Bound [28], capable of
accepting more tasks than the Liu and Layland test. In spite of the limitation on the schedulability bound,
which in most cases prevents full processor utilization, the RM algorithm is widely used in real-time
applications, mainly for its simplicity. At the same time, being a static scheduling algorithm, it can be
easily implemented on top of commercial operating systems, using a set of fixed priority levels. Moreover,
in overload conditions, the highest-priority tasks are less prone to missing their deadlines. For all these
reasons, the Software Engineering Institute of Pittsburgh has prepared a user guide for the design
and analysis of real-time systems based on the RM algorithm [7]. Since the RM algorithm is optimal
among all fixed-priority assignments, the schedulability bound can only be improved through a dynamic
priority assignment.
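The two sufficient tests discussed above can be sketched in a few lines (an illustrative reimplementation, not code from the handbook; each task set is a list of (C_i, T_i) pairs):

```python
def liu_layland_test(tasks):
    """Sufficient RM test (Eq. 12.1): sum of C_i/T_i <= n(2^(1/n) - 1)."""
    n = len(tasks)
    utilization = sum(c / t for c, t in tasks)
    return utilization <= n * (2 ** (1 / n) - 1)

def hyperbolic_bound_test(tasks):
    """Bini-Buttazzo sufficient RM test [28]: product of (C_i/T_i + 1) <= 2."""
    product = 1.0
    for c, t in tasks:
        product *= c / t + 1
    return product <= 2.0

# Both tests accept this set (U = 0.75 <= 0.828):
print(liu_layland_test([(1, 4), (3, 6)]))       # True
print(hyperbolic_bound_test([(1, 4), (3, 6)]))  # True
# The hyperbolic bound accepts sets that the utilization bound rejects:
print(liu_layland_test([(1, 2), (1, 3)]))       # False (U = 0.833 > 0.828)
print(hyperbolic_bound_test([(1, 2), (1, 3)]))  # True  (1.5 * 4/3 = 2)
```

The last task set illustrates why the Hyperbolic Bound is less pessimistic: its utilization exceeds the Liu and Layland bound for n = 2, yet the product form still certifies it.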
12.2.3 Earliest Deadline First
The earliest deadline first (EDF) algorithm entails selecting (among the ready tasks) the task with the
earliest absolute deadline. The EDF algorithm is typically preemptive, in the sense that a newly arrived
task can preempt the running task if its absolute deadline is earlier. If the operating system does not
support explicit timing constraints, EDF (like RM) can be implemented on a priority-based kernel, where
priorities are dynamically assigned to tasks. A task receives the highest priority if its deadline is the
earliest among those of the ready tasks, and the lowest priority if its deadline is the latest one; that is,
a task gets a priority that is inversely proportional to its absolute deadline. The EDF algorithm is
more general than RM, since it can be used to schedule both periodic and aperiodic task sets, because
the selection of a task is based on the value of its absolute deadline, which can be defined for both types
of tasks. Typically, a periodic task that has completed its execution is suspended by the kernel until its next
release, coincident with the end of the current period. Dertouzos [8] showed that EDF is optimal among
all online algorithms, while Liu and Layland [5] proved that a set Γ = {τ_1, . . . , τ_n} of n periodic tasks is
schedulable by EDF if and only if

\sum_{i=1}^{n} \frac{C_i}{T_i} \le 1

It is worth noting that the EDF schedulability condition is necessary and sufficient to guarantee
a feasible schedule. This means that, if it is not satisfied, no algorithm is able to produce a feasible
schedule for that task set.
The dynamic priority assignment allows EDF to fully exploit the processor, reaching up to 100%
utilization. When the utilization factor is less than one, the residual fraction of time can be efficiently
used to handle aperiodic requests activated by external events. In addition, compared with RM, EDF
generates a lower number of context switches, thus causing less runtime overhead. On the other hand,
RM is simpler to implement on a fixed-priority kernel and is more predictable in overload situations,
because higher-priority tasks are less likely to miss their deadlines.
12.2.4 Tasks with Deadlines Less than Periods
Using RM or EDF, a periodic task can be executed at any time during its period. The only guarantee
provided by the schedulability test is that each task will be able to complete its execution before the next
release time. In some real-time applications, however, some periodic tasks need to complete
within an interval shorter than their period. The deadline monotonic (DM) algorithm, proposed by Leung and
Whitehead [9], extends RM to handle tasks with a relative deadline less than or equal to their period.
According to DM, at each instant the processor is assigned to the task with the shortest relative deadline.
In priority-based kernels, this is equivalent to assigning each task a priority P_i inversely proportional
to its relative deadline D_i. With D_i fixed for each task, DM is classified as a static scheduling algorithm.
In recent years, several authors [6,10,11] independently proposed a necessary and sufficient test to
verify the schedulability of a periodic task set. For example, the method proposed by Audsley et al. [10]
involves computing the worst-case response time R_i of each periodic task. It is derived by summing its
computation time and the interference caused by tasks with higher priority:

R_i = C_i + \sum_{k \in hp(i)} \left\lceil \frac{R_i}{T_k} \right\rceil C_k \qquad (12.2)

where hp(i) denotes the set of tasks having priority higher than task i and \lceil x \rceil denotes the ceiling of
a rational number, that is, the smallest integer greater than or equal to x. The equation above can be solved
by an iterative approach, starting with R_i(0) = C_i and terminating when R_i(s) = R_i(s-1). If R_i(s) > D_i
for some task, then the task set cannot be feasibly scheduled by DM.

Under EDF, the schedulability analysis for periodic task sets with deadlines less than periods is based
on the processor demand criterion, proposed by Baruah et al. [12]. According to this method, a task set
is schedulable by EDF if and only if, in every interval of length L (starting at time 0), the overall
computational demand is no greater than the available processing time, that is, if and only if

\forall L > 0, \quad \sum_{i=1}^{n} \left\lfloor \frac{L + T_i - D_i}{T_i} \right\rfloor C_i \le L \qquad (12.3)

This test is feasible, because L needs to be checked only for values equal to task deadlines no larger than
the least common multiple of the periods. A detailed analysis of EDF has been presented by Stankovic,
Ramamritham, Spuri, and Buttazzo [30] under several workload conditions.
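Both tests can be sketched compactly (an illustrative reimplementation, not code from the chapter; tasks are (C_i, T_i, D_i) triples with integer parameters and D_i <= T_i):

```python
import math

def response_time(tasks, i, limit):
    """Iterate Eq. (12.2): R_i = C_i + sum over hp(i) of ceil(R_i/T_k)*C_k.
    tasks must be sorted by decreasing priority; returns None past limit."""
    c_i = tasks[i][0]
    r = c_i
    while True:
        r_next = c_i + sum(math.ceil(r / t) * c for c, t, _ in tasks[:i])
        if r_next == r:
            return r if r <= limit else None  # fixed point reached
        if r_next > limit:
            return None                       # response time exceeds deadline
        r = r_next

def dm_schedulable(tasks):
    """DM: priorities ordered by relative deadline; require R_i <= D_i."""
    tasks = sorted(tasks, key=lambda task: task[2])
    return all(response_time(tasks, i, d) is not None
               for i, (_, _, d) in enumerate(tasks))

def edf_demand_schedulable(tasks):
    """Processor demand criterion, Eq. (12.3): check every absolute deadline
    up to the hyperperiod (valid when total utilization is at most one)."""
    if sum(c / t for c, t, _ in tasks) > 1:
        return False
    hyper = 1
    for _, t, _ in tasks:
        hyper = hyper * t // math.gcd(hyper, t)   # lcm of the periods
    deadlines = sorted({d + k * t for _, t, d in tasks for k in range(hyper // t)})
    return all(sum((length + t - d) // t * c for c, t, d in tasks) <= length
               for length in deadlines)

print(dm_schedulable([(1, 4, 3), (2, 6, 5)]))          # True
print(edf_demand_schedulable([(1, 4, 3), (2, 6, 5)]))  # True
```

For the example set, the DM iteration converges to R_1 = 1 and R_2 = 3, both within their deadlines, and the demand at every absolute deadline in [0, 12] stays below the interval length.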
12.3 Aperiodic Task Handling
Although in a real-time system most acquisition and control tasks are periodic, there exist computational
activities that must be executed only at the occurrence of external events (typically signaled through
interrupts), which may arrive at irregular times. When the system must handle aperiodic requests of
computation, we have to balance two conflicting interests: on the one hand, we would like to serve an event
as soon as possible to improve system responsiveness; on the other hand, we do not want to jeopardize the
schedulability of periodic tasks. If aperiodic activities are less critical than periodic tasks, then the objective
of a scheduling algorithm should be to minimize their response time, while guaranteeing that all periodic
tasks (although being delayed by the aperiodic service) complete their executions within their deadlines.
If some aperiodic task has a hard deadline, we should try to guarantee its timely completion offline. Such a
guarantee can be given only by assuming that aperiodic requests, although arriving at irregular intervals,
do not exceed a maximum given frequency, that is, they are separated by a minimum interarrival time.
An aperiodic task characterized by a minimum interarrival time is called a sporadic task. Let us consider
an example in which an aperiodic job J_a of 3 units of time must be scheduled by RM along with two
periodic tasks, having computation times C_1 = 1, C_2 = 3 and periods T_1 = 4, T_2 = 6, respectively.
As shown in Figure 12.2, if the aperiodic request is serviced immediately (i.e., with a priority higher than
that assigned to the periodic tasks), then task τ_2 will miss its deadline.
The simplest technique for managing aperiodic activities while preserving the guarantee for periodic
tasks is to schedule them in background. This means that an aperiodic task executes only when the
[Figure: timelines of τ_1, τ_2, and J_a over the interval [0, 12]; the deadline miss of τ_2 is marked.]
FIGURE 12.2 Immediate service of an aperiodic task. Periodic tasks are scheduled by RM.
[Figure: timelines of τ_1, τ_2, and J_a over the interval [0, 12]; J_a executes in the idle slots.]
FIGURE 12.3 Background service of an aperiodic task. Periodic tasks are scheduled by RM.
processor is not busy with periodic tasks. The disadvantage of this solution is that, if the computational
load due to periodic tasks is high, the residual time left for aperiodic execution can be insufficient for
satisfying their deadlines. Considering the same task set as before, Figure 12.3 illustrates how job J_a is
handled by a background service.
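The two figures can be reproduced with a small discrete-time simulation (an illustrative sketch, not from the handbook; it assumes J_a arrives at time 0, unit-length time slots, and deadlines equal to periods):

```python
def simulate(horizon, tasks, job_units, immediate):
    """tasks: (C, T) pairs in RM order (shorter period = higher priority).
    immediate=True serves the aperiodic job J_a at the highest priority;
    immediate=False serves it in background (idle slots only).
    Returns (deadline_missed, completion_time_of_J_a or None)."""
    remaining = [0] * len(tasks)
    job_left, finish, missed = job_units, None, False
    for t in range(horizon):
        for i, (c, period) in enumerate(tasks):
            if t % period == 0:           # new periodic release
                if remaining[i] > 0:      # previous instance still unfinished
                    missed = True
                remaining[i] = c
        if immediate and job_left > 0:    # J_a preempts everything
            job_left -= 1
        else:
            for i, _ in enumerate(tasks):
                if remaining[i] > 0:      # run highest-priority pending task
                    remaining[i] -= 1
                    break
            else:
                if job_left > 0:          # idle slot: background service
                    job_left -= 1
        if job_left == 0 and finish is None:
            finish = t + 1
    return missed, finish

# Figure 12.2: immediate service makes tau_2 miss its deadline at t = 6.
print(simulate(12, [(1, 4), (3, 6)], 3, immediate=True))    # (True, 3)
# Figure 12.3: background service preserves all periodic deadlines,
# but J_a completes only at t = 12.
print(simulate(12, [(1, 4), (3, 6)], 3, immediate=False))   # (False, 12)
```

The simulation makes the trade-off concrete: immediate service minimizes the response time of J_a at the cost of a periodic deadline miss, while background service protects the periodic tasks but quadruples the aperiodic response time.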
The response time of aperiodic tasks can be improved by handling them through a periodic server
dedicated to their execution. Like any other periodic task, a server is characterized by a period T_s and
an execution time C_s, called the server capacity (or budget). In general, the server is scheduled using the
algorithm adopted for the periodic tasks and, once activated, it starts serving the pending aperiodic requests
within the limit of its current capacity. The order of service of the aperiodic requests is independent
of the scheduling algorithm used for the periodic tasks, and it can be a function of the arrival time,
computation time, or deadline. Over the years, several aperiodic service algorithms have been
proposed in the real-time literature, differing in performance and complexity. Among the fixed-priority
algorithms we mention the Polling Server, the Deferrable Server [13,14], the Sporadic Server [15], and the
Slack Stealer [16]. Among the servers using dynamic priorities (which are more efficient on average),
we recall the Dynamic Sporadic Server [17,18], the Total Bandwidth Server [19], the Tunable Bandwidth
Server [20], and the Constant Bandwidth Server [21]. In order to clarify the idea behind an aperiodic
server, Figure 12.4 illustrates the schedule produced, under EDF, by a Dynamic Deferrable Server with
capacity C_s = 1 and period T_s = 4. We note that, when the absolute deadline of the server is equal to the
[Figure: timelines of τ_1, τ_2, J_a, and the server budget C_s over the interval [0, 12].]
FIGURE 12.4 Aperiodic service performed by a Dynamic Deferrable Server. Periodic tasks, including the server, are scheduled by EDF. C_s is the remaining budget available for J_a.
[Figure: timelines of τ_1, τ_2, and J_a over the interval [0, 12].]
FIGURE 12.5 Optimal aperiodic service under EDF.
one of a periodic task, priority is given to the server in order to enhance aperiodic responsiveness. We also
observe that the same task set would not be schedulable under a fixed-priority system.
Although the response time achieved by a server is less than that achieved through the background
service, it is not the minimum possible. The minimum response time can be obtained with an optimal
server (TB*) that assigns each aperiodic request the earliest possible deadline that still produces a feasible
EDF schedule [20]. The schedule generated by the optimal TB* server is illustrated in Figure 12.5.

\forall i, \ 1 \le i \le n: \quad \sum_{k \in hp(i)} \frac{C_k}{T_k} + \frac{C_i + B_i}{T_i} \le i(2^{1/i} - 1) \qquad (12.4)
where hp(i) denotes the set of tasks with priority higher than τ_i. The same test is valid for both the protocols
described above, the only difference being the amount of blocking that each task may experience.
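The test can be sketched as follows (an illustration, not code from the chapter; it assumes the standard form in which the blocking term B_i is added to task i's own utilization, as in Sha, Rajkumar, and Lehoczky [23]; tasks are (C_i, T_i, B_i) triples):

```python
def rm_blocking_test(tasks):
    """Sufficient fixed-priority test with blocking (Eq. 12.4):
    for every task i (in RM order, shortest period first),
    sum_{k in hp(i)} C_k/T_k + (C_i + B_i)/T_i <= i * (2^(1/i) - 1)."""
    tasks = sorted(tasks, key=lambda task: task[1])   # RM priority order
    for i, (c, t, b) in enumerate(tasks, start=1):
        u_hp = sum(ck / tk for ck, tk, _ in tasks[:i - 1])
        if u_hp + (c + b) / t > i * (2 ** (1 / i) - 1):
            return False
    return True

# One unit of blocking on the highest-priority task is tolerated here:
print(rm_blocking_test([(1, 4, 1), (3, 6, 0)]))   # True
# Without blocking, utilization 0.5 + 0.5 already exceeds the bound 0.828:
print(rm_blocking_test([(2, 4, 0), (3, 6, 0)]))   # False
```

Note that the bound on the right-hand side is the per-level Liu and Layland factor i(2^{1/i} - 1), so the test reduces to Eq. (12.1) when all blocking terms are zero.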
12.5 New Applications and Trends
In recent years, real-time system technology has been applied to several application domains where
computational activities have less stringent timing constraints and occasional deadline misses are typically
tolerated. Examples of such systems include monitoring, multimedia systems, flight simulators, and, in
general, virtual reality games. In such applications, missing a deadline does not cause catastrophic effects
on the system, but just a performance degradation. Hence, instead of requiring an absolute guarantee for
the feasibility of the schedule, such systems demand an acceptable quality of service (QoS). It is worth
observing that, since some timing constraints need to be handled anyway (although not critical), a non-real-
time operating system, such as Linux or Windows, is not appropriate: first of all, such systems do not provide
temporal isolation among tasks, so a sporadic peak load on a task may negatively affect the execution
of other tasks in the system. Furthermore, the lack of concurrency control mechanisms that prevent
priority inversion makes these systems unsuitable for guaranteeing a desired QoS level. On the other
hand, a hard real-time approach is also not well suited for supporting such applications, because resources
would be wasted due to static allocation mechanisms and pessimistic design assumptions. Moreover, in
many multimedia applications, tasks are characterized by highly variable execution times (consider, for
instance, an MPEG player), so providing precise estimates of task computation times is practically
impossible, unless one uses overly pessimistic figures. In order to provide efficient as well as predictable
support for this type of real-time application, several new approaches and scheduling methodologies
have been proposed. They increase the flexibility and the adaptability of a system to online variations.
For example, temporal protection mechanisms have been proposed to isolate task overruns and reduce
reciprocal task interference [21,24]. Statistical analysis techniques have been introduced to provide a
probabilistic guarantee aimed at improving system efficiency [21]. Other techniques have been devised to
handle transient and permanent overload conditions in a controlled fashion, thus increasing the average
computational load in the system. One method absorbs the overload by regularly aborting some jobs
of a periodic task, without exceeding a maximum limit specified by the user through a QoS parameter
describing the minimum number of jobs between two consecutive abortions [25,26]. Another technique
handles overloads through a suitable variation of periods, managed so as to decrease the processor utilization
to a desired level [27].
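As an illustration of the last approach, the simplest variant of period adaptation rescales all periods by a common factor until the utilization drops to the desired level (a toy sketch of ours; the elastic model of [27] instead distributes the compression according to per-task elastic coefficients):

```python
def compress_periods(tasks, u_desired):
    """Enlarge periods uniformly so total utilization becomes u_desired.
    tasks: (C_i, T_i) pairs; returns a new list of (C_i, T_i') pairs."""
    u = sum(c / t for c, t in tasks)
    if u <= u_desired:
        return list(tasks)          # no overload: leave periods unchanged
    scale = u / u_desired           # > 1, stretches every period
    return [(c, t * scale) for c, t in tasks]

# An overloaded set (U = 0.5 + 0.6 = 1.1) compressed to U = 0.8:
new_tasks = compress_periods([(2, 4), (3, 5)], 0.8)
print(sum(c / t for c, t in new_tasks))   # 0.8, up to rounding
```

Because each new utilization is C_i / (T_i * U / U_d) = (C_i / T_i) * U_d / U, the relative weights of the tasks are preserved and the total drops exactly to the desired value.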
12.6 Conclusions
This chapter surveyed some kernel methodologies aimed at enhancing the efficiency and the predictability
of real-time control applications. In particular, it presented some scheduling algorithms
and analysis techniques for periodic and aperiodic task sets. Two concurrency control protocols have
been described for accessing shared resources in mutual exclusion while avoiding the priority inversion
phenomenon. Each technique has the property of being analyzable, so that an offline guarantee can be
provided for the feasibility of the schedule within the timing constraints imposed by the application. For
soft real-time systems, such as multimedia systems or simulators, the hard real-time approach can be too
rigid and inefficient, especially when the application tasks have highly variable computation times. In
these cases, novel methodologies have been introduced to improve average resource exploitation. They
are also able to guarantee a desired QoS level and to control performance degradation during overload
conditions. In addition to research efforts aimed at providing solutions to more complex problems,
a concrete increase in the reliability of future real-time systems can only be achieved if the mature
methodologies are actually integrated into next-generation operating systems and languages, defining new
standards for the development of real-time applications. At the same time, programmers and software
engineers need to be educated about the appropriate use of the available technologies.
References
[1] J. Stankovic, Misconceptions About Real-Time Computing: A Serious Problem for Next-Generation Systems. IEEE Computer, 21(10), 10–19, 1988.
[2] J. Stankovic and K. Ramamritham, Tutorial on Hard Real-Time Systems, IEEE Computer Society
Press, Washington, 1988.
[3] G.C. Buttazzo, Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and
Applications, Kluwer Academic Publishers, Boston, MA, 1997.
[4] J. Stankovic, M. Spuri, M. Di Natale, and G. Buttazzo, Implications of Classical Scheduling Results
for Real-Time Systems. IEEE Computer, 28, 16–25, 1995.
[5] C.L. Liu and J.W. Layland, Scheduling Algorithms for Multiprogramming in a Hard Real-Time
Environment. Journal of the ACM, 20, 46–61, 1973.
[6] J.P. Lehoczky, L. Sha, and Y. Ding, The Rate-Monotonic Scheduling Algorithm: Exact Character-
ization and Average Case Behaviour. In Proceedings of the IEEE Real-Time Systems Symposium,
pp. 166–171, 1989.
[7] M.H. Klein et al., A Practitioner's Handbook for Real-Time Analysis: Guide to Rate Monotonic
Analysis for Real-Time Systems, Kluwer Academic Publishers, Boston, MA, 1993.
[8] M.L. Dertouzos, Control Robotics: The Procedural Control of Physical Processes. Information
Processing, Vol. 74, North-Holland, Amsterdam, 1974.
[9] J. Leung and J. Whitehead, On the Complexity of Fixed Priority Scheduling of Periodic Real-Time
Tasks. Performance Evaluation, 2, 237–250, 1982.
[10] N.C. Audsley, A. Burns, M. Richardson, K. Tindell, and A. Wellings, Applying New Scheduling
Theory to Static Priority Preemptive Scheduling. Software Engineering Journal, 8, 284–292, 1993.
[11] M. Joseph and P. Pandya, Finding Response Times in a Real-Time System. The Computer Journal,
29, 390–395, 1986.
[12] S.K. Baruah, R.R. Howell, and L.E. Rosier, Algorithms and Complexity Concerning the Preemptive
Scheduling of Periodic Real-Time Tasks on One Processor. Real-Time Systems, 2, 301–324, 1990.
[13] J.P. Lehoczky, L. Sha, and J.K. Strosnider, Enhanced Aperiodic Responsiveness in Hard Real-Time
Environments. In Proceedings of the IEEE Real-Time Systems Symposium, pp. 261–270, 1987.
[14] J.K. Strosnider, J.P. Lehoczky, and L. Sha, The Deferrable Server Algorithm for Enhanced Aperiodic
Responsiveness in Hard Real-Time Environments. IEEE Transactions on Computers, 44, 1995.
[15] B. Sprunt, L. Sha, and J. Lehoczky, Aperiodic Task Scheduling for Hard Real-Time Systems. Journal
of Real-Time Systems, 1, 27–60, 1989.
[16] J.P. Lehoczky and S. Ramos-Thuel, An Optimal Algorithm for Scheduling Soft-Aperiodic
Tasks in Fixed-Priority Preemptive Systems. In Proceedings of the IEEE Real-Time Systems
Symposium, 1992.
[17] T.M. Ghazalie and T.P. Baker, Aperiodic Servers in a Deadline Scheduling Environment. Real-Time
Systems, 9(1), 31–67, 1995.
[18] M. Spuri and G.C. Buttazzo, Efficient Aperiodic Service under Earliest Deadline Scheduling.
In Proceedings of the IEEE Real-Time Systems Symposium, San Juan, PR, December 1994.
[19] M. Spuri and G. Buttazzo, Scheduling Aperiodic Tasks in Dynamic Priority Systems. Real-Time
Systems, 10(2), 179–210, 1996.
[20] G. Buttazzo and F. Sensini, Optimal Deadline Assignment for Scheduling Soft Aperiodic Tasks in
Hard Real-Time Environments. IEEE Transactions on Computers, 48(10), 1035–1052, 1999.
[21] L. Abeni and G. Buttazzo, Integrating Multimedia Applications in Hard Real-Time Systems.
In Proceedings of the IEEE Real-Time Systems Symposium, Madrid, Spain, December 1998.
[22] R. Rajkumar, Synchronization in Real-Time Systems: A Priority Inheritance Approach, Kluwer Academic Publishers,
Boston, MA, 1991.
[23] L. Sha, R. Rajkumar, and J.P. Lehoczky, Priority Inheritance Protocols: An Approach to Real-Time
Synchronization. IEEE Transactions on Computers, 39, 1175–1185, 1990.
[24] I. Stoica, H. Abdel-Wahab, K. Jeffay, S. Baruah, J.E. Gehrke, and G.C. Plaxton, A Proportional
Share Resource Allocation Algorithm for Real-Time Timeshared Systems. In Proceedings of the IEEE
Real-Time Systems Symposium, December 1996.
[25] G. Buttazzo and M. Caccamo, Minimizing Aperiodic Response Times in a Firm Real-Time
Environment. IEEE Transactions on Software Engineering, 25, 22–32, 1999.
[26] G. Koren and D. Shasha, Skip-Over: Algorithms and Complexity for Overloaded Systems that
Allow Skips. In Proceedings of the IEEE Real-Time Systems Symposium, 1995.
[27] G. Buttazzo, G. Lipari, M. Caccamo, and L. Abeni, Elastic Scheduling for Flexible Workload
Management. IEEE Transactions on Computers, 51, 289–302, 2002.
[28] E. Bini, G.C. Buttazzo, and G.M. Buttazzo, A Hyperbolic Bound for the Rate Monotonic Algorithm.
In Proceedings of the 13th Euromicro Conference on Real-Time Systems, Delft, The Netherlands,
pp. 59–66, June 2001.
[29] E. Bini and G.C. Buttazzo, The Space of Rate Monotonic Schedulability. In Proceedings of the 23rd
IEEE Real-Time Systems Symposium, Austin, TX, December 2002.
[30] J. Stankovic, K. Ramamritham, M. Spuri, and G. Buttazzo, Deadline Scheduling for Real-Time
Systems, Kluwer Academic Publishers, Boston, MA, 1998.
13
Quasi-Static Scheduling of Concurrent Specifications

Alex Kondratyev
Cadence Berkeley Laboratories and Politecnico di Torino

Luciano Lavagno
Politecnico di Torino

Claudio Passerone
Politecnico di Torino

Yosinori Watanabe
Cadence Berkeley Laboratories

13.1 Introduction ............................................. 13-1
     Quasi-Static Scheduling • A Simple Example
13.2 Overview of Related Work ............................... 13-4
13.3 QSS for PNs ............................................. 13-5
     Definitions • Specification Model • Schedulability Analysis • Algorithmic Implementation
13.4 QSS for Boolean Dataflow ............................... 13-10
     Definitions • Schedulability Analysis • Comparison to PN Model
13.5 Conclusions ............................................. 13-14
References .................................................. 13-14
13.1 Introduction
13.1.1 Quasi-Static Scheduling
The phenomenal growth in complexity and breadth of use of embedded systems can be managed only by
raising the level of abstraction at which design activities start and most design space exploration occurs.
This enables greater reuse potential, but requires significant tool support for efficient analysis, mapping,
and synthesis. In this chapter we deal with methods aimed at providing designers with efficient techniques for
uniprocessor software synthesis, from formal models that explicitly represent the available concurrency.
These methods can be extended to multiprocessor support and to hardware synthesis; however, these
advanced topics are outside the scope of this chapter.
Concurrent specifications, such as dataflow networks [1], Kahn process networks [2], Communicating
Sequential Processes [3], synchronous languages [4], and graphical state machines [5], are interesting
because they expose the inherent parallelism in the application, which is much harder to recover
a posteriori by optimizing compilers. In such a specification, the application is described as a set of
processes that sequentially execute operations and communicate with each other. In considering an
implementation of the application, it is often necessary to analyze how these processes interact with each
other. This analysis is used for evaluating how often a process will be invoked during an execution of
the system, or how much memory will be required for implementing the communication between the
processes.
Quasi-Static Scheduling (QSS) is a technique for finding sequences of operations to be executed across
the processes that constitute a concurrent specification of the application. Several approaches have been
proposed [6–11], which use certain mathematical models to abstract the specification and aim
to compute graphs of finite size such that the sequences are given by traversing the graphs. We call
the sequences of operations, or the graph which represents them, a schedule of the specification. The
schedule is static in the sense that it statically commits to a particular execution order of operations of
the processes. In general, there exists more than one possible order of operations to be executed, with
a different implementation cost for each. On the other hand, by committing to a particular sequence,
a static schedule allows a more rigorous analysis of the interaction among the processes than dynamic
schedules, because one can precisely observe how the operations from different processes are interleaved
to constitute the system execution.
The reason to start from a concurrent specification is twofold. First of all, coarse-grained parallelism is
very difficult to recover from a sequential specification, except in relatively simple cases (e.g., nested loops
with affine memory accesses [12]). Second, parallel specifications offer a good model for performing system-
level partitioning experiments, aimed at finding the best mixed hardware/software implementation on a
complex SOC platform. The reason to look for a sequencing of the originally concurrent operations is
that in this chapter we are considering embedded software implementations, for which the context switching
implied by a multithreaded concurrent implementation would be very expensive, whenever concurrency
can be resolved at compile time.

This resolution is especially difficult if the specification involves data-dependent conditional constructs,
such as an if-then-else with a data-dependent condition, because different sets of operations may be executed
depending upon how the constructs are resolved. For such a specification, static scheduling produces
in principle a sequence of operations for each possible way of resolving the constructs (in practice, these
multiple sequences are collapsed as much as possible, in order to reduce code size). Note that these
constructs are resolved based on the data values, and therefore some of the resolutions of the constructs
may not happen at runtime in a particular execution of the system. The information about data
values is not available to the static scheduling algorithm, because the latter runs statically at compile time.
In this sense, scheduling for a specification with such constructs is called quasi-static. It is responsible for
providing a sequence of operations to be executed for each possible runtime resolution of data-dependent
choices.
After a simple motivating example, we present an overview of some approaches proposed in the
literature. In Section 13.2, we consider two questions that one is concerned with in QSS, and briefly
describe how these questions are addressed in two different models that have been proposed in the
literature. One of the models is Petri nets (PNs), and the other is Boolean Dataflow (BDF) Graphs. They
model a given concurrent specification in different ways, and thus the expressiveness of the models
and the issues that need to be accounted for to solve the scheduling problem are different. These two
models and the issues in their scheduling approaches are presented in more detail in Sections 13.3 and 13.4,
respectively.
13.1.2 A Simple Example
Figure 13.1 illustrates how QSS works. In Figure 13.1(a), there are two processes, each with an associated
sequential program. The one shown on the left reads a value from port DATA into variable d, computes
a value for the variable D and writes it to the port PORT, and then goes back to the beginning. The other
process reads a value for variable N from port START, and then iterates the body of the for-loop N times.
For each iteration, it reads two values from port IN, and sets them to x[0] and x[1], respectively. Here,
the third argument of the read function designates the number of data items to be read at a time. Since
(a) Initial specification. Left process (port DATA in, port PORT out):

while (1) {
    read(DATA, d, 1);
    D = d*d;
    write(PORT, D, 1);
}

Right process (ports START and IN in, port OUT out; IN is connected to PORT):

while (1) {
    read(START, N, 1);
    for (i=0, y=0; i<N; i++) {
        read(IN, x, 2);
        y = y + x[0] + 2*x[1];
    }
    write(OUT, y, 1);
}

(b) Result of the schedule (ports DATA and START in, port OUT out):

Start: read(START, N, 1); i=0; y=0;
DE:    if (i<N) {
           read(DATA, d, 1); D = d*d; x[0] = D;
           read(DATA, d, 1); D = d*d; x[1] = D;
           y = y + x[0] + 2*x[1]; i++;
           goto DE;
       } else {
           write(OUT, y, 1);
           goto Start;
       }
FIGURE 13.1 A simple example: (a) initial specification, (b) result of the schedule.
IN is connected with PORT, which means that the process on the left needs to produce the necessary data.
However, it writes only one data item at a time to PORT, and therefore it needs to iterate the body
of the while-loop twice in order to provide enough data items for this read statement. Once
the values of x have been set, a value for variable y is computed. At the end of the for-loop, the result
assigned to y is written to the port OUT, and this whole process is repeated. Throughout this chapter,
we assume that the communication between processes is made through a point-to-point finite buffer
with first-in first-out (FIFO) semantics. Therefore, a read operation can be completed only when the
requested number of data items is available in the corresponding buffer. Similarly, a write operation can
be completed only if the number of data items does not exceed the predefined bound of the buffer after
writing.

A result of scheduling this system is shown in Figure 13.1(b). It is a single process that interleaves the
statements of the original two processes. Note that the resulting process does not have ports for PORT and
IN, which were originally used for connecting the two processes, because the read and write operations
on these ports are replaced by assignments of the variable D to x[0] and x[1]. In this way, scheduling
uses data assignments to realize the communication between the original processes, which is often more
efficient to implement. Further, it repeats the same set of operations given by read; D = d*d; x[i] = D;,
making explicit the fact that one of the original processes needs to iterate twice for each iteration of the
for-loop of the other process. Such a repetition could be effectively exploited in general to realize an
efficient implementation, but it can be identified only by analyzing how the original processes interact
with each other, and therefore is not taken into account when implementing each process individually. The
effectiveness of this kind of scheduling is shown by case studies such as [13], where QSS was applied
to a part of the MPEG video decoding algorithm and the speed of the scheduled design was improved
by 45%. The improvement was mainly due to the replacement of the explicit communication among the
processes by data assignments, and also due to a reduction of the number of processes, which in turn
reduced the amount of context switching.
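The equivalence between Figure 13.1(a) and (b) can be checked executably. The sketch below (ours, not from the chapter) models the two-process network with an explicit FIFO for the PORT-to-IN channel, and the scheduled process with that channel replaced by plain assignments:

```python
from collections import deque

def network(start_vals, data_vals):
    """Two-process version: the producer squares DATA items into PORT;
    the consumer reads N from START and sums N pairs from IN (= PORT)."""
    data, port, out = deque(data_vals), deque(), []
    for n in start_vals:                    # read(START, N, 1)
        y = 0
        for _ in range(n):
            while len(port) < 2:            # read(IN, x, 2) blocks, so the
                d = data.popleft()          # producer must iterate twice
                port.append(d * d)
            x0, x1 = port.popleft(), port.popleft()
            y += x0 + 2 * x1
        out.append(y)                       # write(OUT, y, 1)
    return out

def scheduled(start_vals, data_vals):
    """Quasi-statically scheduled version: one process, no FIFO."""
    data, out = deque(data_vals), []
    for n in start_vals:                    # Start: read(START, N, 1)
        y = 0
        for _ in range(n):                  # DE: if (i < N)
            d = data.popleft(); x0 = d * d  # read(DATA); D = d*d; x[0] = D
            d = data.popleft(); x1 = d * d  # read(DATA); D = d*d; x[1] = D
            y += x0 + 2 * x1
        out.append(y)
    return out

inputs = ([2], [1, 2, 3, 4])
print(network(*inputs), scheduled(*inputs))   # [50] [50]
```

Both versions compute (1 + 2*4) + (9 + 2*16) = 50 for this input, illustrating that the schedule preserves the I/O behavior while eliminating the internal channel.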
13.2 Overview of Related Work
When solving the scheduling problem, two main questions are usually of interest:

1. Does the specification have a bounded-length cyclic schedule? By length, we mean the number
of steps in a schedule required to return the specification to its initial state. This question is
important if the specification is to be scheduled with a hard real-time constraint.
2. Can the specification be scheduled within bounded memory? This means that at every state
of a schedule one can compute and move to the next step using a finite amount of memory
for communication buffers, and eventually return to the original state. A bounded-length cyclic
schedule implies bounded memory but not vice versa, as will be discussed in more detail in
Section 13.4.
Depending on the descriptive power of a model used to represent the specification, these questions
have different answers. One such model is dataflow graphs, which are commonly used for digital signal
processing applications. In Static Dataflow (SDF) Graphs [1], the number of tokens produced by a
process^1 on an output port, or consumed by a process from an input port, is fixed and known statically,
that is, at compile time. Computationally efficient algorithms exist to answer questions 1 and 2 for any SDF
specification [1]. Furthermore, all schedulable graphs have bounded-length schedules and the required
memory is bounded.
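The consistency check underlying these algorithms amounts to solving the balance equations r_src * produced = r_dst * consumed over the edges of the graph. The following sketch (ours, assuming a connected graph; Python 3.9+ for math.lcm) computes the repetitions vector, or reports inconsistency:

```python
from fractions import Fraction
from math import lcm

def repetitions(actors, edges):
    """edges: (src, dst, produced, consumed) token rates.
    Returns integer firing counts balancing every edge, or None if the
    graph is inconsistent (no bounded-memory periodic schedule exists)."""
    rate = {actors[0]: Fraction(1)}
    changed = True
    while changed:                      # propagate relative firing rates
        changed = False
        for s, d, p, c in edges:
            if s in rate and d not in rate:
                rate[d] = rate[s] * p / c
                changed = True
            elif d in rate and s not in rate:
                rate[s] = rate[d] * c / p
                changed = True
    for s, d, p, c in edges:            # every edge must balance exactly
        if rate[s] * p != rate[d] * c:
            return None
    denom = lcm(*(r.denominator for r in rate.values()))
    return {a: int(r * denom) for a, r in rate.items()}

# As in Figure 13.1: the producer writes 1 token, the consumer reads 2 at
# a time, so the producer must fire twice per consumer firing:
print(repetitions(['P', 'C'], [('P', 'C', 1, 2)]))   # {'P': 2, 'C': 1}
# Mismatched rates on parallel edges make the graph inconsistent:
print(repetitions(['A', 'B'], [('A', 'B', 1, 2), ('A', 'B', 1, 1)]))  # None
```

A consistent rate solution is necessary but not sufficient for schedulability; a full SDF scheduler would additionally simulate one iteration to rule out deadlock caused by insufficient initial tokens.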
When the specification contains data-dependent conditional constructs, SDF graphs are insufficient to
model it. An extension of SDF to handle such constructs can be made in different ways: (1) by associating data
values with token flows, or (2) by introducing nondeterminism structurally (see Figure 13.2).
Examples of the first modeling approach can be found in a rich body of research on BDF Graphs and
their derivatives/extensions [6,7,9,11]. A similar modeling mechanism is also exploited in scheduling
approaches starting from networks of finite state machine-like models [10,14].

Interestingly, answering the question about the existence of bounded-length schedules for an arbitrary
BDF graph can be done nearly as simply as for SDF. However, the status of the bounded-memory problem
in BDF is very different. Annotating tokens with values makes the BDF model Turing complete, and the
problem of finding a bounded-memory schedule becomes undecidable [6]. For this reason, papers devoted
to BDF scheduling propose heuristic approaches for schedulability analysis.

An example of this is given in Reference 9. The proposed method initially sets a bound on the size of
each buffer based on the structural information of the specification, and tries to find a schedule within
the bound. If a schedule is not found, the procedure heuristically increases the sizes of some buffers, and
repeats the search. In order to claim the absence of a schedule within a given bound, the reachability space
of the system defined for the bound is completely analyzed.

Other heuristics exploit clustering algorithms, which in case of success derive a bounded-memory
schedule, while in case of failure leave the question open [6].
The work given in Reference 8 employs a PN as the underlying model for the system specification and
searches for a schedule in its reachability space. It abstracts data-dependent conditional constructs of the
specification using nondeterministic choices (see Figure 13.2). This abstraction in general helps improve
the efficiency of the scheduling procedure, while it makes the approach conservative. The PN model is
not Turing complete, and there are only a few problems that are undecidable for PNs [15]. Nevertheless,
decidability of the problem of finding bounded memory schedules for a PN has not been proven or
disproven. However, for the important subclass of equal-choice PNs (an extension of free-choice PNs),
bounded memory schedules are found efficiently (if they exist).
A list of modeling approaches and the complexity of their schedulability problems is shown in Table 13.1,
where O(|Cycle_seq|) denotes that the problem is decidable and its computational complexity is linear
¹ In the terminology of dataflow graphs, a process is often called an actor, and we may use these terms interchangeably
in this chapter.
QSS of Concurrent Specifications 13-5
[Figure: a data-dependent fragment (if (x > 0) A: y=x; else B: y=x*x;) modeled by value representation (BDF, with a SWITCH actor controlled by x > 0) versus structural representation (PN, with a nondeterministic choice between A and B); under static scheduling, SDF = Marked graphs.]
FIGURE 13.2 The PN and BDF modeling approaches.
TABLE 13.1 Models for Embedded Systems and the Complexity of Scheduling Problems

                            SDF graph        BDF graph        PN: Equal-choice   PN: General
Modeling data dependence    No               Yes              Yes                Yes
Bounded length schedule     O(|Cycle_seq|)   O(|Cycle_seq|)   O(|Cycle_seq|)     O(|Cycle_seq|)
Bounded memory schedule     O(|Cycle_seq|)   Undecidable      O(|Cycle_seq|)     Unknown
in the length of the sequence that brings the specification back to its initial state (called a cyclic sequence).
Note, however, that the size of this cyclic sequence can be exponential in the size of the SDF graph.
We will review scheduling approaches based on PNs and on BDF in more detail in Sections 13.3 and 13.4,
respectively.
13.3 QSS for PNs
13.3.1 Definitions
A PN is defined by a tuple (P, T, F, M0), where P and T are the sets of places and transitions, respectively.
F is a function from (P × T) ∪ (T × P) to the nonnegative integers. A marking M is another function from
P to the nonnegative integers, where M[p] denotes the number of tokens at p in M. M0 is the initial marking.
A PN can be represented by a directed bipartite graph, where an edge [u, v] exists if F(u, v) is positive,
which is called the weight of the edge. A transition t is enabled at a marking M if M[p] ≥ F(p, t) for
all p of P. In this case, one may fire the transition at the marking, which yields a marking M′ given by
M′[p] = M[p] − F(p, t) + F(t, p) for every place p.
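The enabling and firing rules just defined can be sketched as follows; the dictionary encoding of F and M and the tiny example net are assumptions made for illustration:

```python
def enabled(F, M, t, places):
    """Transition t is enabled at marking M iff M[p] >= F(p, t) for all p."""
    return all(M.get(p, 0) >= F.get((p, t), 0) for p in places)

def fire(F, M, t, places):
    """Fire t at M, yielding M'[p] = M[p] - F(p, t) + F(t, p)."""
    assert enabled(F, M, t, places)
    return {p: M.get(p, 0) - F.get((p, t), 0) + F.get((t, p), 0)
            for p in places}

# A two-place net where transition t moves one token from p1 to p2:
F = {("p1", "t"): 1, ("t", "p2"): 1}
M0 = {"p1": 1, "p2": 0}
print(fire(F, M0, "t", ["p1", "p2"]))  # -> {'p1': 0, 'p2': 1}
```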
[Figure: a system specification whose process body loops with a counter j up to N, issuing READ(DATA,d,1), READ(COEF,c,1), and WRITE(OUT,...,1) operations, together with the corresponding PN built from places p1–p7 and transitions t1–t10, with ports IN, DATA, COEF, and OUT.]
FIGURE 13.4 System specification and corresponding PN.
Definition 13.1 (Sequential schedule). Given a Petri net N = (P, T, F, M0), a sequential schedule of N is
a transition system Sch = (S, T, →, s0) with the following properties:
1. S is finite and there is a mapping µ: S → R(M0), with µ(s0) = M0.³
2. If transition t is fireable in state s, with s →t s′, then µ(s) →t µ(s′) in N.
3. If t1 is fireable in s, then t2 is fireable in s if and only if t2 ∈ ECS(t1).
4. For each state s ∈ S, there is a path s ⇒ s′ →t for each uncontrollable source transition t of N.
Property 2 implies trace containment between Sch and N (any feasible trace in the schedule is feasible
in the original PN). Property 3 indicates that one ECS is scheduled at each state. Finally, the existence of
the path in property 4 ensures that any input event from the environment will eventually be served.
³ This mapping is required in order to enable the same state to be visited multiple times with different termination
criteria.
Intuitively, scheduling can be viewed as a game between the scheduler and the environment. The rules
of the game are the following:
The environment makes a move by firing any of the uncontrollable source transitions.
The scheduler picks any of the enabled transitions to fire, with two exceptions:
(a) It has no control over choosing which one of the uncontrollable source transitions to fire.
(b) It cannot resolve choice for data-dependent constructs, which are described by equal-choice
places.
In cases (a) and (b) the scheduler must explore all possible branches during the traversal of the
reachability space, that is, fire all the transitions from the same ECS. However, it can decide the
moment for serving the source transitions or for resolving an equal choice, because it can finitely
postpone these by choosing some other enabled transitions to fire.
The goal of the game is to process any input from the environment while keeping the traversed space
(and hence the amount of memory required to implement the communication buffers) finite. In case of
success, the result is to both classify the original PN as schedulable and derive the set of states (schedule)
that the scheduler can visit while serving an arbitrary mix of source transitions.
Under the assumption that the environment is sufficiently slow to allow the scheduler to fire all
nonsource transitions, the schedule is an upper approximation of the set of states visited during the
real-time execution of the specification. This is due to the fact that the scheduler is constructed taking
into account the worst possible conditions, since it has no knowledge about the environment behavior
and data dependencies.
13.3.4 Algorithmic Implementation
In this section, we describe an algorithm for finding a schedule for each uncontrollable source transition a
of a given PN. The algorithm, which is fully described in Reference 8, gradually creates a rooted tree, and
a postprocessing step creates a cycle for each leaf to generate a schedule.
The algorithm initially creates a root node corresponding to the initial marking of the PN, and fires
the source transition a, generating a new marking. From here, it tries to create a tree by firing enabled
transitions. For each node that is added to the tree, it checks whether a termination condition is satisfied,
or if an ancestor with the same marking exists. In the latter case, the search along the path is stopped and
the branch is closed into a loop with the ancestor node. To avoid exploring the possibly infinite reachability
space of the PN, the algorithm uses a heuristic to identify a boundary of that space so that it would not
need to search beyond it [8].
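A much-simplified sketch of this tree construction is shown below. It treats every enabled transition as a branch that must be covered (as for an ECS) and uses a plain depth limit in place of the heuristic boundary of Reference 8, so it only illustrates the control structure, not the actual algorithm:

```python
def schedulable(successors, m0, depth_limit=50):
    """successors(m) returns [(t, m2), ...] for transitions enabled at m.
    A branch is closed into a loop when an ancestor repeats the marking;
    exceeding depth_limit stands in for crossing the heuristic boundary."""
    def dfs(m, ancestors):
        if m in ancestors:
            return True                       # close the branch into a loop
        if len(ancestors) >= depth_limit:
            return False                      # gave up at the boundary
        succ = successors(m)
        # every branch (e.g., every member of an ECS) must be coverable
        return bool(succ) and all(dfs(m2, ancestors | {m})
                                  for _t, m2 in succ)
    return dfs(m0, frozenset())

# A two-marking ping-pong net closes its loop immediately ...
ping = {(1, 0): [("t1", (0, 1))], (0, 1): [("t2", (1, 0))]}
print(schedulable(lambda m: ping[m], (1, 0)))              # True
# ... while an unbounded producer never revisits a marking:
print(schedulable(lambda m: [("p", (m[0] + 1,))], (1,)))   # False
```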
If a schedule is found, the corresponding code that implements the schedule must be generated.
Although a direct translation of the schedule into code is possible, it usually increases the code size, since
different paths of the schedule may be associated with the same sequence of transitions. Optimizations
are thus required to reduce the code size. Also, ports that originally belong to different processes might
become part of the same final task, and therefore do not require any communication primitive, but
rather are implemented using assignments or circular buffers, whose size can be statically determined by
analyzing the schedule.
As an example, let us consider the system illustrated in Figure 13.1(a). The PN model for the two
processes is shown in Figure 13.3, where the source port START is uncontrollable, while the source port
DATA is controllable. The ports PORT and IN are connected through place p4. In the initial marking,
places p2 and p6 have a token each.
After creating the root node of the tree, the algorithm to find the schedule fires the only uncontrollable
source transition START, generating a new node in the schedule with marking p2 p5 p6. Then, either transition
C or DATA is enabled, and we may decide to fire C. In the newly created node with marking p2 p7,
transitions D and E are both enabled, and they constitute an equal-choice set. Therefore, the algorithm
explores the two branches, until it can close loops for both of them. The final schedule is shown in
Figure 13.5.
[Figure: the schedule as a graph whose nodes are markings over places p1–p9 (e.g., p2 p6, p2 p5 p6, p2 p7) and whose arcs are labeled with the transitions START, C, D, E, F, DATA, A, B, and OUT.]
FIGURE 13.5 Schedule for the PN of Figure 13.3.
The last step is to generate the code, already shown in Figure 13.1(b). A node in the schedule with multiple
outgoing arcs corresponds to an if-then-else, and loops are implemented using the goto statement.
Note that in this example no optimization has been performed to reduce code size. On the other hand,
the communication between the two processes has been implemented using assignments in the single task
that is generated.
13.4 QSS for Boolean Dataflow
13.4.1 Definitions
An SDF graph [1] is a directed graph D = (A, E) with actors A represented by nodes and arcs E representing
connections between the actors. These connections convey values between nodes, similar to the tokens in
PNs. Values arrive at actors respecting FIFO ordering.
Two mapping functions I and O are defined from E to the nonnegative integers. They define the consumption
and production rates of values for the connections between nodes, that is, for a connection e = (a, b)
from an actor a to an actor b, O(e) (respectively, I(e)) shows how many tokens are produced at (consumed
from) e when the actor a (respectively, b) fires.
The initial marking M0 tells how many tokens reside on the arcs E before the SDF graph starts an execution.
An actor a fires if every input arc e carries at least I(e) tokens. Firing an actor consumes I(e) tokens
from each input arc and produces O(e) tokens on every output arc. A connection of an actor to its input
(or output) is denoted as an input (or output) port.
A simple example of an SDF graph is shown in Figure 13.6. In its initial marking only actor a is enabled.
Firing a produces a token on each output port (arcs (a, c) and (a, b)). Actor a needs to fire twice to enable c
because I(a, c) = 2. The feasible sequence of actor firings a, a, c, b returns the graph to the original
marking.
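This firing sequence can be replayed mechanically; the rates below are those read off Figure 13.6 in the text (both b and c consume two of the single tokens a produces per firing), so they are an assumption about the exact figure:

```python
# (produced, consumed) rates per arc
arcs = {("a", "c"): (1, 2), ("a", "b"): (1, 2)}

def fire(marking, actor):
    m = dict(marking)
    for (src, dst), (o, i) in arcs.items():
        if dst == actor:                  # consume from every input arc
            assert m[(src, dst)] >= i, f"{actor} is not enabled"
            m[(src, dst)] -= i
        if src == actor:                  # produce on every output arc
            m[(src, dst)] += o
    return m

m0 = {e: 0 for e in arcs}                 # only actor a is enabled initially
m = m0
for actor in ["a", "a", "c", "b"]:
    m = fire(m, actor)
print(m == m0)  # True: the sequence returns to the original marking
```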
An extension of SDF graphs to capture specifications with data dependency results in adding to the
model dynamic actors [6] that satisfy the following properties:
1. An input port may be a conditional port, where the number of tokens consumed by the port
is given by a two-valued integer function of the value of a Boolean token received at the
[Figure: actors a, b, and c connected by arcs with production/consumption rates 1 and 2.]
FIGURE 13.6 SDF graph.
[Figure: (a) the program fragment A; if (b) { C; } else { D; } E;, (b) the corresponding BDF graph with actors A, B, C, D, E and SWITCH/SELECT actors controlled by the Boolean stream b over arcs e1–e8, and (c) the incidence matrix with symbolic entries p and 1 − p.]
FIGURE 13.7 If-then-else BDF graph.
special input port (the control port) of the same actor. One of the two values of the function
is zero.
2. Control ports are never conditional ports, and always transfer exactly one token per execution.
The canonical examples of this type of actor are SWITCH and SELECT⁴ (e.g., see Figure 13.7[b]).
The SWITCH actor consumes an input token and a control token. If the control token is TRUE, the input
token is copied to the output labeled T; otherwise it is copied to the output labeled F. The SELECT actor
performs the inverse operation, reading a token from the T input if the control token is TRUE, otherwise
reading from the F input, and copying the token to the output.
Figure 13.7(b) shows an example of a BDF graph that uses SWITCH and SELECT actors to model the piece
of program in Figure 13.7(a).
13.4.2 Schedulability Analysis
13.4.2.1 Bounded Length Schedules
Deriving a bounded length schedule in a BDF graph reduces to the following two steps:
1. Finding a sequence of actors (cyclic sequence) that returns the graph to the initial marking.
2. Simulating the firing of a cyclic sequence to make sure that it is fireable under the given initial
marking.
The first task can be done by solving the so-called system of balance equations. This requires constructing
the incidence matrix of the BDF graph, which contains the integer O(e_i) in position (j, i)
if the ith actor produces O(e_i) tokens on the jth arc, and −I(e_i) if the ith actor consumes I(e_i) tokens from
the jth arc (self-loop arcs are ignored, since their consistency checking is trivial). For dynamic actors the
number of produced and consumed tokens depends on control ports. This is represented in the incidence
matrix by using symbolic variables p_i (one for each Boolean stream) that are interpreted as ratios of TRUE
⁴ Note that this is different from the select operation introduced in Section 13.3.1, because it is a deterministic
operation, depending not on the number of available input tokens, which in turn may depend on the scheduling order,
but rather on the value of the control port, which is guaranteed to be independent of the scheduling order.
values out of all values present in the stream (this ratio is [1 − p_i] for FALSE values). Then the system of
equations to be solved is:

Γ · r = 0

where Γ is the incidence matrix, 0 is a vector with all entries equal to 0, and r is the repetition vector with
one entry, called r_i, for each actor, representing how many times actor i fires in order to bring the BDF
graph to the original marking. If a nonzero solution for this system exists, then the repetition vector shows
how many times each actor must fire to return the graph to the initial marking.
Applying the above procedure to the incidence matrix in Figure 13.7(c), corresponding to the BDF
graph of Figure 13.7(b), one can find the repetition vector r = [1 1 (1 − p) p 1 1 1]. Note that the
existence of a solution cannot depend on the value of p, since the values of the Boolean stream b are arbitrary.
By simulating the firing of actors according to the values of r for both p = 0 and p = 1, one can see
that the repetition vector indeed describes a fireable sequence of actors, and the existence of a bounded
length schedule for the BDF graph of Figure 13.7(b) is proved. This procedure is effective for any arbitrary
BDF graph [6].
13.4.2.2 Bounded Memory Schedule
If a bounded length schedule is found, then it obviously provides a bounded memory schedule as well.
However, the converse is not true. There are specifications that do not have a bounded length schedule, but
are perfectly schedulable with bounded memory. A common example is given by a loop with an unknown
bound on the number of iterations (e.g., see Figure 13.1). For such specifications the length of the cyclic
sequence is unbounded, because it depends on the number of loop iterations.
The problem of finding bounded memory schedules in BDF graphs is undecidable [6]; hence conservative
heuristic techniques, which may not find a solution even if one exists (exactly like the algorithm of
Section 13.3.4), must be used. We describe two of them: clustering and state enumeration.
Clustering. The goal of the clustering algorithm is to map a BDF graph into the traditional control
structures used by high-level languages, such as if-then-else and do-while, whenever possible. The
subgraphs corresponding to these structures can then be treated as atomic actors.
At first, adjacent actors with the same producing/consuming rates are merged into a single cluster, where
possible. Actors may not be merged if this would create deadlock, or if the resulting cluster would not be
a BDF actor (e.g., it may depend on a control arc that is hidden by the merge operation). Then clusters
are enclosed into conditional or loop constructs, as required in order to match the token production and
consumption rates of their neighbors. The procedure terminates when no more changes in the BDF graph
are possible. At this point, if the interior of each cluster has a schedule of bounded length, and the top-level
cluster does as well, then the entire graph can be scheduled with bounded memory.
State Enumeration. One can enumerate the states that the system can reach by simulating the execution
of the BDF graph, similar to the scheduling approach described in Section 13.3.4. If the graph cannot
be scheduled in bounded memory, however, a straightforward state enumeration procedure will not
terminate. One possible solution is to impose an upper bound on the number of tokens that may appear
on each arc, according to some heuristic, and to assume that there is a problem if this bound is exceeded.
A technique similar to this is used in Ptolemy's dynamic dataflow scheduler [18].
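Such a bounded enumeration can be sketched as a breadth-first search over markings. The tuple encoding of markings and the toy successor functions are assumptions made for illustration:

```python
from collections import deque

def enumerate_states(successors, m0, bound):
    """Explore all reachable markings; return the set of visited states,
    or None if some arc ever exceeds `bound` tokens (assumed problem)."""
    seen, frontier = {m0}, deque([m0])
    while frontier:
        m = frontier.popleft()
        for m2 in successors(m):
            if any(tokens > bound for tokens in m2):
                return None               # heuristic bound exceeded: give up
            if m2 not in seen:
                seen.add(m2)
                frontier.append(m2)
    return seen

# A self-draining arc stays within any bound >= 1 ...
loop = {(1,): [(0,)], (0,): [(1,)]}
print(sorted(enumerate_states(lambda m: loop[m], (1,), bound=1)))  # [(0,), (1,)]
# ... while an ever-filling arc eventually trips the bound:
print(enumerate_states(lambda m: [(m[0] + 1,)], (0,), bound=8))    # None
```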
13.4.3 Comparison to PN Model
Boolean Dataflow graphs, being Turing complete, provide a very powerful specification model. It is remarkable
that, in spite of that, some important scheduling problems (like the bounded length schedule) have
efficient and simple solutions. When a designer seeks schedules with that kind of properties,
BDF graphs are an excellent choice. An attractive feature of BDF modeling is that keeping track of
the consistency of decisions made by different actors consuming the same Boolean stream is easy. This is
automatically ensured through the use of FIFO semantics in storing Boolean values at actor ports.
In PN modeling, on the other hand, data dependencies are fully abstracted as nondeterministic choices.
This makes the designer responsible for ensuring that different choices are resolved consistently when they
stem from the same data.
Undecidability in the BDF case comes from the fact that establishing the equivalence of Boolean streams
is also undecidable. Therefore, ensuring the consistency of choices made by dynamic actors is possible only
when the stream is exactly the same (as for the SWITCH and SELECT actors in Figure 13.7[b]), and hence
when a single p variable can be used to represent both.
Note that in such cases, that is, with syntactic equivalence, an improved version of the tool generating
a PN model from a C model could annotate the PN so as to make this equivalence explicit. However,
no such capability is available from the PN-based tools. They resort to the simple, but cumbersome,
techniques described in Reference 13. Hence the BDF scheduling implementation in Ptolemy [18] is more
user-friendly in this respect.
The abstraction of data by nondeterministic choices in PNs is, however, of great importance when
solving more difficult scheduling problems. Applications very commonly contain computations with an
unknown number of iterations. For them the most interesting scheduling problem is finding a bounded
memory schedule. Here the power of the BDF model becomes a burden, and makes it very difficult to devise
efficient heuristics to solve the problem.
To illustrate this, let us look at Figure 13.8, which represents a BDF graph corresponding to the example
specification of Figure 13.1, where diamonds labeled with F denote the initial marking of the corresponding
arcs with the Boolean value False.
Applying clustering to this example does not provide a conclusive decision about its schedulability,
because it cannot prove that the Boolean value False would ever be produced at the input arcs of the
SELECT actors, which is needed in order to return the network to the initial marking. This is an inherent
problem of the clustering approach: it is not clear how often it succeeds in doing scheduling analysis,
unless the specification was already devised so as to use only recognizable actor patterns.
To find a schedule for the example in Figure 13.8 one must use the state enumeration approach.
However, contrary to the PN case, the state of a BDF graph must also include the values stored in the
FIFO queues for all Boolean streams of dynamic actors. This leads to significant memory penalties when
[Figure: Process 1 and Process 2 modeled as a BDF graph with SWITCH and SELECT actors, a comparison >0?, constants 0 and 1, an increment, and ports START, DATA, and OUT; diamonds labeled F mark arcs initially carrying the Boolean value False.]
FIGURE 13.8 BDF graph for the example of Figure 13.1.
storing the state graph. Even worse, it also significantly reduces the capabilities of pruning the explored
reachability space based on different termination conditions. These conditions impose a partial order
between states and avoid generation of the reachability space beyond ordered states [6, 8]. For PNs the
partial order is established purely by markings, while for BDF graphs, in addition to markings, it also
requires considering the values of Boolean streams. Due to this, state graphs of BDF have sparser ordering
relations and are significantly larger.
Hence we feel that for bounded memory quasi-static schedulability analysis, the PN approach is simpler
and more suitable, especially if the limitations of current translators from C to PNs are addressed.
13.5 Conclusions
This chapter described modeling methods and scheduling algorithms that bridge the gap between
specification and implementation of reactive systems. From a specification given in terms of concurrent
communicating processes, and by deriving intermediate representations based on PNs and dataflow
graphs, one can (unfortunately not always) obtain a sequential schedule that can be efficiently implemented
on a processor.
Future work should consider better heuristics to find such schedules, since the problem is undecidable
in general once data-dependent choices come into play. Furthermore, it would be interesting to extend it
by considering sequential and concurrent implementations on several resources (e.g., CPUs and custom
datapaths) [19]. Another body of future research concerns the extension of the notion of schedule into
the time domain, in order to cope with performance constraints, while all the approaches considered in
this chapter assume infinite processing speed with respect to the speed of the environment. For real-time
applications one would need to extend the scheduling frameworks by explicit annotation of system events
with delays, and by using timing-driven algorithms for schedule construction.
References
[1] E.A. Lee and D.G. Messerschmitt. Static scheduling of synchronous data flow graphs for digital
signal processing. IEEE Transactions on Computers, C-36(1), 24–35, 1987.
[2] G. Kahn. The semantics of a simple language for parallel programming. In Proceedings of IFIP
Congress, August 1974.
[3] C.A.R. Hoare. Communicating Sequential Processes. International Series in Computer Science.
Prentice Hall, Hertfordshire, 1985.
[4] N. Halbwachs. Synchronous Programming of Reactive Systems. Kluwer Academic Publishers,
Boston, MA, 1993.
[5] D. Harel, H. Lachover, A. Naamad, A. Pnueli, M. Politi, R. Sherman, A. Shtull-Trauring, and
M. Trakhtenbrot. STATEMATE: a working environment for the development of complex reactive
systems. IEEE Transactions on Software Engineering, 16(4), 403–414, 1990.
[6] J. Buck. Scheduling Dynamic Dataflow Graphs with Bounded Memory Using the Token Flow
Model. Ph.D. thesis, University of California, Berkeley, 1993.
[7] J.T. Buck. Static scheduling and code generation from dynamic dataflow graphs with integer valued
control streams. In Proceedings of the 28th Asilomar Conference on Signals, Systems, and Computers,
October 1994.
[8] J. Cortadella, A. Kondratyev, L. Lavagno, C. Passerone, and Y. Watanabe. Quasi-static scheduling
of independent tasks for reactive systems. IEEE Transactions on Computer-Aided Design, 24(9),
2004.
[9] T.M. Parks. Bounded Scheduling of Process Networks. Ph.D. thesis, Department of EECS,
University of California, Berkeley, 1995. Technical report UCB/ERL 95/105.
[10] K. Strehl, L. Thiele, D. Ziegenbein, R. Ernst, et al. Scheduling hardware/software systems using
symbolic techniques. In International Workshop on Hardware/Software Codesign, 1999.
[11] P. Wauters, M. Engels, R. Lauwereins, and J.A. Peperstraete. Cyclo-dynamic dataflow. In Proceedings
of the 4th EUROMICRO Workshop on Parallel and Distributed Processing, January 1996.
[12] T. Stefanov, C. Zissulescu, A. Turjan, B. Kienhuis, and E. Deprettere. System design using Kahn
process networks: the Compaan/Laura approach. In Proceedings of the Design Automation and Test
in Europe Conference, February 2004.
[13] G. Arrigoni, L. Duchini, L. Lavagno, C. Passerone, and Y. Watanabe. False path elimination in
quasi-static scheduling. In Proceedings of the Design Automation and Test in Europe Conference,
March 2002.
[14] F. Thoen, M. Cornero, G. Goossens, and H. De Man. Real-time multi-tasking in software synthesis
for information processing systems. In Proceedings of the International System Synthesis Symposium,
1995.
[15] J. Esparza. Decidability and complexity of Petri net problems: an introduction. In Lectures
on Petri Nets I: Basic Models, Advances in Petri Nets, Lecture Notes in Computer Science, vol. 1491,
1996, pp. 374–428.
[16] T. Murata. Petri nets: properties, analysis, and applications. Proceedings of the IEEE, 77(4), 541–580,
1989.
[17] E.A. de Kock, G. Essink, W.J.M. Smits, P. van der Wolf, J.-Y. Brunel, W.M. Kruijtzer, P. Lieverse,
and K.A. Vissers. YAPI: application modeling for signal processing systems. In Proceedings of the
37th Design Automation Conference, June 2000.
[18] J. Buck, S. Ha, E.A. Lee, and D.G. Messerschmitt. Ptolemy: a framework for
simulating and prototyping heterogeneous systems. International Journal in Computer Simulation,
4(2), 1994.
[19] J. Cortadella, A. Kondratyev, L. Lavagno, A. Taubin, and Y. Watanabe. Quasi-static scheduling for
concurrent architectures. Fundamenta Informaticae, 62, 171–196, 2004.
Timing and
Performance
Analysis
14 Determining Bounds on Execution Times
Reinhard Wilhelm
15 Performance Analysis of Distributed Embedded Systems
Lothar Thiele and Ernesto Wandeler
14
Determining Bounds
on Execution Times
Reinhard Wilhelm
Universität des Saarlandes
14.1 Introduction ............................................. 14-2
     Tool Architecture and Algorithm • Timing Anomalies • Contexts
14.2 Cache-Behavior Prediction ................................ 14-6
     Cache Memories • Cache Semantics • Abstract Semantics
14.3 Pipeline Analysis ........................................ 14-10
     Simple Architectures without Timing Anomalies • Processors with Timing Anomalies • Algorithm Pipeline-Analysis • Pipeline Modeling • Formal Models of Abstract Pipelines • Pipeline States
14.4 Path Analysis Using Integer Linear Programming .......... 14-17
14.5 Other Ingredients ........................................ 14-18
     Value Analysis • Control Flow Specification and Analysis • Frontends for Executables
14.6 Related Work ............................................. 14-19
     A (Partly) Dynamic Method • Purely Static Methods
14.7 State of the Art and Future Extensions .................. 14-20
Acknowledgments .............................................. 14-21
References ................................................... 14-21
Run-time guarantees play an important role in the area of embedded systems and especially hard real-time
systems. These systems are typically subject to stringent timing constraints, which often result from
the interaction with the surrounding physical environment. It is essential that the computations are
completed within their associated time bounds; otherwise severe damages may result, or the system may
be unusable. Therefore, a schedulability analysis has to be performed which guarantees that all timing
constraints will be met. Schedulability analyses require upper bounds for the execution times of all
tasks in the system to be known. These bounds must be safe, that is, they may never underestimate the
real execution time. Furthermore, they should be tight, that is, the overestimation should be as small
as possible.
In modern microprocessor architectures, caches, pipelines, and all kinds of speculation are key features
for improving (average-case) performance. Unfortunately, they make the analysis of the timing behavior
of instructions very difcult, since the execution time of an instruction depends on the execution history.
A lack of precision in the predicted timing behavior may lead to a waste of hardware resources, which
would have to be invested in order to meet the requirements. For products which are manufactured
in high quantities, for example, in the automobile or telecommunications markets, this would result in
intolerable expenses.
The subject of this chapter is one particular approach and the subtasks involved in computing safe and
precise bounds on the execution times for real-time systems.
14.1 Introduction
Hard real-time systems are subject to stringent timing constraints which are dictated by the surrounding
physical environment. We assume that a real-time system consists of a number of tasks, which realize
the required functionality. A schedulability analysis for this set of tasks and a given hardware has to
be performed in order to guarantee that all the timing constraints of these tasks will be met (timing
validation). Existing techniques for schedulability analysis require upper bounds for the execution times
of all the system's tasks to be known. These upper bounds are commonly called the worst-case execution
times (WCETs), a misnomer that causes a lot of confusion and will therefore not be adopted in this
presentation. In analogy, lower bounds on the execution time have been named best-case execution times
(BCETs). These upper bounds (and lower bounds) have to be safe, that is, they must never underestimate
(overestimate) the real execution time. Furthermore, they should be tight, that is, the overestimation
(underestimation) should be as small as possible.
Figure 14.1 depicts the most important concepts of our domain. The system shows a certain variation
of execution times depending on the input data or different behavior of the environment. In general,
the state space is too large to exhaustively explore all possible executions and so determine the exact
worst-case and best-case execution times, WCET and BCET, respectively. Some abstraction of the system
is necessary to make a timing analysis of the system feasible. These abstractions lose information, and
thus are responsible for the distance between WCETs and upper bounds, and between BCETs and lower
bounds.
How much is lost depends both on the methods used for timing analysis and on system properties,
such as the hardware architecture and the cleanness of the software. So, the two distances mentioned above,
termed upper predictability and lower predictability, can be seen as a measure for the timing predictability
of the system. Experience has shown that the two predictabilities can be quite different, cf. Reference 1.
The methods used to determine upper bounds and lower bounds are the same. We will concentrate on
the determination of upper bounds unless otherwise stated.
Methods to compute sharp bounds for processors with fixed execution times for each instruction have
long been established [2,3]. However, in modern microprocessor architectures, caches, pipelines, and all
kinds of speculation are key features for improving (average-case) performance. Caches are used to bridge
the gap between processor speed and the access time of main memory. Pipelines enable acceleration
by overlapping the executions of different instructions. The consequence is that the execution time of
individual instructions, and thus the contribution of one execution of an instruction to the program's
[Figure: a time axis starting at 0 showing lower bound, best case, worst case, and upper bound in that order; the span between best and worst case is the variation of execution time, the worst case marks the w.c. performance, the upper bound the w.c. guarantee, and the distance between them the predictability.]
FIGURE 14.1 Basic notions concerning timing analysis of systems.
2006 by Taylor & Francis Group, LLC
Determining Bounds on Execution Times 14-3
[Figure: paths through the stages Fetch, Issue, Execute, Retire, with branches on "ICache miss?" (30 cycles on a miss), "Unit occupied?" (up to 19 cycles), "Multicycle?" (3 cycles instead of 1), and "Pending instructions?" (up to 6 cycles); the fastest path takes 4 cycles, the slowest 41.]
FIGURE 14.2 Different paths through the execution of a multiply instruction. Unlabeled transitions take 1 cycle.
execution time can vary widely. The interval of execution times for one instruction is bounded by the
execution times of the following two cases:
The instruction goes smoothly through the pipeline; all loads hit the cache, no pipeline hazard
happens, that is, all operands are ready, no resource conflicts with other currently executing
instructions exist.
Everything goes wrong, that is, instruction and/or operand fetches miss the cache, resources
needed by the instruction are occupied, etc.
Figure 14.2 shows the different paths through a multiply instruction of a PowerPC processor. The
instruction-fetch phase may find the instruction in the cache (cache hit), in which case it takes 1 cycle to
load it. In the case of a cache miss, it may take something like 30 cycles to load the memory block con-
taining the instruction into the cache. The instruction needs an arithmetic unit, which may be occupied
by a preceding instruction. Waiting for the unit to become free may take up to 19 cycles. This latency
would not occur if the instruction fetch had missed the cache, because the cache-miss penalty of 30 cycles
has allowed any preceding instruction to terminate its arithmetic operation. The time it takes to multiply
two operands depends on the size of the operands; for small operands, 1 cycle is enough, for larger ones, three
are needed. When the operation has finished, it has to be retired in the order it appeared in the instruction
stream. The processor keeps a queue for instructions waiting to be retired. Waiting for a place in this queue
may take up to 6 cycles. On the dashed path, where execution always takes the fast way, the overall
execution time is 4 cycles. However, on the dotted path, where it always takes the slowest way, the overall
execution time is 41 cycles.
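The two extreme paths can be checked by summing the stage latencies quoted above. The exact stage-by-stage split used below is an assumption reconstructed from the description; only the individual penalties and the totals of 4 and 41 cycles are given in the text.

```python
# Sketch: cycle counts along the fastest and slowest paths through the
# multiply instruction of Figure 14.2. The per-stage decomposition is an
# assumption; the text gives the penalties and the two totals.

FAST_PATH = {
    "fetch (cache hit)": 1,
    "issue": 1,
    "execute (small operands)": 1,
    "retire": 1,
}

SLOW_PATH = {
    "fetch (cache miss)": 30,
    "issue": 1,
    "execute (multicycle)": 3,
    "wait for retire queue": 6,
    "retire": 1,
}

def path_cycles(stages):
    """Total execution time is the sum of the stage latencies."""
    return sum(stages.values())

print(path_cycles(FAST_PATH))  # 4
print(path_cycles(SLOW_PATH))  # 41
```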
We will call any increase in execution time during an instruction's execution a timing accident and
the number of cycles by which it increases the timing penalty of this accident. Timing penalties for an
instruction can add up to several hundred processor cycles. Whether the execution of an instruction
encounters a timing accident depends on the execution state, for example, the contents of the cache(s), the
occupancy of other resources, and thus on the execution history. It is therefore obvious that the attempt
to predict or exclude timing accidents needs information about the execution history.
For certain classes of architectures, namely those without timing anomalies (cf. Section 14.1.2), excluding
timing accidents means decreasing the upper bounds. However, for those with timing anomalies this
assumption is not true.
14.1.1 Tool Architecture and Algorithm
A more or less standard architecture for timing-analysis tools has emerged [4-6]. Figure 14.3 shows
one instance of this architecture. A first phase, depicted on the left, predicts the behavior of processor
[Figure: the tool's component pipeline, beginning with a CFG builder.]
FIGURE 14.3 The architecture of the aiT timing-analysis tool.
components for the instructions of the program. It usually consists of a sequence of static analyses
of the program. Together they allow the derivation of safe upper bounds for the execution times of
basic blocks. A second phase, the column on the right, computes an upper bound on the execution
times over all possible paths of the program. This is realized by mapping the control flow of the pro-
gram to an Integer Linear Program and solving it by appropriate methods. This architecture has been
successfully used to determine precise upper bounds on the execution times of real-time programs run-
ning on processors used in embedded systems [1,7-10]. A commercially available tool, aiT by AbsInt,
cf. http://www.absint.de/wcet.htm, was implemented and is used in the aeronautics and
automotive industries.
The structure of the first phase, processor-behavior prediction, often called microarchitecture analysis,
may vary depending on the complexity of the processor architecture. A first, modular approach would be
the following:
1. Cache-behavior prediction determines statically and approximately the contents of caches at each
program point. For each access to a memory block, it is checked whether the analysis can safely
predict a cache hit.
Information about cache contents can be forgotten after the cache analysis. Only the miss/hit
information is needed by the pipeline analysis.
2. Pipeline-behavior prediction analyzes how instructions pass through the pipeline taking cache-hit
or miss information into account. The cache-miss penalty is assumed for all cases where a cache
hit cannot be guaranteed.
At the end of simulating one instruction, the pipeline analysis continues with only those states
that show the locally maximal execution times. All others can be forgotten.
14.1.2 Timing Anomalies
Unfortunately, this approach is not safe for many processor architectures. Most powerful microprocessors
have so-called timing anomalies. Timing anomalies are counterintuitive influences of the (local) execution
time of one instruction on the (global) execution time of the whole program [11]. Several processor
features can interact in such a way that a locally faster execution of an instruction can
lead to a globally longer execution time of the whole program.
For example, a cache miss contributes the cache-miss penalty to the execution time of a program. It was,
however, observed for the MCF 5307 [12], that a cache miss may actually speed up program execution.
Since the MCF 5307 has a unified cache and the fetch and execute pipelines are independent, the following
can happen: a data access that is a cache hit is served directly from the cache. At the same time, the
fetch pipeline fetches another instruction block from main memory, performing branch prediction and
replacing two lines of data in the cache. These may be reused later on and cause two misses. If the data
access was a cache miss, the instruction fetch pipeline may not have fetched those two lines, because the
execution pipeline may have resolved a misprediction before those lines were fetched.
The general case of a timing anomaly is the following. Different assumptions about the processor's
execution state, for example, the fact that the instruction is or is not in the instruction cache, will result in
a difference ΔT_local of the execution time of the instruction between these two cases. Either assumption
may lead to a difference ΔT of the global execution time compared to the other one. We say that a timing
anomaly occurs if either

ΔT_local < 0, that is, the instruction executes faster, and
  ΔT < ΔT_local, that is, the overall execution is accelerated by more than the acceleration of the
  instruction, or
  ΔT > 0, that is, the program runs longer than before;

ΔT_local > 0, that is, the instruction takes longer to execute, and
  ΔT > ΔT_local, that is, the overall execution is extended by more than the delay of the
  instruction, or
  ΔT < 0, that is, the overall execution of the program takes less time than before.

The case ΔT_local < 0 ∧ ΔT > 0 is a critical case for our timing analysis. It makes it impossible to use
local worst cases for the calculation of the program's execution time. The analysis has to follow all possible
paths as will be explained in Section 14.3.
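The case distinction above can be condensed into a small predicate over the local and global execution-time differences; the following is an illustrative sketch, not part of any tool.

```python
def is_timing_anomaly(dT_local, dT):
    """A timing anomaly occurs when the global effect dT is not bounded
    by the local effect dT_local in the intuitive way."""
    if dT_local < 0:  # instruction executes faster locally
        # anomaly: global speedup exceeds the local one, or program gets slower
        return dT < dT_local or dT > 0
    if dT_local > 0:  # instruction takes longer locally
        # anomaly: global slowdown exceeds the local one, or program gets faster
        return dT > dT_local or dT < 0
    return False

# The critical case for timing analysis: a local speedup (e.g., a cache
# hit instead of a miss) leads to a globally longer execution.
print(is_timing_anomaly(-10, 5))  # True  (dT_local < 0 and dT > 0)
print(is_timing_anomaly(10, 7))   # False (delay absorbed: 0 <= dT <= dT_local)
```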
14.1.3 Contexts
The contribution of an individual instruction to the total execution time of a program may vary widely
depending on the execution history. For example, the first iteration of a loop typically loads the caches, and
later iterations profit from the loaded memory blocks being in the caches. In this case, the execution of an
instruction in a first iteration encounters one or more cache misses and pays with the cache-miss penalty.
Later executions, however, will execute much faster because they hit the cache. A similar observation holds
for dynamic branch predictors. They may need a few iterations until they stabilize and predict correctly.
Therefore, precision is increased if instructions are considered in their control-flow context, that is,
the way control reached them. Contexts are associated with basic blocks, that is, maximally long straight-
line code sequences that can only be entered at the first instruction and left at the last. They indicate
through which sequence of function calls and loop iterations control arrived at the basic block. Thus,
when analyzing the cache behavior of a loop, precision can be increased by regarding the first iteration of
the loop and all other iterations separately; more precisely, by unrolling the loop once and then analyzing the
resulting code.¹
Definition 14.1 Let p be a program with set of functions P = {p_1, p_2, ..., p_n} and set of loops
L = {l_1, l_2, ..., l_n}. A word c over the alphabet P ∪ L ∪ IN is called a context for a basic block b, if b
can be reached by calling the functions and iterating through the loops in the order given in c.

¹Actually, this unrolling transformation need not really be performed, but can be incorporated into the iteration
strategy of the analyzer. So, we talk of virtually unrolling the loops.
Even if all loops have static loop bounds and recursion is also bounded, there are in general too many
contexts to consider them exhaustively. A heuristic is used to keep relevant contexts apart and summarize
the rest conservatively, if their influence on the behavior of instructions does not significantly differ.
Experience has shown [10] that a few first iterations and recursive calls are sufficient to stabilize the
behavior information, as the above example indicates, and that the right differentiation of contexts is
decisive for the precision of the prediction [13].
A particular choice of contexts transforms the call graph and the control-flow graph into a context-extended
control-flow graph by virtually unrolling the loops and virtually inlining the functions as indicated by the
contexts. The formal treatment of this concept is quite involved and shall not be given here. It can be
found in Reference 14.
14.2 Cache-Behavior Prediction
Abstract Interpretation [15] is used to compute invariants about cache contents. How the behavior of
programs on processor pipelines is predicted follows in Section 14.3.
14.2.1 Cache Memories
A cache can be characterized by three major parameters:
Capacity is the number of bytes it may contain.
Line size (also called block size) is the number of contiguous bytes that are transferred from memory
on a cache miss. The cache can hold at most n = capacity/line size blocks.
Associativity is the number of cache locations where a particular block may reside.
n/associativity is the number of sets of a cache.
If a block can reside in any cache location, then the cache is called fully associative. If a block can reside in
exactly one location, then it is called direct mapped. If a block can reside in exactly A locations, then the
cache is called A-way set associative. The fully associative and the direct mapped caches are special cases
of the A-way set associative cache where A = n and A = 1, respectively.
In the case of an associative cache, a cache line has to be selected for replacement when the cache is
full and the processor requests further data. This is done according to a replacement strategy. Common
strategies are LRU (Least Recently Used), FIFO (First In First Out), and random.
The set where a memory block may reside in the cache is uniquely determined by the address of the
memory block, that is, the behavior of the sets is independent of each other. The behavior of an A-way set
associative cache is completely described by the behavior of its n/A fully associative sets. This holds also
for direct mapped caches where A = 1.
For the sake of space, we restrict our description to the semantics of fully associative caches with LRU
replacement strategy. More complete descriptions that explicitly describe direct mapped and A-way set
associative caches can be found in References 8 and 16.
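The parameters above determine the cache geometry directly. A small sketch, assuming the common scheme in which the set index is the block address modulo the number of sets (the text only states that the set is uniquely determined by the address):

```python
def cache_geometry(capacity, line_size, associativity):
    """Number of blocks n and number of sets of a cache."""
    n = capacity // line_size      # total number of cache lines
    sets = n // associativity      # each set behaves independently
    return n, sets

def set_index(address, line_size, sets):
    """Set a memory block maps to, under the assumed block-address
    modulo mapping."""
    block = address // line_size   # memory block number
    return block % sets

# Example: a 4-way set associative cache with 8 KB capacity, 16-byte lines.
n, sets = cache_geometry(8 * 1024, 16, 4)
print(n, sets)                     # 512 128
print(set_index(0x1234, 16, sets))  # block 0x123 maps to set 35
```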
14.2.2 Cache Semantics
In the following, we consider a (fully associative) cache as a set of cache lines L = {l_1, ..., l_n} and the
store as a set of memory blocks S = {s_1, ..., s_m}.
To indicate the absence of any memory block in a cache line, we introduce a new element I; S' = S ∪ {I}.

Definition 14.2 (concrete cache state) A (concrete) cache state is a function c : L → S'.

C_c denotes the set of all concrete cache states. The initial cache state c_I maps all cache lines to I.
If c(l_i) = s_y for a concrete cache state c, then i is the relative age of the memory block according to the
LRU replacement strategy and not necessarily the physical position in the cache hardware.
[Figure: referencing a block s moves s into the youngest position of the cache (ages ordered from young to old); the blocks younger than its previous position age by one, and on a miss the oldest block is evicted.]
FIGURE 14.4 Update of a concrete fully associative (sub-) cache.
The update function describes the effect on the cache of referencing a block in memory. The referenced
memory block s_x moves into l_1 if it was in the cache already. All memory blocks in the cache that had
been used more recently than s_x increase their relative age by one, that is, they are shifted by one position
to the next cache line. If the referenced memory block was not yet in the cache, it is loaded into l_1 after all
memory blocks in the cache have been shifted and the oldest, that is, least recently used, memory block
has been removed from the cache if the cache was full.

Definition 14.3 (cache update) A cache update function U : C_c × S → C_c determines the new cache state
for a given cache state and a referenced memory block.
Updates of fully associative caches with LRU replacement strategy are pictured as in Figure 14.4.
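A minimal executable sketch of this LRU update, representing a cache state as a list ordered by relative age (index 0 is l_1, the youngest line) and using None in the role of the element I:

```python
# Sketch of the concrete LRU update function U of Definition 14.3 for a
# fully associative cache. Index 0 of the list is l_1 (youngest line),
# the last index the oldest; None stands for an empty line (element I).

def update(cache, s):
    """Return the new cache state after referencing memory block s."""
    new = list(cache)
    if s in new:      # hit: s moves to l_1, younger blocks age by one
        new.remove(s)
    else:             # miss: the oldest entry is shifted out (evicted)
        new.pop()
    return [s] + new

c = ["z", "y", "x", "t"]    # fully associative cache with 4 lines
c = update(c, "s")          # miss: t is evicted
print(c)                    # ['s', 'z', 'y', 'x']
c = update(c, "y")          # hit: y becomes the youngest block
print(c)                    # ['y', 's', 'z', 'x']
```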
14.2.2.1 Control Flow Representation
We represent programs by control-flow graphs consisting of nodes and typed edges. The nodes represent
basic blocks. A basic block is a sequence (of fragments) of instructions in which control flow enters
at the beginning and leaves at the end without halt or possibility of branching except at the end. For
cache analysis, it is most convenient to have one memory reference per control-flow node. Therefore,
our nodes may represent the different fragments of machine instructions that access memory. For data
references whose addresses are not precisely determined, one can use a set of possibly referenced memory blocks.
We assume that for each basic block, the sequence of references to memory is known (this is appropriate
for instruction caches but can be too restrictive for data caches and combined caches; see References 7
and 16 for weaker restrictions), that is, there exists a mapping from control-flow nodes to sequences of
memory blocks: L : V → S*.
We can describe the effect of such a sequence on a cache with the help of the update function U. There-
fore, we extend U to sequences of memory references by sequential composition: U(c, ⟨s_x1, ..., s_xy⟩) =
U(...(U(c, s_x1)), ..., s_xy).
The cache state for a path (k_1, ..., k_p) in the control-flow graph is given by applying U to the
initial cache state c_I and the concatenation of all sequences of memory references along the path:
U(c_I, L(k_1), ..., L(k_p)).
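The sequential composition above is simply a left fold of the update function over the reference sequence. A sketch building on the list-based LRU update; the mapping L used in the example is hypothetical:

```python
from functools import reduce

def update(cache, s):
    """Concrete LRU update of a fully associative cache (list ordered by
    age, youngest first; None stands for an empty line)."""
    new = list(cache)
    if s in new:
        new.remove(s)
    else:
        new.pop()
    return [s] + new

def update_seq(cache, refs):
    """U extended to a sequence of memory references by composition."""
    return reduce(update, refs, cache)

def path_state(initial, path, L):
    """Cache state after a path (k_1, ..., k_p): apply U to the
    concatenation of the reference sequences L(k) along the path."""
    return update_seq(initial, [s for k in path for s in L(k)])

c_I = [None, None]                        # initial cache: all lines empty
L = {"k1": ["a", "b"], "k2": ["a"]}.get   # hypothetical mapping L : V -> S*
print(path_state(c_I, ["k1", "k2"], L))   # ['a', 'b']
```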
The Collecting Semantics of a program gathers at each program point the set of all execution states,
which the program may encounter at this point during some execution. A semantics on which to base
a cache analysis has to model cache contents as part of the execution state. One could thus compute the
collecting semantics and project the execution states onto their cache components to obtain the set of
all possible cache contents for a given program point. However, the collecting semantics is in general
not computable.
Instead, one restricts the standard semantics to only those program constructs that involve the
cache, that is, memory references. Only they have an effect on the cache, modelled by the cache update
function U. This coarser semantics may execute program paths that are not executable in the standard
semantics. Therefore, the Collecting Cache Semantics of a program computes a superset of the set of all
concrete cache states occurring at each program point.
Definition 14.4 (Collecting Cache Semantics) The Collecting Cache Semantics of a program is

C_coll(p) = {U(c_I, L(k_1), ..., L(k_n)) | (k_1, ..., k_n) path in the CFG leading to p}
This collecting semantics would be computable, although often of enormous size. Therefore, another
step abstracts it into a compact representation, so-called abstract cache states. Note that all information
drawn from the abstract cache states allows one to safely deduce information about sets of concrete cache
states, that is, only precision may be reduced in this two-step process. Correctness is guaranteed.
14.2.3 Abstract Semantics
The specification of a program analysis consists of the specification of an abstract domain and of the
abstract semantic functions, mostly called transfer functions. The least upper bound operator of the
domain combines information when control flow merges.
We present two analyses. The must analysis determines a set of memory blocks that are in the cache at
a given program point whenever execution reaches this point. The may analysis determines all memory
blocks that may be in the cache at a given program point. The latter analysis is used to determine the
absence of a memory block in the cache.
The analyses are used to compute a categorization for each memory reference describing its cache
behavior. The categories are described in Table 14.1.
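The join operators that combine must and may information at control-flow merges, and the resulting categorization, can be sketched as follows. The age-based formulation (must join keeps the older age and intersects, may join keeps the younger age and unites) is the classic one for LRU caches and is assumed here, not taken verbatim from the definitions in this section:

```python
# Sketch of the join (least upper bound) operators of the must and may
# cache analyses and of the categorization of Table 14.1. An abstract
# cache state is a list of sets of blocks, index = relative age.

def join_must(a, b):
    """Blocks guaranteed in the cache on both paths; keep the OLDER age."""
    ages_a = {s: i for i, line in enumerate(a) for s in line}
    ages_b = {s: i for i, line in enumerate(b) for s in line}
    out = [set() for _ in range(len(a))]
    for s in ages_a.keys() & ages_b.keys():
        out[max(ages_a[s], ages_b[s])].add(s)
    return out

def join_may(a, b):
    """Blocks possibly in the cache on some path; keep the YOUNGER age."""
    ages = {}
    for cache in (a, b):
        for i, line in enumerate(cache):
            for s in line:
                ages[s] = min(i, ages.get(s, i))
    out = [set() for _ in range(len(a))]
    for s, i in ages.items():
        out[i].add(s)
    return out

def classify(s, must, may):
    """Categorize a reference to block s as ah, am, or nc (Table 14.1)."""
    if any(s in line for line in must):
        return "ah"   # always hit: guaranteed to be in the cache
    if not any(s in line for line in may):
        return "am"   # always miss: cannot be in the cache
    return "nc"       # not classified

must = join_must([{"a"}, {"b"}], [{"b"}, {"a"}])
may = join_may([{"a"}, {"b"}], [{"c"}, set()])
print(classify("a", must, may))  # 'ah': in the must cache
print(classify("d", must, may))  # 'am': not even in the may cache
print(classify("c", must, may))  # 'nc': only in the may cache
```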
The domains for our abstract interpretations consist of abstract cache states.
Definition 14.5 (abstract cache state) An abstract cache state ĉ : L → 2^S maps cache lines to sets of
memory blocks.
Ĉ denotes the set of all abstract cache states.
The position of a line in an abstract cache will, as in the case of concrete caches, denote the relative age
of the corresponding memory blocks. Note, however, that the domains of abstract cache states will have
different partial orders and that the interpretation of abstract cache states will be different in the different
analyses.
The following functions relate concrete and abstract domains. An extraction function, extr, maps a
concrete cache state to an abstract cache state. The abstraction function, abstr, maps sets of concrete cache
states to their best representation in the domain of abstract cache states. It is induced by the extraction
function. The concretization function, concr, maps an abstract cache state to the set of all concrete cache
states represented by it. It allows to interpret abstract cache states. It is often induced by the abstraction
function, cf. Reference 17.
Definition 14.6 (extraction, abstraction, concretization functions) The extraction function extr :
C_c → Ĉ forms singleton sets from the images of the concrete cache states it is applied to, that is,
extr(c)(l_i) = {s_x} if c(l_i) = s_x.
The abstraction function abstr : 2^(C_c) → Ĉ is defined by abstr(C) = ⊔{extr(c) | c ∈ C}.
The concretization function concr : Ĉ → 2^(C_c) is defined by concr(ĉ) = {c | extr(c) ⊑ ĉ}.
TABLE 14.1 Categorizations of Memory References and Memory Blocks
Category Abbreviation Meaning
Always hit ah The memory reference will always result in a cache hit.
Always miss am The memory reference will always result in a cache miss.
Not classified nc The memory reference could be classified as neither ah nor am.
So much for the commonalities of all the domains to be designed. Note that all the constructions are
parameterized in ⊔ and ⊑.
The transfer functions, the abstract cache update functions, all denoted Û, describe the effect of a
memory reference on an abstract cache state. The set of analysis states is the powerset of the set C of
pipeline states:

A = 2^C (14.1)
The analysis of a basic block in some context c yields a set of traces t_1, t_2, ..., t_m through the pipeline.
max({|t_1|, |t_2|, ..., |t_m|}) is the bound for this basic block in this context.
The set of output states {last(t_1), last(t_2), ..., last(t_m)} will be passed on to the successor block(s) in
context c as initial states.
Basic blocks (in some context) having more than one predecessor receive the union of the set of output
states as initial states.
The abstraction we use as analysis states is a set of abstract pipeline states, since the number of possible
pipeline states for one instruction is not too big. Hence, our abstraction computes an upper bound to
the collecting semantics. The abstract update for an analysis state a is thus the application of the concrete
update on each abstract pipeline state in a extended with the possibility of multiple successor states in
case of uncertainties.
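The lifting of the concrete update to sets of pipeline states can be sketched directly; the toy pipeline state and its successor function below are hypothetical stand-ins for the real ones:

```python
# Sketch of the abstract update on analysis states: an analysis state is
# a set of pipeline states, and the update applies the (possibly
# non-deterministic) concrete update to each of them.

def concrete_update(state):
    """Cycle-wise successors of one pipeline state. Uncertainty (here: an
    unknown cache outcome) yields several successor states."""
    stage, cycles = state
    if stage == "fetch":   # neither a hit nor a miss can be excluded
        return {("execute", cycles + 1), ("execute", cycles + 30)}
    return {("done", cycles + 1)}

def abstract_update(analysis_state):
    """Apply the concrete update to every pipeline state in the set and
    take the union of all successor sets."""
    return set().union(*(concrete_update(s) for s in analysis_state))

a = {("fetch", 0)}
a = abstract_update(a)
print(sorted(a))   # [('execute', 1), ('execute', 30)]
a = abstract_update(a)
print(sorted(a))   # [('done', 2), ('done', 31)]
```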
Figure 14.9 shows the possible pipeline states for a basic block in this example. Such pictures are shown
by the aiT tool upon special demand. The large dark grey boxes correspond to the instructions of the basic
block, and the smaller rectangles in them stand for individual pipeline states. Their cyclewise evolution is
indicated by the strokes connecting them. Each layer in the trees corresponds to one CPU cycle. Branches
in the trees are caused by conditions that could not be statically evaluated, for example, a memory access
with unknown address in presence of memory areas with different access times. On the other hand,
two pipeline states fall together when details they differ in leave the pipeline. This happened, for instance,
at the end of the second instruction, reducing the number of states from four to three.
The update function belonging to an edge (v, v′) of the control-flow graph updates each abstract
pipeline state separately. When the bus unit is updated, the pipeline state may split into several successor
states with different cache states. The initial analysis state is a set of empty pipeline states plus a cache
that represents a cache with unknown content. There can be multiple concrete pipeline states in the initial
that represents a cache with unknown content. There can be multiple concrete pipeline states in the initial
states, since the adjustment of internal to external clock of the processor is not known in the beginning
and every possibility (aligned, one cycle apart, etc.) has to be considered. Thus prefetching must start from
FIGURE 14.9 Possible pipeline states in a basic block.
scratch, but pending bus requests are ignored. To obtain correct results, they must be taken into account
by adding a fixed penalty to the calculated upper bounds.
14.3.4 Pipeline Modeling
The basis for pipeline analysis is a model of an abstract version of the processor pipeline, which is
conservative with respect to the timing behavior, that is, times predicted by the abstract pipeline must
never be lower than those observed in concrete executions. Some terminology is needed to avoid confusion.
Processors have concrete pipelines, which may be described in some formal language, for example, VHDL.
If this is the case, there exists a formal model of the pipeline. Our abstraction step, by which we eliminate
many components of a concrete pipeline that are not relevant for the timing behavior, leads us to an abstract
pipeline. This may again be described in a formal language, for example, VHDL, and thus have a formal
model. Deriving an abstract pipeline is a complex task. It is demonstrated for the Motorola ColdFire
processor, a processor quite popular in the aeronautics and the submarine industry. The presentation
follows closely that of Reference 18.²
14.3.4.1 The ColdFire MCF 5307 Pipeline
The pipeline of the ColdFire MCF 5307 consists of a fetch pipeline that fetches instructions from memory
(or the cache), and an execution pipeline that executes instructions, cf. Figure 14.10. Fetch and execution
pipelines are connected and, as far as speed is concerned, decoupled by a FIFO instruction buffer that can
hold at most 8 instructions.
The MCF 5307 accesses memory through a bus hierarchy. The fast pipelined K-bus connects the cache
and an internal 4KB SRAM area to the pipeline. Accesses to this bus are performed by the IC1/IC2 and the
AGEX and DSOC stages of the pipeline. On the next level, the M-Bus connects the K-Bus to the internal
peripherals. This bus runs at the external bus frequency, while the K-Bus is clocked with the faster internal
core clock. The M-Bus connects to the external bus, which accesses off-chip peripherals and memory.
The fetch pipeline performs branch prediction in the IED stage, redirecting fetching long before the
branch reaches the execution stages. The fetch pipeline is stalled if the instruction buffer is full, or if
the execution pipeline needs the bus for a memory access. All these stalls cause the pipeline to wait for
one cycle. After that, the stall condition is checked again.
The fetch pipeline is also stalled if the memory block to be fetched is not in the cache (cache miss). The
pipeline must wait until the memory block is loaded into the cache and forwarded to the pipeline. The
instructions that are already in the later stages of the fetch pipeline are forwarded to the instruction buffer.
²The model of the abstract pipeline of the MCF 5307 has been derived by hand. A computer-supported derivation
would have been preferable. Ways to develop this are the subject of current research.
[Figure: the instruction fetch pipeline (IFP) comprises the stages IAG (instruction address generation), IC1 and IC2 (instruction fetch cycles 1 and 2), and IED (instruction early decode), feeding a FIFO instruction buffer (IB); the operand execution pipeline (OEP) comprises the stages DSOC (decode & select, operand fetch) and AGEX (address generation, execute), connected to Address[31:0] and Data[31:0].]
FIGURE 14.10 The pipeline of the Motorola ColdFire 5307 processor.
The execution pipeline finishes the decoding of instructions, evaluates their operands, and executes
the instructions. Each kind of operation follows a fixed schedule. This schedule determines how many
cycles the operation needs and in which cycles memory is accessed.³ The execution time varies between
2 cycles and several dozen cycles.
Pipelining admits a maximum overlap of 1 cycle between consecutive
instructions: the last cycle of each instruction may overlap with the first of the next one. In this first cycle,
no memory access and no control-flow alteration happen. Thus, cache and pipeline cannot be affected by
two different instructions in the same cycle. The execution of an instruction is delayed if memory accesses
lead to cache misses. Misaligned accesses lead to small time penalties of 1 to 3 cycles. Store operations
are delayed if the distance to the previous store operation is less than 2 cycles. (This does not hold if the
previous store operation was issued by a MOVEM instruction.) The start of the next instruction is delayed
if the instruction buffer is empty.
14.3.5 Formal Models of Abstract Pipelines
An abstract pipeline can be seen as a big finite state machine, which makes a transition on every clock cycle.
The states of the abstract pipeline, although greatly simplified, still contain all timing-relevant information
³In fact, there are some instructions, like MOVEM, whose execution schedule depends on the value of an argument
given as immediate constant. These instructions can be taken into account by special means.
of the processor. The number of transitions it takes from the beginning of the execution of an instruction
until its end gives the execution time of that instruction.
The abstract pipeline, although greatly reduced by leaving out irrelevant components, still is a really big
finite state machine, but it has structure. Its states can be naturally decomposed into components according
to the architecture. This makes it easier to specify, verify, and implement a model of an abstract pipeline.
In the formal approach presented here, an abstract pipeline state consists of several units with inner states
that communicate with one another and the memory via signals, and evolve cycle-wise according to their
inner state and the signals received. Thus, the means of decomposition are units and signals.
Signals may be instantaneous, meaning that they are received in the same cycle as they are sent, or delayed,
meaning that they are received one cycle after they have been sent. Signals may carry data, for example,
a fetch address. Note that these signals are only part of the formal pipeline model. They may or may
not correspond to real hardware signals. The instantaneous signals between units are used to transport
information between the units. The state transitions are coded in the evolution rules local to each unit.
Figure 14.11 shows the formal pipeline model for the ColdFire MCF 5307. It consists of the following
units: IAG (instruction address generation), IC1 (instruction fetch cycle 1), IC2 (instruction fetch cycle 2),
IED (instruction early decode), IB (instruction buffer), EX (execution unit), and SST (store stall timer).
In addition, there is a bus unit modeling the buses that connect the CPU, the static RAM, the cache, and
[Figure: the units IAG, IC1, IC2, IED, IB, EX, and SST next to a bus unit; arrows carry signals such as set(a)/stop, addr(a), fetch(a), code(a), await(a), put(a), instr, start, store, next, cancel, hold, wait, read(A)/write(A), and data/hold.]
FIGURE 14.11 Abstract model of the Motorola ColdFire 5307 processor.
the main memory. The signals between these units are shown as arrows. Most units directly correspond
to a stage in the real pipeline. However, the SST unit is used to model the fact that two stores must be
separated by at least two clock cycles. It is implemented as a (virtual) counter. The two stages of the
execution pipeline are modeled by a single stage, EX, because instructions can only overlap by one cycle.
The inner states and emitted signals of the units evolve in each cycle. The complexity of this state
update varies from unit to unit. It can be as simple as a small table, mapping pending signals and inner
state to a new state and signals to be emitted, for example, for the IAG unit and the IC1 unit. It can be
much more complicated, if multiple dependencies have to be considered, for example, the instruction
reconstruction and branch prediction in the IED stage. In this case, the evolution is formulated in pseudo
code. Full details on the model can be found in Reference 19.
14.3.6 Pipeline States
Abstract Pipeline States are formed by combining the inner states of IAG, IC1, IC2, IED, IB, EX, SST,
and bus unit plus additional entries for pending signals into one overall state. This overall state evolves
from one cycle to the next. Practically, the evolution of the overall pipeline state can be implemented by
updating the functional units one by one in an order that respects the dependencies introduced by input
signals and the generation of these signals.
14.3.6.1 Update Function for Pipeline States
For pipeline modeling, one needs a function that describes the evolution of the concrete pipeline state while traveling along an edge (v, v') of the control-flow graph. This function can be obtained by iterating the cycle-wise update function of the previous paragraph.
An initial concrete pipeline state at v has an empty execution unit EX. It is updated until an instruction is sent from IB to EX. Updating of the concrete pipeline state continues, using the knowledge that the successor instruction is v', until EX has become empty again. The number of cycles needed from the beginning until this point can be taken as the time needed for the transition from v to v' for this concrete pipeline state.
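The iteration just described can be sketched as follows. The two-field state and the latency numbers are invented and drastically simpler than the real unit states:

```python
def transition_time(latencies):
    """Iterate a toy cycle-wise update, starting from a state whose EX
    unit is empty, until EX has drained again; the cycle count is the
    time attributed to the control-flow edge. `latencies` holds the
    EX-cycles of the instructions waiting in the instruction buffer IB
    (invented numbers, far simpler than the real unit states)."""
    assert latencies, "need at least one instruction in IB"
    state = {"ib": list(latencies), "ex": 0}   # ex = cycles left in EX
    cycles = 0
    entered = False
    while not (entered and state["ex"] == 0):
        if state["ex"] == 0 and state["ib"]:
            state["ex"] = state["ib"].pop(0)   # issue: IB -> EX
            entered = True
        else:
            state["ex"] -= 1                   # EX progresses one cycle
        cycles += 1
    return cycles

print(transition_time([3]))   # 4: one issue cycle plus three EX cycles
```

In the abstract model, the same iteration runs on sets of such states, and the maximum cycle count over the set gives the bound for the edge.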
14.4 Path Analysis Using Integer Linear Programming
The structure of a program and the set of program paths can be mapped to an ILP in a very natural way. A set of constraints describes the control flow of the program. Solving these constraints yields very precise results [5]. However, requirements for precision of the results demand analyzing basic blocks in different contexts, that is, distinguished by the ways in which control reached them. This makes the control flow quite complex, so that the mapping to an ILP may be very complex [14].
A problem formulated as an ILP consists of two parts: the cost function and constraints on the variables used in the cost function. Our cost function represents the number of CPU cycles. Correspondingly, it has to be maximized. Each variable in the cost function represents the execution count of one basic block of the program and is weighted by the execution time of that basic block. Additionally, variables are used corresponding to the traversal counts of the edges in the control flow graph, see Figure 14.12.
The integer constraints describing how often basic blocks are executed relative to each other can be automatically generated from the control flow graph (Figure 14.13). However, additional information about the program provided by the user is usually needed, as the problem of finding the worst-case program path is unsolvable in the general case. Loop and recursion bounds cannot always be inferred automatically and must therefore be provided by the user.
The ILP approach for program path analysis has the advantage that users are able to describe in precise terms virtually anything they know about the program by adding integer constraints. The system first generates the obvious constraints automatically and then adds user-supplied constraints to tighten the WCET bounds.
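This style of formulation can be illustrated on an if-then-else inside a loop, in the spirit of Figure 14.12. The block times and the loop bound below are invented, and a brute-force search over the integer variables stands in for a real ILP solver:

```python
from itertools import product

# Invented basic-block times in cycles for an if-then-else inside a loop:
# v1 = test, v2 = then-branch, v3 = else-branch, v4 = join.
TIME = {"v1": 2, "v2": 5, "v3": 3, "v4": 1}
LOOP_BOUND = 10   # user-supplied: the construct executes at most 10 times

def wcet_bound():
    best = 0
    # trav_then / trav_else are the edge-traversal variables of the split.
    for trav_then, trav_else in product(range(LOOP_BOUND + 1), repeat=2):
        cnt_v1 = trav_then + trav_else   # flow preservation at the split
        cnt_v4 = trav_then + trav_else   # flow preservation at the join
        if cnt_v1 > LOOP_BOUND:          # the user-supplied loop bound
            continue
        # Cost function: execution counts weighted by block times.
        cost = (TIME["v1"] * cnt_v1 + TIME["v2"] * trav_then +
                TIME["v3"] * trav_else + TIME["v4"] * cnt_v4)
        best = max(best, cost)
    return best

print(wcet_bound())   # 80: all 10 iterations take the longer then-branch
```

A real system hands the same cost function and constraints to an ILP solver, for which maximization over thousands of variables remains tractable, unlike this exhaustive search.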
2006 by Taylor & Francis Group, LLC
14-18 Embedded Systems Handbook
FIGURE 14.12 A program snippet (if v1 then ... else ... fi), the corresponding control flow graph with nodes v1, v2, v3, v4 and edges e1 to e6, and the ILP variables generated: cnt(v1), ..., cnt(v4) for the nodes and trav(e1), ..., trav(e6) for the edges.
FIGURE 14.13 Control flow joins and splits and flow-preservation laws: for a node v with incoming edges e1, ..., en and outgoing edges e'1, ..., e'm, the traversal counts satisfy trav(e1) + ... + trav(en) = cnt(v) = trav(e'1) + ... + trav(e'm).
14.5 Other Ingredients
14.5.1 Value Analysis
A static method for data-cache behavior prediction needs to know effective memory addresses of data,
in order to determine where a memory access goes. However, effective addresses are only available at run
time. Interval analysis as described by Cousot and Halbwachs [20] can help here. It can compute intervals
for address-valued objects like registers and variables. An interval computed for such an object at some
program point bounds the set of potential values the object may have when program execution reaches
this program point. Such an analysis, called value analysis in aiT, has been shown to be able to determine many effective addresses in disciplined code statically [10].
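The underlying interval arithmetic can be sketched as follows. This is a generic textbook interval domain, not aiT's implementation, and the base address, stride, and index bounds are assumed values:

```python
# Minimal interval domain for address-valued objects.
class Interval:
    def __init__(self, lo, hi):
        self.lo, self.hi = lo, hi

    def __add__(self, other):
        # [a, b] + [c, d] = [a + c, b + d]
        return Interval(self.lo + other.lo, self.hi + other.hi)

    def scale(self, k):
        # Multiply by a non-negative constant stride k.
        return Interval(self.lo * k, self.hi * k)

    def __repr__(self):
        return f"[{self.lo}, {self.hi}]"

# An access a[i] with element size 4 at base 0x1000, where the analysis
# has already bounded the loop counter i to [0, 99]:
base = Interval(0x1000, 0x1000)
i = Interval(0, 99)
addr = base + i.scale(4)
print(addr)   # [4096, 4492]
```

Every concrete address of the access lies in the computed interval, which is enough for cache analysis to decide which memory area, and which cache sets, the access can touch.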
14.5.2 Control Flow Specification and Analysis
Any information about the possible flow of control of the program may increase the precision of the subsequent analyses. Control flow analysis may attempt to exclude infeasible paths, determine execution frequencies of paths or the relation between execution frequencies of different paths or subpaths, etc.
The purpose of control flow analysis is to determine the dynamic behavior of the program. This includes information about what functions are called and with which arguments, how many times loops iterate, if there are dependencies between successive if-statements, etc. The main focus of flow analysis has been the determination of loop bounds, since the bounding of loops is a necessary step in order to find an execution time bound for a program.
Control-flow analysis can be performed manually or automatically. Automatic analyses have been
based on various techniques, like symbolic execution, abstract interpretation, and pattern recognition
on parse trees. The best precision is achieved by using interprocedural analysis techniques, but this has
to be traded off with the extra computation time and memory required. All automatic techniques allow
a user to complement the results and guide the analysis using manual annotations, since this is sometimes
necessary in order to obtain reasonable results.
Since the flow analysis in general is performed separately from the path analysis, it does not know the
execution times of individual program statements, and must thus generate a safe (over)approximation
including all possible program executions. The path analysis will later select the path from the set of
possible program paths that corresponds to the upper bound using the time information computed by
processor behavior prediction.
Determining Bounds on Execution Times 14-19
Control flow specification is preferably done on the source level. Concepts based on source-level constructs are used in References 6 and 21.
14.5.3 Frontends for Executables
Any reasonably precise timing analysis takes fully linked executable programs as input. Source programs do not contain information about program and data allocation, which is essential for the described methods to predict the cache behavior.
Executables must be analyzed to reconstruct the original control flow of the program. This may be a difficult task depending on the instruction set of the processor and the code generation of the used
compiler. A generic approach to this problem is described in References 14, 22, and 23.
14.6 Related Work
It is not possible in general to obtain upper bounds on running times for programs. Otherwise, one could
solve the halting problem. However, real-time systems only use a restricted form of programming, which
guarantees that programs always terminate. That is, recursion is not allowed (or explicitly bounded) and
the maximal iteration counts of loops are known in advance.
A worst-case running time of a program could easily be determined if the worst-case input for the
program were known. This is in general not the case. The alternative, to execute the program with all
possible inputs, is often prohibitively expensive. As a consequence, approximations for the worst-case
execution time are determined. Two classes of methods to obtain bounds can be distinguished:
Dynamic methods employ real program executions to obtain approximations. These approxima-
tions are unsafe as they only compute the maximum of a subset of all executions.
Static methods only need the program itself, maybe extended with some additional information
(like loop bounds).
14.6.1 A (Partly) Dynamic Method
A traditional method, still used in industry, combines measuring and static methods. Here, small snippets
of code are measured for their execution time, then a safety margin is applied and the results for code
pieces are combined according to the structure of the whole task. For example, if a task first executes a snippet A and then a snippet B, the resulting time is that measured for A, t_A, added to that measured for B, t_B: t = t_A + t_B. This reduces the amount of measurements that have to be made, as code snippets tend to
be reused a lot in control software and only the different snippets need to be measured. It adds, however,
the need for an argumentation about the correctness of the composition step of the measured snippet
times. This typically relies on certain implicit assumptions about the worst-case initial execution state
for these measurements. For example, the snippets are measured with an empty cache at the beginning of
the measurement under the assumption that this is the worst-case cache state. In Reference 19 it is shown
that this assumption can be wrong. The problem of unknown worst-case input exists for this method
as well, and it is still infeasible to measure execution times for all input values.
14.6.2 Purely Static Methods
14.6.2.1 The Timing-Schema Approach
In the timing-schemata approach [24], bounds for the execution times of a composed statement are computed from the bounds of its constituents. One timing schema is given for each type of statement. The basis is formed by known times for the atomic statements. These are assumed to be constant and available from a manual, or are assumed to be computed in a preceding phase. A bound for the whole program is obtained by combining results according to the structure of the program.
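The compositional computation performed by timing schemata can be sketched as follows. The statement encoding and all cycle counts are invented for illustration:

```python
# One bound-combination rule per statement type; times of atomic
# statements are assumed constants (invented values below).
def bound(stmt):
    kind = stmt[0]
    if kind == "atomic":             # ("atomic", cycles)
        return stmt[1]
    if kind == "seq":                # ("seq", s1, s2): add the bounds
        return bound(stmt[1]) + bound(stmt[2])
    if kind == "if":                 # ("if", cond, then, else)
        return bound(stmt[1]) + max(bound(stmt[2]), bound(stmt[3]))
    if kind == "loop":               # ("loop", max_iter, cond, body)
        n = stmt[1]                  # the test runs once more than the body
        return (n + 1) * bound(stmt[2]) + n * bound(stmt[3])
    raise ValueError(kind)

prog = ("seq",
        ("if", ("atomic", 1), ("atomic", 5), ("atomic", 3)),
        ("loop", 10, ("atomic", 1), ("atomic", 4)))
print(bound(prog))   # (1 + 5) + (11*1 + 10*4) = 57
```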
The precision can be very bad because of some implicit assumptions underlying this method. Timing schemata assume compositionality of bounds for execution times, that is, they compute bounds for execution times of composed constructs from already computed bounds of the constituents. However, as we have seen, the execution times of the constituents depend heavily on the execution history.
14.6.2.2 Symbolic Simulation
Another static method simulates the execution of the program on an abstract model of the processor. The
simulation is performed without input; the simulator thus has to be capable of dealing with partly unknown
execution states. This method combines ow analysis, processor-behavior prediction, and path analysis
in one integrated phase [25,26]. One problem with this approach is that analysis time is proportional to
the actual execution time of the program with a usually large factor for doing a simulation.
14.6.2.3 WCET Determination by ILP
Li, Malik, and Wolfe proposed an ILP-based approach to WCET determination [27-30]. Cache and pipeline behavior prediction are formulated as a single linear program. The i960KB is investigated, a 32-bit microprocessor with a 512-byte direct-mapped instruction cache and a fairly simple pipeline. Only structural hazards need to be modeled, thus keeping the complexity of the integer linear program moderate compared to the expected complexity of a model for a modern microprocessor. Variable execution times, branch prediction, and instruction prefetching are not considered at all. Using this approach for superscalar pipelines does not seem very promising, considering the analysis times reported in one of the articles.
One of the severe problems is the exponential increase of the size of the ILP in the number of competing l-blocks. l-blocks are maximally long contiguous sequences of instructions in a basic block mapped to the same cache set. Two l-blocks mapped to the same cache set compete if they do not have the same address tag. For a fixed cache architecture, the number of competing l-blocks grows linearly with the size of the program. Differentiation by contexts, absolutely necessary to achieve precision, increases this number additionally. Thus, the size of the ILP is exponential in the size of the program. Even though the problem is claimed to be a network-flow problem, the size of the ILP is killing the approach. Growing associativity of the cache increases the number of competing l-blocks. Thus, increasing cache-architecture complexity also plays against this approach.
Nonetheless, their method of modeling the control flow as an ILP, the so-called Implicit Path Enumeration, is elegant and can be efficient if the size of the ILP is kept small. It has been adopted by many groups working in this area.
14.6.2.4 Timing Analysis by Static Program Analysis
The method described in this chapter uses a sequence of static program analyses for determining the program's control flow and its data accesses, and for predicting the processor's behavior for the given program.
An early approach to timing analysis using data-flow analysis methods can be found in References 31
and 32. Jakob Engblom showed how to precompute parts of a timing analyzer to speed up the actual
timing analysis for architectures without timing anomalies [33].
Reference 34 gives an overview of existing tools for timing analysis, both commercially available tools
and academic prototypes.
14.7 State of the Art and Future Extensions
The timing-analysis technology described in this chapter is realized in the aiT tool and is used in the
aeronautics and automotive industries. Several benchmarks have shown that the precision of the predicted upper bounds is on the order of 10% [10]. Obtaining such precision, however, requires competent users, since the available knowledge about the program's control flow may be difficult to specify.
The computational effort is high, but acceptable. Future optimizations will reduce this effort. As often
in static program analysis, there is a trade-off between precision and effort. Precision can be reduced if the
effort is intolerable.
The only real drawback of the described technology is the huge effort for producing abstract processor
models. Work is under way to support this activity through transformations on the VHDL level.
Acknowledgments
Many former students have worked on different parts of the method presented in this chapter and
have together built a timing-analysis tool satisfying industrial requirements. Christian Ferdinand studied cache analysis and showed that precise information about cache contents can be obtained. Stephan Thesing together with Reinhold Heckmann and Marc Langenbach developed methods to model abstract processors. Stephan went through the pains of implementing several abstract models for real-life processors such as the ColdFire MCF 5307 and the PPC 755. I owe him my thanks for help with the presentation of pipeline analysis. Henrik Theiling contributed the preprocessor technology for the analysis of executables and the translation of complex control flow to integer linear programs. Many thanks to him for his contribution to the path analysis section. Michael Schmidt implemented powerful versions of value analysis. Reinhold Heckmann managed to model even very complex cache architectures.
Florian Martin implemented the program-analysis generator, PAG, which is the basis for many of the
program analyses.
References
[1] Reinhold Heckmann, Marc Langenbach, Stephan Thesing, and Reinhard Wilhelm. The influence of processor architecture on the design and the results of WCET tools. IEEE Proceedings on Real-Time Systems, 91: 1038-1054, 2003.
[2] P. Puschner and Ch. Koza. Calculating the maximum execution time of real-time programs. Real-Time Systems, 1: 159-176, 1989.
[3] Chang Yun Park and Alan C. Shaw. Experiments with a program timing tool based on source-level timing schema. IEEE Computer, 24: 48-57, 1991.
[4] Christopher A. Healy, David B. Whalley, and Marion G. Harmon. Integrating the timing analysis of pipelining and instruction caching. In Proceedings of the IEEE Real-Time Systems Symposium, December 1995, pp. 288-297.
[5] Henrik Theiling, Christian Ferdinand, and Reinhard Wilhelm. Fast and precise WCET prediction by separated cache and path analyses. Real-Time Systems, 18: 157-179, 2000.
[6] Andreas Ermedahl. A Modular Tool Architecture for Worst-Case Execution Time Analysis. Ph.D. thesis, Uppsala University, Uppsala, Sweden, 2003.
[7] Martin Alt, Christian Ferdinand, Florian Martin, and Reinhard Wilhelm. Cache behavior prediction by abstract interpretation. In Proceedings of SAS '96, Static Analysis Symposium, Vol. 1145 of Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, 1996, pp. 52-66.
[8] Christian Ferdinand, Florian Martin, and Reinhard Wilhelm. Cache behavior prediction by abstract interpretation. Science of Computer Programming, 35: 163-189, 1999.
[9] C. Ferdinand, R. Heckmann, M. Langenbach, F. Martin, M. Schmidt, H. Theiling, S. Thesing, and R. Wilhelm. Reliable and precise WCET determination for a real-life processor. In Proceedings of the First International Workshop on Embedded Software, Vol. 2211 of Lecture Notes in Computer Science, Springer-Verlag, London, 2001, pp. 469-485.
[10] Stephan Thesing, Jean Souyris, Reinhold Heckmann, Famantanantsoa Randimbivololona, Marc Langenbach, Reinhard Wilhelm, and Christian Ferdinand. An abstract interpretation-based timing validation of hard real-time avionics software systems. In Proceedings of the 2003 International Conference on Dependable Systems and Networks (DSN 2003), IEEE Computer Society, Washington, 2003, pp. 625-632.
[11] Thomas Lundqvist and Per Stenström. Timing Anomalies in Dynamically Scheduled Microprocessors. In Proceedings of the 20th IEEE Real-Time Systems Symposium, December 1999, pp. 12-21.
[12] T. Reps, M. Sagiv, and R. Wilhelm. Shape analysis and applications. In Y.N. Srikant and Priti Shankar, Eds., The Compiler Design Handbook: Optimizations and Machine Code Generation, CRC Press, Boca Raton, FL, 2002, pp. 175-217.
[13] Florian Martin, Martin Alt, Reinhard Wilhelm, and Christian Ferdinand. Analysis of loops. In Proceedings of the International Conference on Compiler Construction (CC '98), Vol. 1383 of Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, 1998, pp. 80-94.
[14] Henrik Theiling. Control Flow Graphs for Real-Time Systems Analysis. Ph.D. thesis, Universität des Saarlandes, Saarbrücken, Germany, 2002.
[15] Patrick Cousot and Radhia Cousot. Abstract interpretation: a unified lattice model for static analysis of programs by construction or approximation of fixpoints. In Proceedings of the 4th ACM Symposium on Principles of Programming Languages, Los Angeles, CA, 1977, pp. 238-252.
[16] Christian Ferdinand. Cache Behavior Prediction for Real-Time Systems. Ph.D. thesis, Universität des Saarlandes, Saarbrücken, 1997.
[17] Flemming Nielson, Hanne Riis Nielson, and Chris Hankin. Principles of Program Analysis. Springer-Verlag, Heidelberg, 1999.
[18] Marc Langenbach, Stephan Thesing, and Reinhold Heckmann. Pipeline modelling for timing analysis. In Manuel V. Hermenegildo and German Puebla, Eds., Static Analysis Symposium SAS 2002, Vol. 2477 of Lecture Notes in Computer Science, Springer-Verlag, Heidelberg, 2002, pp. 294-309.
[19] Stephan Thesing. Safe and Precise WCET Determination by Abstract Interpretation of Pipeline Models. Ph.D. thesis, Saarland University, Saarbrücken, 2004.
[20] Patrick Cousot and Nicolas Halbwachs. Automatic discovery of linear restraints among variables of a program. In Proceedings of the 5th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, Tucson, AZ, ACM Press, New York, 1978, pp. 84-96.
[21] Andreas Ermedahl and Jan Gustafsson. Deriving annotations for tight calculation of execution time. In Proceedings of Euro-Par, 1997, pp. 1298-1307.
[22] Henrik Theiling. Extracting safe and precise control flow from binaries. In Proceedings of the Seventh International Conference on Real-Time Systems and Applications, IEEE Computer Society, 2000, pp. 23-30.
[23] Henrik Theiling. Generating decision trees for decoding binaries. In ACM SIGPLAN 2001 Workshop on Languages, Compilers, and Tools for Embedded Systems, 2001, pp. 112-120.
[24] Alan C. Shaw. Reasoning about time in higher-level language software. IEEE Transactions on Software Engineering, 15: 875-889, 1989.
[25] Thomas Lundqvist and Per Stenström. An integrated path and timing analysis method based on cycle-level symbolic execution. Real-Time Systems, 17: 183-207, 1999.
[26] Thomas Lundqvist. A WCET Analysis Method for Pipelined Microprocessors with Cache Memories. Ph.D. thesis, Department of Computer Engineering, Chalmers University of Technology, Sweden, 2002.
[27] Yau-Tsun Steven Li and Sharad Malik. Performance analysis of embedded software using implicit path enumeration. In Proceedings of the 32nd ACM/IEEE Design Automation Conference, June 1995, pp. 456-461.
[28] Yau-Tsun Steven Li, Sharad Malik, and Andrew Wolfe. Efficient microarchitecture modeling and path analysis for real-time software. In Proceedings of the IEEE Real-Time Systems Symposium, December 1995, pp. 298-307.
[29] Yau-Tsun Steven Li, Sharad Malik, and Andrew Wolfe. Performance estimation of embedded software with instruction cache modeling. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, November 1995, pp. 380-387.
[30] Yau-Tsun Steven Li, Sharad Malik, and Andrew Wolfe. Cache modeling for real-time software: beyond direct mapped instruction caches. In Proceedings of the IEEE Real-Time Systems Symposium, December 1996.
[31] R. Arnold, F. Mueller, D. Whalley, and M. Harmon. Bounding worst-case instruction cache performance. In Proceedings of the IEEE Real-Time Systems Symposium, Puerto Rico, December 1994, pp. 172-181.
[32] Frank Mueller, David B. Whalley, and Marion Harmon. Predicting instruction cache behavior. In Proceedings of the ACM SIGPLAN Workshop on Language, Compiler and Tool Support for Real-Time Systems, 1994.
[33] Jakob Engblom. Processor Pipelines and Static Worst-Case Execution Time Analysis. Ph.D. thesis, Uppsala University, Uppsala, Sweden, 2002.
[34] Reinhard Wilhelm, Jakob Engblom, Stephan Thesing, and David Whalley. The determination of worst-case execution times: introduction and survey of available tools, 2004 (submitted).
15
Performance Analysis
of Distributed
Embedded Systems
Lothar Thiele and
Ernesto Wandeler
Swiss Federal Institute of
Technology
15.1 Performance Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-1
Distributed Embedded Systems • Basic Terms • Role in the Design Process • Requirements
15.2 Approaches to Performance Analysis . . . . . . . . . . . . . . . . . . . 15-6
Simulation-Based Methods • Holistic Scheduling Analysis • Compositional Methods
15.3 The Performance Network Approach . . . . . . . . . . . . . . . . . . 15-11
Performance Network • Variability Characterization • Resource Sharing and Analysis • Concluding Remarks
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15-17
15.1 Performance Analysis
15.1.1 Distributed Embedded Systems
An embedded system is a special-purpose information processing system that is closely integrated into
its environment. It is usually dedicated to a certain application domain and knowledge about the system
behavior at design time can be used to minimize resources while maximizing predictability.
The embedding into a technical environment and the constraints imposed by a particular application
domain very often lead to heterogeneous and distributed implementations. In this case, systems are
composed of hardware components that communicate via some interconnection network. The functional
and nonfunctional properties of the whole system not only depend on the computations inside the various
nodes but also on the interaction of the various data streams on the common communication media.
In contrast to multiprocessor or parallel computing platforms, the individual computing nodes have
a high degree of independence and usually communicate via message passing. It is particularly difficult to
maintain global state and workload information as the local processing nodes usually make independent
scheduling and resource access decisions.
In addition, the dedication to an application domain very often leads to heterogeneous distributed
implementations, where each node is specialized to its local environment and/or its functionality. For
example, in an automotive application one may find nodes (usually called embedded control units) that contain a communication controller, a CPU, memory, and I/O interfaces. But depending on the particular
task of a node, it may contain additional digital signal processors (DSP), different kinds of CPUs and
interfaces, and different memory capacities.
The same observation holds for the interconnection networks also. They may be composed of several
interconnected smaller sub-networks, each one with its own communication protocol and topology.
For example, in automotive applications we may find Controller Area Networks (CAN), time-triggered
protocols (TTP) like in TTCAN, or hybrid protocols like in FlexRay. The complexity of a design is
particularly high if the computation nodes responsible for a single application are distributed across
several networks. In this case, critical information may flow through several sub-networks and connecting
gateways before it reaches its destination.
Recently, we see that the architectural concepts of heterogeneity, distributivity, and parallelism described earlier can be seen on several layers of granularity. The term system-on-a-chip refers to the implementation of sub-systems on a single device that contains a collection of (digital or analogue) interfaces, busses, memory, and heterogeneous computing resources such as FPGAs, CPUs, controllers, and DSPs.
These individual components are connected using networks-on-chip that can be regarded as dedicated
interconnection networks involving adapted protocols, bridges, or gateways.
Based on the assessment given, it becomes obvious that heterogeneous and distributed embedded
systems are inherently difficult to design and to analyze. In many cases, not only the availability, the safety,
and the correctness of the computations of the whole embedded system are of major concern, but also
the timeliness of the results.
One cause for end-to-end timing constraints is the fact that embedded systems are frequently
connected to a physical environment through sensors and actuators. Typically, embedded systems are
reactive systems that are in continuous interaction with their environment and they must execute at a
pace determined by that environment. Examples are automatic control tasks, manufacturing systems,
mechatronic systems, automotive/air/space applications, radio receivers and transmitters, and signal processing tasks in general. Also in the case of multimedia and content production, missing audio or
video samples need to be avoided under all circumstances. As a result, many embedded systems must meet
real-time constraints, that is, they must react to stimuli within the time interval dictated by the environ-
ment. A real-time constraint is called hard, if not meeting that constraint could result in a catastrophic
failure of the system, and it is called soft otherwise. As a consequence, time-predictability in the strong
sense cannot be guaranteed using statistical arguments.
Finally, let us give an example that shows part of the complexity in the performance and timing analysis
of distributed embedded systems. The example, adapted from Reference 1, is particularly simple in order to point out one source of difficulties, namely the interaction of event streams on a communication resource (Figure 15.1).
FIGURE 15.1 Interference of two applications on a shared communication resource. (The figure shows application A1, consisting of a sensor, CPU, memory, and I/O device, and application A2, consisting of an Input interface, DSP, and playout buffer, with their tasks P1 through P6; both applications share a common bus, and an inset sketches the bus load over time, annotated with BCET and WCET.)
Analysis of Distributed Embedded Systems 15-3
The application A1 consists of a sensor that periodically sends bursts of data to the CPU, which stores them in the memory using a task P1. These data are processed by the CPU using a task P2, with a worst-case execution time (WCET) and a best-case execution time (BCET). The processed data are transmitted via the shared bus to a hardware input/output device that is running task P3. We suppose that the CPU uses a preemptive fixed-priority scheduling policy, where P1 has the highest priority. The maximal workload on the CPU is obtained when P2 continuously uses the WCET and when the sensor simultaneously submits data. There is a second streaming application A2 that receives real-time data in equidistant packets via the Input interface. The Input interface is running task P4 to send the data to a DSP for processing with task P5. The processed packets are then transferred to a playout buffer, and task P6 periodically removes packets from the buffer, for example, for playback. We suppose that the bus uses an FCFS (first come, first served) scheme for arbitration. As the bus transactions from the applications A1 and A2 interfere on the common bus, there will be jitter in the packet stream received by the DSP that eventually may lead to an undesirable buffer overflow or underflow. It is now interesting to note that the worst-case situation in terms of jitter occurs if the processing in A1 uses its BCET, as this leads to a blocking of the bus for a long time period. Therefore, the worst-case situation for the CPU load leads to a best case for the bus, and vice versa.
In case of more realistic situations, there will be simultaneous resource sharing on the computing and
communication resources, there may be different protocols and scheduling policies on these resources,
there may be a distributed architecture using interconnected sub-networks, and there may be additional
nondeterminism caused by unknown input patterns and data. It is the purpose of performance analysis
to determine the timing and memory properties of such systems.
15.1.2 Basic Terms
As a starting point for the analysis of timing and performance of embedded systems, it is very useful to clarify a few basic terms. Very often, the timing behavior of an embedded system can be described by the time interval between a specified pair of events. For example, the instantiation of a task, the occurrence of a sensor input, or the arrival of a packet could be a start event. Such events will be denoted as arrival events. Similarly, the finishing of an application or a part of it can again be modeled as an event, denoted as a finishing event. In case of a distributed system, the physical location of the finishing event may not be equal to that of the corresponding arrival event, and the processing may require the processing of a sequence or set of tasks, and the use of distributed computing and communication resources. In this case, we talk about end-to-end timing constraints. Note that not all pairs of events in a system are necessarily critical, that is, have deadline requirements.
An embedded system processes the data associated with arrival events. The timing of computations and communications within the embedded system may depend on the input data (because of data-dependent behavior of tasks) and on the arrival pattern. In case of a conservative resource sharing strategy, such as the time-triggered architecture (TTA), the interference between these tasks is removed by applying a static sharing strategy. If the use of shared resources is controlled by dynamic policies, all activities may interact with each other and the timing properties influence each other. As shown in Section 15.1.1, it is necessary to distinguish between the following terms:
Worst case and best case. The worst case and the best case are the maximal and minimal time intervals between the arrival and finishing events under all admissible system and environment states. The execution time may vary largely, owing to different input data and interference between concurrent system activities.
Upper and lower bounds. Upper and lower bounds are quantities that bound the worst- and best-case behavior. These quantities are usually computed offline, that is, not during the runtime of the system.
Statistical measures. Instead of computing bounds on the worst- and best-case behavior, one may also determine a statistical characterization of the runtime behavior of the system, for example, expected values, variances, and quantiles.
2006 by Taylor & Francis Group, LLC
15-4 Embedded Systems Handbook
In the case of real-time systems, we are particularly interested in upper and lower bounds. They are
used to verify statically whether the system meets its timing requirements, for example, deadlines.
In contrast to the end-to-end timing properties, the term performance is less well defined. Usually,
it refers to a mixture of the achievable deadline, the delay of events or packets, and the number of events
that can be processed per time unit (throughput). There is a close relation between the delay of individual
packets or events, the necessary memory in the embedded system, and the throughput: the required
memory is proportional to the product of throughput and delay. Therefore, we will concentrate on the
delay and memory properties in this chapter.
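The proportionality between memory, throughput, and delay can be illustrated with a small back-of-the-envelope calculation in the spirit of Little's law. The function name and all numbers below are our own illustrative assumptions, not values from the chapter:

```python
import math

def buffer_estimate(throughput_pkts_per_s: float, worst_case_delay_s: float,
                    pkt_size_bytes: int) -> int:
    """Memory estimate: packets in flight (throughput x delay) times packet size."""
    pkts_in_flight = math.ceil(throughput_pkts_per_s * worst_case_delay_s)
    return pkts_in_flight * pkt_size_bytes

# 10,000 packets/s with a 2 ms worst-case delay and 64-byte packets:
# 10,000 * 0.002 = 20 packets in flight -> 20 * 64 = 1280 bytes of buffer
mem = buffer_estimate(10_000, 0.002, 64)
```

The estimate is conservative in the sense that it uses the worst-case delay; halving the delay through a faster arbitration scheme would halve the required buffer memory at the same throughput.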
Several methods exist, such as analysis, simulation, emulation, and implementation, to
determine or approximate the above quantities. Besides analytic methods based on formal models, one may
also consider simulation, emulation, or implementation. All the latter possibilities should be used with
care, as only a finite set of initial states, environment behaviors, and execution traces can be considered.
As is well known, the corner cases that lead to the worst-case execution time (WCET) or best-case
execution time (BCET) are usually not known, and thus incorrect results may be obtained. The huge
state space of realistic system architectures makes it highly improbable that the critical instances of the
execution can be determined without the help of analytical methods.
In order to understand the requirements for performance analysis methods in distributed embedded
systems, we will classify possible causes for a large difference between the worst case and best case or
between the upper and lower bounds:
Nondeterminism and interference. Let us suppose that there is only limited knowledge about the
environment of the embedded system, for example, about the time when external events arrive
or about their input data. In addition, there is interference of computation and communication
on shared resources such as CPU, memory, bus, or network. Then, we will say that the timing
properties are nondeterministic with respect to the available information. Therefore, there will be
a difference between the worst-case and the best-case behavior as well as between the associated
bounds. An example may be that the execution time of a task may depend on its input data. Another
example is the communication of data packets on a bus in case of an unknown interference.
Limited analyzability. If there is complete knowledge about the whole system, then the behavior of
the system is determined. Nevertheless, it may be that because of the system complexity, there is
no feasible way of determining close upper and lower bounds on the worst- and best-case timing,
respectively.
As a result of this discussion, we understand that methods to analyze the performance of distributed
embedded systems must be (1) correct in that they determine valid upper and lower bounds and (2) accurate
in that the determined bounds are close to the actual worst case and best case.
In contrast to other chapters of the handbook, we will concentrate on the interaction between the
task level of an embedded system and the distributed operation. We suppose that the whole application
is partitioned into tasks and threads. Therefore, the task level refers to operating system issues such as
scheduling, memory management, and arbitration of shared resources. In addition, we are faced with
applications that run on distributed resources. The corresponding layer contains methods of distributed
scheduling and networking. On this level of abstraction we are interested in end-to-end timing and
performance properties.
15.1.3 Role in the Design Process
One of the major challenges in the design process of embedded systems is to estimate essential character-
istics of the nal implementation early in the design. This can help in making important design decisions
before investing too much time in detailed implementations. Typical questions faced by a designer during
a system-level design process are: which functions should be implemented in hardware and which
in software (partitioning)? Which hardware components should be chosen (allocation)? How should the
different functions be mapped onto the chosen hardware (binding)? Do the system-level timing properties
meet the design requirements? What are the different bus utilizations and which bus or processor acts
[Figure 15.2 (diagram): the application specification and the execution platform feed into a mapping/scheduling/arbitration step, which is coupled in a loop with design space exploration and performance analysis.]
FIGURE 15.2 Relation between design space exploration and performance analysis.
as a bottleneck? Then there are also questions related to the on-chip memory requirements and off-chip
memory bandwidth.
Typically, the performance analysis or estimation is part of the design space exploration, where different
implementation choices are investigated in order to determine the appropriate design trade-offs between
the different conflicting objectives; for an overview see Reference 2. Following Figure 15.2, the estimation
of system properties in an early design phase is an essential part of the design space exploration. Different
choices of the underlying system architecture, the mapping of the applications onto this architecture, and
the chosen scheduling and arbitration schemes will need to be evaluated in terms of the different quality
criteria.
In order to achieve acceptable design times though, there is a need for automatic or semiautomatic
(interactive) exploration methods. As a result, there are additional requirements for performance analysis
if used for design space exploration, namely (1) simple reconfigurability with respect to architecture,
mapping, and resource sharing policies, (2) a short analysis time in order to be able to test many different
choices in a reasonable time frame, and (3) the possibility to cope with incomplete design information,
as typically the lower layers are not designed or implemented yet.
Even if the design space exploration as described is not a part of the chosen design methodology, the
performance analysis is often part of the development process of software and hardware. In embedded
system design, the functional correctness is validated after each major design step using simulation or
formal methods. If there are nonfunctional constraints such as deadline or throughput requirements, they
need to be validated as well, and all aspects of the design representation related to performance become
first-class citizens.
Finally, performance analysis of the whole embedded system may be done after completion of the
design, in particular if the system is operated under hard real-time conditions where timing failures lead
to a catastrophic situation. As has been mentioned earlier, performance simulation is not appropriate
in this case because the critical instances and test patterns are not known in general.
15.1.4 Requirements
Based on the discussion, one can list some of the requirements that a methodology for performance
analysis of distributed embedded systems must satisfy:
Correctness. The results of the analysis should be correct, that is, there exist no reachable system
states and feasible reactions of the system environment such that the calculated bounds are violated.
Accuracy. The lower and upper bounds determined by the performance analysis should be close to
the actual worst- and best-case timing properties.
Embedding into the design process. The underlying performance model should be sufficiently
general to allow the representation of the application (which possibly uses different specification
mechanisms), of the environment (periodic, aperiodic, bursty, different event types), of the
mapping including the resource sharing strategies (preemption, priorities, time triggered), and of
the hardware platform. The method should seamlessly integrate into the functional specification
and design methodology.
Short analysis time. Especially if the performance analysis is part of a design space exploration,
a short analysis time is important. In addition, the underlying model should allow for
reconfigurability in terms of application, mapping, and hardware platform.
As distributed systems are heterogeneous in terms of the underlying execution platform, the diverse
concurrently running applications, and the different scheduling and arbitration policies used, modularity
is a key requirement for any performance analysis method. We can distinguish between several composition
properties:
Process composition. Often, events need to be processed by several consecutive application tasks.
In this case, the performance analysis method should be modular in terms of this functional
composition.
Scheduling composition. Within one implementation, different scheduling methods can be com-
bined, even within one computing resource (hierarchical scheduling); the same property holds for
the scheduling and arbitration of communication resources.
Resource composition. A system implementation can consist of different heterogeneous computing
and communication resources. It should be possible to compose them in a similar way as processes
and scheduling methods.
Building components. Processes, associated scheduling methods, and architecture
elements should be combinable into larger components. This way, one could associate a performance
component with a combined hardware/operating system/software module of the implementation
that exposes the performance requirements but hides internal implementation details.
It should be mentioned that none of the approaches known to date are able to satisfy all of the
above mentioned criteria. On the other hand, depending on the application domain and the chosen
design approach, not all of the requirements are equally important. Section 15.2 summarizes some of the
available methods and in Section 15.3 one available method is described in more detail.
15.2 Approaches to Performance Analysis
In this survey, we select just a few representative and promising approaches that have been proposed for
the performance analysis of distributed embedded systems.
15.2.1 Simulation-Based Methods
Currently, the performance estimation of embedded systems is mainly done using simulation or trace-
based simulation. Examples of available approaches and software support are provided by the SystemC
initiative, see for example, References 3 and 4, which is supported by tools from companies such as Cadence
(nc-systemc) and Synopsys (System Studio). In simulation-based methods, many dynamic and com-
plex interactions can be taken into account, whereas analytic methods usually have to stick to a restrictive
underlying model and suffer from limited scope. In addition, there is the possibility to match the level
of abstraction in the representation of time to the required degree of accuracy. Examples of these differ-
ent layers range from cycle-accurate models, for example, those used in the simulation of processors [5], up to
networks of discrete-event components that can be modeled in SystemC.
In order to determine timing properties of an embedded system, a simulation framework not only
has to consider the functional behavior but also requires a concept of time and a way of taking into
[Figure 15.3 (diagram): input stimuli drive a cosimulation (based on an abstract architecture), which produces an abstract trace and an initial CAG; the communication topology, mapping, and arbitration protocols are then used to refine the CAG, from which the performance estimation is obtained. The cosimulation stage belongs to simulation, the remaining stages to analysis.]
FIGURE 15.3 A hybrid method for performance estimation, based on simulation and analytic methods.
account properties of the execution platform, of the mapping between functional computation and
communication processes and elements of the underlying hardware, and of resource sharing policies
(as usually implemented in the operating system or directly in hardware). This additional complexity
leads to higher computation times, and performance estimation quickly becomes a bottleneck in the
design. Besides, there is a substantial set-up effort necessary if the mapping of the application to the
underlying hardware platform changes, for example, in order to perform a design space exploration.
The fundamental problem of simulation-based approaches to performance estimation is the
insufficient corner case coverage. As shown in the example in Figure 15.1, the sub-system corner case (high
computation time of A1) does not lead to the system corner case (small computation time of A1). Designers
must provide a set of appropriate simulation stimuli in order to cover all the corner cases that exist in the
distributed embedded system. Failures of embedded systems very often relate to timing anomalies that
happen infrequently and are therefore almost impossible to discover by simulation. In general, simulation
provides estimates of the average system performance but does not yield worst-case results and cannot
determine whether the system satisfies required timing constraints.
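The coverage problem can be made concrete with a toy experiment. The data-dependent task model below is entirely our own construction (a single rare input triggers the slow path); it only illustrates why random stimuli tend to miss the worst case:

```python
import random

def exec_time(x: int) -> float:
    # Hypothetical data-dependent task: one rare input triggers a slow path.
    return 10.0 if x == 0 else 1.0

random.seed(42)
observed = [exec_time(random.randrange(10_000)) for _ in range(1_000)]
simulated_wcet = max(observed)  # usually 1.0: the corner case x == 0 is rarely drawn
true_wcet = 10.0                # known here only because we wrote the model

# Simulation can only ever underestimate the worst case:
assert simulated_wcet <= true_wcet
```

With these numbers, a thousand random runs hit the slow path with probability below 10%, so the simulated maximum typically reports a value an order of magnitude below the true WCET.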
The approach taken by Lahiri et al. [6] combines performance simulation and analysis in a hybrid
trace-based methodology. It is intended to fill the gap between pure simulation, which may be too slow to be
used in a design space exploration cycle, and analytic methods, which are often too restricted in scope and
not accurate enough. The approach as described concentrates on communication aspects of a distributed
embedded system. The performance estimation is partitioned into several stages, see Figure 15.3:
Stage 1. An initial cosimulation of the whole distributed system is performed. The simulation not
only covers functional aspects (processing of data) but also captures the communication in an
abstract manner, that is, in the form of events, tokens, or abstract data transfers. The resulting set
of traces covers essential characteristics of computation and communication but no longer contains
data values. Here, resource sharing, such as different arbitration schemes and access conflicts,
is not yet taken into account. The output of this step is a timing-inaccurate system
execution trace.
Stage 2. The traces from stage 1 are transformed into an initial Communication Analysis
Graph (CAG). One can omit unnecessary details (of the values of the data communicated, only the
size might be important here, etc.), and bursts of computation/communication events might be
clustered by identifying only the start and end times of these bursts.
Stage 3. A communication topology is chosen, the mapping of the abstract communications to
paths in the communication architecture (network, bus, point-to-point links) is specified, and
finally, the corresponding arbitration protocols are chosen.
Stage 4. In the analytic part of the whole methodology, the CAG from stage 2 is transformed
and refined using the information from stage 3. It now captures the computation, communication,
and synchronization as seen on the target system. To this end, the initial CAG is augmented to
incorporate the various latencies and additional computations introduced by moving from an
abstract communication model to an actual one.
The resulting CAG can then be analyzed in order to estimate the system performance, determine critical
paths, and collect various statistics about the computation and communication components.
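The critical-path part of such an analysis can be pictured as a longest-path computation over an acyclic graph of computation and communication nodes. The toy graph and its node names below are invented for illustration; they do not reproduce the CAG structure of Reference 6:

```python
# Toy stand-in for a refined CAG: node -> (cost, list of successors).
cag = {
    "compute_A": (3.0, ["send_bus"]),
    "send_bus":  (1.5, ["compute_B"]),
    "compute_C": (4.0, ["compute_B"]),
    "compute_B": (2.0, []),
}

def longest_path(node: str) -> float:
    """Cost of the most expensive path starting at node (graph must be acyclic)."""
    cost, succs = cag[node]
    return cost + max((longest_path(s) for s in succs), default=0.0)

# The critical path dominates the estimated end-to-end latency:
critical = max(longest_path(n) for n in cag)  # compute_A -> send_bus -> compute_B
```

On a real CAG one would additionally attach arbitration delays to the communication nodes before searching for the critical path.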
The above approach still suffers from several disadvantages. All traces are the result of a simulation, and
the coverage of corner cases is still limited. The underlying representation is a complete execution of the
application in the form of a graph that may be of prohibitive size. The effects of the transformations applied
in order to (1) reduce the size of the CAG and (2) incorporate the concrete communication architecture
are not formally specified. Therefore, it is not clear what the final analysis results represent. Finally,
because of the separation between the functional simulation and the nonfunctional analysis, no feedback
is possible. For example, a buffer overflow caused by a sporadic communication overload situation may
lead to a difference in the functional behavior. Nevertheless, the described approach blends two important
approaches to performance estimation, namely simulation and analytic methods, and makes use of the
best properties of both worlds.
15.2.2 Holistic Scheduling Analysis
There is a large body of formal methods available for scheduling of shared computing resources,
for example, fixed-priority, rate-monotonic, and earliest-deadline-first scheduling, time-triggered policies like
TDMA or round-robin, and static cyclic scheduling. From the WCET of individual tasks, the arrival pattern
of activation, and the particular scheduling strategy, one can in many cases analyze the schedulability and
worst-case response times, see for example, Reference 7. Many different application models and event
patterns have been investigated, such as sporadic, periodic, jitter, and bursts. A large number
of commercial tools exist that allow, for this one-model approach, the analysis of quantities such as resource
load and response times. In a similar way, network protocols are increasingly supported by analysis and
optimization tools.
The classical scheduling theory has been extended toward distributed systems where the application
is executed on several computing nodes and the timing properties of the communication between these
nodes cannot be neglected. The seminal work of Tindell and Clark [8] combined fixed-priority preemptive
scheduling at computation nodes with TDMA scheduling on the interconnecting bus. These results
are based on two major achievements:
The communication system (in this case, the bus) was handled in a similar way to the computing
nodes. Because of this integration of process and communication scheduling, the method was called
a holistic approach to the performance analysis of distributed real-time systems.
The second contribution was the analysis of the influence of the release jitter on the response time,
where the release jitter denotes the worst-case time difference between the arrival (or activation)
of a process and its release (making it available to the processor). Finally, the release jitter has been
linked to the message delay induced by the communication system.
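The flavor of this analysis can be sketched with the classical response-time recurrence for fixed-priority preemptive scheduling extended by release jitter. This is a simplified rendering in the spirit of Tindell and Clark; the exact formulation in Reference 8 is more general, and the task-set numbers are illustrative:

```python
import math

def response_time(C, T, J, i, max_iter=1000):
    """Worst-case response time of task i (tasks 0..i-1 have higher priority).
    C: worst-case execution times, T: periods, J: release jitters.
    Iterates w = C[i] + sum_j ceil((w + J[j]) / T[j]) * C[j] to a fixed point;
    the result is measured from arrival, hence the added J[i]."""
    w = C[i]
    for _ in range(max_iter):
        w_next = C[i] + sum(math.ceil((w + J[j]) / T[j]) * C[j] for j in range(i))
        if w_next == w:
            return w + J[i]
        w = w_next
    raise RuntimeError("no fixed point: task may be unschedulable")

# Illustrative two-task set, highest priority first:
C, T, J = [1, 5], [4, 20], [0, 2]
r = response_time(C, T, J, 1)  # jitter J[1] = 2 adds directly to the response time
```

In the holistic setting, the jitter term of a task is then set to the worst-case delay of the message that triggers it, which couples the bus analysis with the node-level analysis.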
This work was improved in terms of accuracy by Yen and Wolf [9] by taking into account correlations
between arrivals of triggering events. In the meantime, many extensions and applications have been pub-
lished based on the same line of thought. Other combinations of scheduling and arbitration policies have
been investigated, such as CAN [10] and, more recently, the FlexRay protocol [11]. The latter extension
opens the holistic scheduling methodology to mixed event-triggered and time-triggered systems where the
processing and communication are driven by the occurrence of events or the advance of time, respectively.
Nevertheless, it must be noted that the holistic approach does not scale to general distributed architec-
tures in that for every new kind of application structure, sharing of resources, and combination thereof,
a new analysis needs to be developed. In general, the model complexity grows with the size of the system
and the number of different scheduling techniques. In addition, the method is restricted to the classical
models of task arrival patterns such as periodic, or periodic with jitter.
15.2.3 Compositional Methods
Three main problems arise in the case of complex distributed embedded systems. First, the architecture
of such systems, as already mentioned, is highly heterogeneous: the different architectural components
are designed assuming different input event models and use different arbitration and resource sharing
strategies. This makes any kind of compositional performance analysis difficult. Second, applications very
often rely on a high degree of concurrency. Therefore, there are multiple control threads, which additionally
complicate timing analysis. And third, we cannot expect that an embedded system only needs to process
periodic events with a fixed number of bytes associated with each event. If, for example, the event stream
represents a sampled voice signal, then after several coding, processing, and communication steps, the
amount of data per event as well as the timing may have changed substantially. In addition, stream-based
systems often also have to process other event streams that are sporadic or bursty, for example, they have
to react to external events or deal with best-effort traffic for coding, transcription, or encryption. There
are only a few approaches available that can handle such complex interactions.
One approach is based on a unifying model of different event patterns in the form of arrival curves
as known from the networking domain, see References 12 and 13. The proposed real-time calculus (RTC)
represents the resources and their processing or communication capabilities in a compatible manner and
therefore, allows for a modular hierarchical scheduling and arbitration for distributed embedded systems.
The approach will be explained in Section 15.3 in some more detail.
Richter et al. propose in References 1, 14, and 15 a method that is based on classical real-time schedul-
ing results. They combine different well-known abstractions of event and task arrival patterns and provide
additional interfaces between them. The approach is based on the following principles:
The main goal is to make use of the very successful results in real-time scheduling, in particular for
sharing a single processor or a single communication link, see for example, References 7 and 16. For
a large class of scheduling and arbitration policies and a set of arrival patterns (periodic, periodic
with jitter, sporadic, and bursty), upper and lower bounds on the response time can be determined,
that is, the time difference between the arrival of a task and its finishing time. Therefore, the
abstraction of a task of the application consists of a triggering event stream with a certain arrival
pattern and the WCET and BCET on the resource. Several tasks can be mapped onto a single resource.
Together with the scheduling policy, one can obtain for each task the associated lower and upper
bounds of the response time. In a similar way, communication and shared buses can be handled.
The application model is a simple concatenation of several tasks. The end-to-end delay can now
be obtained by adding the individual contributions of the tasks; the necessary buffer memory can
simply be computed taking into account the initial arrival pattern.
Obviously, the approach is feasible only if the arrival patterns fit the few basic models for which
results on computing bounds on the response time are available. In order to overcome this
limitation, two types of interfaces are defined:
(a) EMIF. Event Model Interfaces are used in the performance analysis only. They perform
a type conversion between certain arrival patterns, that is, they change the mathematical
representation of event streams.
(b) EAF. Event Adaptation Functions need to be used in cases where no EMIF exists. In this
case, the hardware/software implementation must be changed in order to make the system
analyzable, for example, by adding playout buffers at appropriate locations.
In addition, a new set of six arrival patterns was defined in Reference 1 which is more suitable for the
proposed type conversion using EMIF and EAF, see Figure 15.4.
In Figure 15.5, the example of Figure 15.1 is extended by adding to the tasks P1 to P6 appropriate
arrival patterns (event stream abstractions) and EMIF/EAF interfaces. For example, we suppose that there
is an analysis method for the bus arbitration scheme available that requires periodic with jitter as the
input model. As the transformation from periodic with burst requires an EAF, the implementation must
be changed to accommodate a buffer that smoothens the bursts. From periodic to periodic with jitter,
[Figure 15.4 (diagram): some of the arrival patterns. Periodic: t_{i+1} − t_i = T. Periodic w/jitter: t_i = i·T + w_i + φ_0 with 0 ≤ w_i ≤ J, shown for J < T; periodic w/burst is the same model with J > T. Sporadic events satisfy t_{i+1} − t_i ≥ d. Shaded regions mark the admissible occurrences of events.]
FIGURE 15.4 Some arrival patterns of tasks that can be used to characterize properties of event streams in Reference 1. T, J, and d denote the period, jitter, and minimal interarrival time, respectively; φ_0 denotes a constant phase shift.
[Figure 15.5 (diagram): tasks P1 to P6 with the Sensor, Memory, and Buffer of Figure 15.1; the event streams are annotated with arrival patterns (periodic, periodic w/jitter, periodic w/burst, sporadic), and EAF/EMIF interfaces are inserted around the bus communications C1 and C2 and before task P3.]
FIGURE 15.5 Example of event stream interfaces for the example in Figure 15.1.
one can construct a lossless EMIF simply by setting the jitter J = 0. There is another interface between
communication C1 and task P3 that converts the bursty output of the bus to a sporadic model. Now,
one can apply performance analysis methods to all of the components. As a result, one may determine the
minimal buffer size and an appropriate scheduling policy for the DSP such that no overflow or underflow
occurs.
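The two EMIF conversions used in this example can be sketched as pure type conversions on the model parameters. The function names are ours, and the sporadic conversion rests on the periodic-with-jitter model t_i = i·T + w_i with 0 ≤ w_i ≤ J; it is a sketch, not the EMIF catalog of Reference 1:

```python
def periodic_to_jitter(T: float) -> tuple[float, float]:
    """Lossless EMIF: a periodic stream is periodic-with-jitter with J = 0."""
    return (T, 0.0)

def jitter_to_sporadic(T: float, J: float) -> float:
    """Conservative EMIF: with t_i = i*T + w_i and 0 <= w_i <= J, consecutive
    events are at least T - J apart, so d = T - J is a safe minimal
    interarrival time."""
    assert J < T, "for J >= T events may cluster; an EAF (buffer) is needed"
    return T - J

assert periodic_to_jitter(10.0) == (10.0, 0.0)
assert jitter_to_sporadic(10.0, 2.0) == 8.0
```

The guard in `jitter_to_sporadic` mirrors the distinction in the text: once the jitter exceeds the period (the bursty case), no EMIF exists and an EAF, that is, an implementation change, becomes necessary.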
Several extensions have been worked out, for example, in order to deal with cyclic nonfunctional
dependencies and to generalize the application model. Nevertheless, when compared against the requirements
for a modular performance analysis, the approach has some inherent drawbacks. EAFs are a consequence of the
limited class of supported event models and the available analysis methods: the analysis method enforces
a change in the implementation. Furthermore, the approach is not modular in terms of the resources,
as their service is not modeled explicitly. For example, if several scheduling policies need to be combined
in one resource (hierarchical scheduling), then for each new combination an appropriate analysis method
must be developed. In this way, the approach suffers from the same problem as the holistic approach
described earlier. In addition, one is bound to the classical arrival patterns, which are not sufficient in the case
of stream-processing applications. Other event models need to be converted with loss in accuracy (EMIF)
or the implementation must be changed (EAF).
15.3 The Performance Network Approach
This section describes an approach to the performance analysis of embedded systems that is influenced by
the worst-case analysis of communication networks. The network calculus as described in Reference 17
is based on Reference 18 and uses (max,+)-algebra to formulate the necessary operations. The network
calculus is a promising analysis methodology as it is designed to be modular in various respects and as the
representation of event (or packet) streams is not restricted to the few classes mentioned in Section 15.2.3.
In References 12 and 19, the method has been extended to the RTC in order to deal with distributed
embedded systems by combining computation and communication. Because of the detailed modeling of
the capability of the shared computing and communication resources as well as the event streams, a high
accuracy can be achieved, see Reference 20. The following sections serve to explain the basic approach.
In addition, the main performance analysis method is not bound to the use of the RTC. Instead,
any suitable abstraction of event streams and resource characterization is possible. Only the actual
computations that are done within the components of the performance network need to be changed
appropriately.
15.3.1 Performance Network
In functional specification and verification, the given application is usually decomposed into components
that communicate via event interfaces. The properties of the whole system are investigated by
combining the behavior of the components. This kind of representation is common in the design of
complex embedded systems and is supported by many tools and standards, for example, UML. It would be
highly desirable if the performance analysis followed the same line of thinking, as it could then be integrated into
the usual design methodology easily. Considering the discussion given earlier, we can identify two major
additions that are necessary:
Abstraction. Performance analysis is interested in making statements about the timing behavior
not just for one specific input characterization but for a larger class of possible environments.
Therefore, the concrete event streams that flow between the components must be represented in
an abstract way. As an example, we have seen their characterization as periodic or sporadic with
jitter. In the same way, the nonfunctional properties of the application and the resource sharing
mechanisms must be modeled appropriately.
Resource modeling. In comparison to functional validation, we need to model the resource capabil-
ities and how they are changed by the workload of tasks or communication. Therefore, in contrast
to the approaches described before, we will model the resources explicitly as first-class citizens of
the approach.
As an example of a performance network, let us look again at the simple example from Figure 15.1 and
Figure 15.5. In Figure 15.6, we see a corresponding performance network. Because of the simplicity of the
example, not all the modeling possibilities can be shown.
On the left-hand side, you see the abstract inputs which model the sources of the event streams that
trigger the tasks of the applications: Timer represents the periodic instantiation of the task that reads
out the buffer for playback, Sensor models the periodic bursty events from the sensor, and RT data
denotes the real-time data arriving in equidistant packets via the Input interface. The associated abstract event
streams are transformed by the performance components. At the top, you can see the resource modules that
model the service of the shared resources, for example, the Input, CPU, Bus, and I/O components. The
abstract resource streams (vertical direction) interact with the event streams on the performance modules
and performance components. The resource interfaces at the bottom represent the remaining resource
service that is available to other applications that may run on the execution platform.
The performance components represent (1) the way in which the timing properties of input event streams
are transformed into timing properties of output event streams and (2) the transformation of the resources.
Of course, these components can be hierarchically grouped into larger components. The way in which the
[Figure 15.6 (diagram): a performance network with resource modules Input, CPU, Bus, and DSP along the top; abstract inputs Sensor, RT data, and Timer on the left; performance components P1 to P6 and C1, C2 connected by horizontal abstract event streams and vertical abstract resource streams; resource interfaces, including I/O, at the bottom.]
FIGURE 15.6 A simple performance network related to the example in Figure 15.1.
performance components are grouped and their transfer functions reflect the resource sharing strategy.
For example, P1 and P2 are connected serially in terms of the resource stream and therefore model
a fixed-priority scheme with the high priority assigned to task P1. If the bus implements an FCFS strategy
or a time-triggered protocol (TTP), the transfer function of C1/C2 needs to be determined such that the abstract
representations of the event and resource streams are correctly transformed.
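The serial connection in terms of the resource stream can be sketched on a discrete grid of interval lengths: the service left over by the high-priority component is handed down to the next one. The sup-based formula and the curve values below are our own simplification of the RTC composition, not the full calculus:

```python
def remaining_service(beta_l, alpha_u):
    """Lower remaining service after a greedy high-priority component:
    out[k] = max over 0 <= m <= k of (beta_l[m] - alpha_u[m]), clamped at 0.
    beta_l[k] / alpha_u[k] are lower service / upper arrival for interval length k."""
    out, best = [], 0.0
    for b, a in zip(beta_l, alpha_u):
        best = max(best, b - a)
        out.append(max(0.0, best))
    return out

beta_l  = [0, 1, 2, 3, 4, 5]   # full processor: one unit of service per time slot
alpha_u = [1, 1, 2, 2, 3, 3]   # demand of the high-priority task P1
beta_p2 = remaining_service(beta_l, alpha_u)  # service curve handed down to P2
```

Chaining calls of this kind, one per priority level, is exactly the vertical resource-stream direction in Figure 15.6: the output service curve of one performance component is the input service curve of the next.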
15.3.2 Variability Characterization
The timing characterization of event and resource streams is based on Variability Characterization Curves
(VCCs), which substantially generalize classical representations such as the sporadic or periodic models. As the
event streams propagate through the distributed architecture, their timing properties become increasingly
complex and the standard patterns cannot model them with adequate accuracy.
The event streams are described using arrival curves $\bar\alpha^u(\Delta), \bar\alpha^l(\Delta): \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$, which
provide upper and lower bounds on the number of events in any time interval of length $\Delta$. In particular,
there are at most $\bar\alpha^u(\Delta)$ and at least $\bar\alpha^l(\Delta)$ events within the time interval $[t, t+\Delta)$ for all $t \ge 0$.
In a similar way, the resource streams are characterized using service functions $\beta^u(\Delta), \beta^l(\Delta): \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$,
which provide upper and lower bounds on the available service in any time interval of length $\Delta$.
The unit of service depends on the kind of shared resource, for example, instructions (computation)
or bytes (communication).
Note that, as defined above, the VCCs $\bar\alpha^u(\Delta)$ and $\bar\alpha^l(\Delta)$ are expressed in terms of events (this is marked
by a bar on their symbol), while the VCCs $\beta^u(\Delta)$ and $\beta^l(\Delta)$ are expressed in terms of workload/service.
A method to transform event-based VCCs into workload/resource-based VCCs and vice versa is presented
later in this chapter. All calculations and transformations presented here are valid both with only event-based
and with only workload/resource-based VCCs, but in this chapter mainly the event-based formulation
is used.
Figure 15.7 shows arrival curves that specify the basic classical models shown in Figure 15.4. Note that
in the case of sporadic patterns, the lower arrival curves are 0. In a similar way, Figure 15.8 shows a service
curve of a simple TDMA bus access with period T, bandwidth b, and slot length τ.
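To make the construction concrete, the closed forms for the classical patterns and for TDMA can be evaluated directly. The following sketch is our own illustration, not code from this chapter (the function names are invented; the slot parameter corresponds to the TDMA description above):

```python
import math

def arrival_curves_pjd(delta, T, J=0.0, d=0.0):
    """Upper/lower arrival curves (events in any interval of length delta)
    for the periodic (T) / jitter (J) / burst-distance (d) model;
    J = d = 0 gives the strictly periodic case."""
    if delta <= 0:
        return 0, 0
    upper = math.ceil((delta + J) / T)
    if d > 0:
        # bursts cannot be closer together than the minimum distance d
        upper = min(upper, math.ceil(delta / d))
    lower = max(0, math.floor((delta - J) / T))
    return upper, lower

def tdma_service_lower(delta, T, slot, b):
    """Lower service curve of a TDMA resource: one slot of length `slot`
    per period T, bandwidth b; in the worst case the interval starts
    just after a slot has ended."""
    full, rem = divmod(delta, T)
    return b * (full * slot + max(0.0, rem - (T - slot)))
```

For instance, a strictly periodic stream with T = 5 yields two events in any interval of length 10, while a TDMA slot of length 2 in a period of 10 guarantees no service at all over intervals shorter than a full period.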
Analysis of Distributed Embedded Systems 15-13
[Figure 15.7 has three panels, Periodic, Periodic w/jitter, and Periodic w/bursts, each plotting $\bar\alpha^u$ and $\bar\alpha^l$ against the interval length $\Delta$ in terms of the period $T$, the jitter $J$, and the minimum inter-arrival distance $d$.]
FIGURE 15.7 Basic arrival functions related to the patterns described in Figure 15.4.
[Figure 15.8 plots the TDMA slot allocation over time together with the resulting service curves $\beta^u$ and $\beta^l$; their slope within a slot is given by the bandwidth $b$.]
FIGURE 15.8 Example of a service curve that describes a simple TDMA protocol.
Note that arrival curves can be approximated by piecewise linear functions. Moreover, there are of course
finite representations of the arrival and service curves, for example, obtained by decomposing them into
an irregular initial part and a periodic part.
Where do we get the arrival and service functions from, for example, those characterizing a processor
(CPU in Figure 15.6) or an abstract input (Sensor in Figure 15.6)?
Pattern. In some cases, the pattern of the event or resource stream is known, for example,
bursty, periodic, sporadic, or TDMA. In this case, the functions can be constructed analytically,
see, for example, Figure 15.7 and Figure 15.8.
Trace. In the case of unknown arrival or service patterns, one may use a set of traces and compute
their envelope. This can be done easily by sliding a window of size Δ over a trace and determining the
maximum and minimum number of events (or service) within the window.
Data sheets. In other cases, one can derive the curves by deriving bounds from the characteristics
of the generating device (in the case of an arrival curve) or the hardware component (in the case of
a service curve).
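The trace-based item above can be sketched as follows. This is an illustrative implementation of our own, not from the chapter; it assumes a half-open window $[t, t+\Delta)$ as in the arrival-curve definition and only checks window positions where the event count can actually change:

```python
def arrival_envelope_from_trace(timestamps, deltas, eps=1e-9):
    """Empirical arrival-curve envelope of an event trace: for each
    window length delta, slide [t, t + delta) over the trace and record
    the maximum/minimum number of events inside.  Window starts are
    placed at t = 0, at each event, and just after each event (the only
    points where the count can change)."""
    ts = sorted(timestamps)
    end = ts[-1]                      # observation horizon of the trace
    upper, lower = {}, {}
    for delta in deltas:
        starts = [0.0] + ts + [x + eps for x in ts]
        hi, lo = 0, None
        for t in starts:
            n = sum(1 for x in ts if t <= x < t + delta)
            hi = max(hi, n)
            # only windows fully inside the trace count for the minimum
            if t + delta <= end + eps:
                lo = n if lo is None else min(lo, n)
        upper[delta] = hi
        lower[delta] = 0 if lo is None else lo
    return upper, lower
```

For a strictly periodic trace with period 5 the empirical envelope reproduces the analytical curves, for example, at most two events and at least one event in any window of length 6.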
The performance components transform abstract event and resource streams. So far, however, the arrival
curve is defined in terms of events per time interval whereas the service curve is given in terms of service per
time interval. One possibility to bridge this gap is to define the concept of workload curves that relate
the number of successive events in an event stream to the maximal or minimal associated workload.
They capture the variability in execution demands.
The upper and lower workload curves $\gamma^u(e), \gamma^l(e): \mathbb{N}_{\ge 0} \to \mathbb{R}_{\ge 0}$ denote the maximal and minimal workload
on a specific resource for any sequence of $e$ consecutive events. If these curves are available, then we
can easily determine upper and lower bounds on the workload that an event stream imposes in any time
interval of length $\Delta$ on a resource as $\alpha^u(\Delta) = \gamma^u(\bar\alpha^u(\Delta))$ and $\alpha^l(\Delta) = \gamma^l(\bar\alpha^l(\Delta))$, respectively, and
analogously, $\bar\beta^u(\Delta) = (\gamma^l)^{-1}(\beta^u(\Delta))$ and $\bar\beta^l(\Delta) = (\gamma^u)^{-1}(\beta^l(\Delta))$. As in the case of the arrival and service
curves, the question arises as to where the workload curves can come from. A selection of possibilities
is given below:
WCET and BCET. The simplest possibility is to (1) assume that each event of an event stream
triggers the same task and (2) assume that this task has a given WCET (worst-case execution time)
and BCET (best-case execution time) determined by other
[Figure 15.9, left panel: $\gamma^u$ and $\gamma^l$ for a task with WCET = 4 and BCET = 3. Right panel: $\gamma^u$ and $\gamma^l$ for a task refined into subtasks with individual workloads.]
FIGURE 15.9 Two examples of modeling the relation between incoming events and the associated workload on
a resource. The left-hand side shows a simple modeling in terms of the WCET and BCET of the task triggered by an
event. The right-hand side models the workload generated by a task through a finite state machine. The workload
curves can be constructed by considering the maximum or minimum weight paths with e transitions.
methods. An example of an associated workload curve is given in Figure 15.9. The same holds for
communication events as well.
Application modeling. The above method models the fact that not all events lead to the same
execution load (or number of bits) by simply using upper and lower bounds on the execution time.
The accuracy of this approach can be substantially improved if characteristics of the application are
taken into account, for example, (1) distinguishing between different event types, each triggering
a different task, and (2) modeling that it is not possible for many consecutive events to all have the
WCET (or BCET). This way, one can model correlations in event streams, see Reference 21.
The right-hand side of Figure 15.9 shows a simple example where a task is refined into a set
of subtasks. At each incoming event, a subtask generates the associated workload and the program
branches to one of its successors.
Trace. As in the case of arrival curves, we can use a given trace and record the workload associated
with each event, for example, by simulation. Based on this information, we can easily compute the
upper and lower envelopes.
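For the WCET/BCET item above, the workload curves reduce to linear scaling and the pseudo-inverse to flooring. A minimal sketch, using the WCET = 4 and BCET = 3 values of Figure 15.9 (the helper names and the example arrival and service curves are our own assumptions):

```python
import math

# Simplest workload model (left of Figure 15.9): every event triggers the
# same task, so gamma_u(e) = e * WCET and gamma_l(e) = e * BCET.
WCET, BCET = 4, 3
gamma_u = lambda e: e * WCET
gamma_l = lambda e: e * BCET

def events_to_workload(bar_alpha_u, delta):
    """alpha_u(delta) = gamma_u(bar_alpha_u(delta)): worst-case workload
    demanded on the resource in any interval of length delta."""
    return gamma_u(bar_alpha_u(delta))

def service_to_events(beta_l, delta):
    """bar_beta_l(delta) = gamma_u^{-1}(beta_l(delta)): the number of
    events guaranteed to be served when each may cost up to WCET."""
    return math.floor(beta_l(delta) / WCET)

bar_alpha_u = lambda d: math.ceil(d / 10)   # periodic events, T = 10
beta_l = lambda d: 0.5 * d                  # half a service unit per time
```

With these example curves, an interval of length 25 contains at most 3 events, hence at most 12 units of demanded workload, while 40 time units of the slow service guarantee that 5 events are processed.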
A more fine-grained modeling of an application is also possible, for example, by taking into account
different event types in event streams, see Reference 22. By the same approach, it is also possible to model
more complex task models, for example, a task with different production and consumption rates of events
or tasks with several event inputs, see Reference 23. Moreover, the same modeling holds for the load on
the communication links of the execution platform.
In order to construct a scheduling network according to Figure 15.6, we still need to take into account
the resource sharing strategy.
15.3.3 Resource Sharing and Analysis
In Figure 15.6, we see, for example, that the performance modules associated with tasks P1 and P2 are
connected serially. This way, we can model a preemptive fixed-priority resource sharing strategy, as P2
only gets the CPU resource that is left after the workload of P1 has been served. Other resource sharing
strategies can be modeled as well, see, for example, Figure 15.10, where in addition a proportional share
policy is modeled on the left. In this case, a fixed portion of the available resource (computation or
communication) is associated with each task. Other sharing strategies are possible as well, such as FCFS [17].
In the same Figure 15.10, we also see how the workload characterization described in the last section
is used to transform the incoming arrival curve into a representation in terms of the workload for
a resource. After the transformation of the incoming stream by a block called RTC (real-time calculus), the
inverse workload transformation may be applied in order to characterize the stream again by means of
events per time interval. This way, the performance modules can be freely combined, as their input and
output representations match.
FIGURE 15.10 Two examples of resource sharing strategies and their model in the RTC.
FIGURE 15.11 Functional model of resource sharing on computation and communication resources.
We still need to describe how a single workload stream and a resource stream interact on a resource. The
underlying model and analysis depend very much on the underlying execution platform. As a common
example, we suppose that the events (or data packets) corresponding to a single stream are stored in
a queue before being processed, see Figure 15.11. The same model is used for computation as well as for
communication resources. It matches well the common structure of operating systems, where ready tasks
are lined up until the processor is assigned to one of them. Events belonging to one stream are processed
in an FCFS manner, whereas the order between different streams depends on the particular resource sharing
strategy.
Following this model, one can derive the equations that describe the transformation of arrival and
service curves by an RTC module according to Figure 15.10, see, for example, Reference 13:

$$\bar\alpha^{\prime u} = \left[\left(\bar\alpha^u \otimes \bar\beta^u\right) \oslash \bar\beta^l\right] \wedge \bar\beta^u$$
$$\bar\alpha^{\prime l} = \left(\bar\alpha^l \oslash \bar\beta^u\right) \otimes \bar\beta^l$$
$$\bar\beta^{\prime u} = \left(\bar\beta^u - \bar\alpha^l\right) \mathbin{\bar\oslash} 0$$
$$\bar\beta^{\prime l} = \left(\bar\beta^l - \bar\alpha^u\right) \mathbin{\bar\otimes} 0$$
Following Reference 24, the operators used are called min-plus/max-plus convolutions,
$$(f \otimes g)(t) = \inf_{0 \le u \le t} \{ f(t-u) + g(u) \}$$
$$(f \mathbin{\bar\otimes} g)(t) = \sup_{0 \le u \le t} \{ f(t-u) + g(u) \}$$
FIGURE 15.12 Representation of the delay and accumulated buffer space computation in a performance network.
and min-plus/max-plus deconvolutions,
$$(f \oslash g)(t) = \sup_{u \ge 0} \{ f(t+u) - g(u) \}$$
$$(f \mathbin{\bar\oslash} g)(t) = \inf_{u \ge 0} \{ f(t+u) - g(u) \}$$
The pseudo-inverse of the workload curves, the accumulated service, and the resulting bounds are given by
$$(\gamma_i^u)^{-1}(W) = \sup\{ e \ge 0 : \gamma_i^u(e) \le W \}, \quad 1 \le i \le N$$
$$\bar\beta_i^l(\Delta) = (\gamma_i^u)^{-1}\left(\beta_i^l(\Delta)\right), \quad 1 \le i \le N$$
$$\bar\beta^l(\Delta) = \bar\beta_1^l(\Delta) \otimes \bar\beta_2^l(\Delta) \otimes \cdots \otimes \bar\beta_N^l(\Delta)$$
$$\mathrm{delay} \le \sup_{\Delta \ge 0}\, \inf\{ \tau \ge 0 : \bar\alpha^u(\Delta) \le \bar\beta^l(\Delta + \tau) \}$$
$$\mathrm{backlog} \le \sup_{\Delta \ge 0} \{ \bar\alpha^u(\Delta) - \bar\beta^l(\Delta) \}$$
The curve $(\gamma^u)^{-1}(W)$ denotes the pseudo-inverse of a workload curve, that is, it yields the minimum
number of events that can be processed if the service $W$ is available. Therefore, $\bar\beta_i^l(\Delta)$ is the minimal
available service in terms of events per time interval. It has been shown in Reference 17 that the delay
and backlog are determined by the accumulated service $\bar\beta^l(\Delta)$, which can be obtained using the convolution
of all individual services. The delay and backlog can now be interpreted as the maximal horizontal and
vertical distance, respectively, between the arrival and accumulated service curves, see Figure 15.12.
All of the above computations can be implemented efficiently if appropriate representations of the
variability characterization curves are used, for example, piecewise linear, discrete points, or periodic
representations.
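On curves sampled at integer interval lengths 0, 1, ..., H, the operators and bounds above can be sketched as follows. This discrete-point version is our own simplification (practical implementations work on piecewise linear or periodic representations), and the horizon H must be chosen long enough for the service to catch up with the arrivals:

```python
def conv(f, g):
    """Min-plus convolution (f (x) g)(t) = min over 0<=u<=t of f(t-u) + g(u)."""
    return [min(f[t - u] + g[u] for u in range(t + 1)) for t in range(len(f))]

def deconv(f, g):
    """Min-plus deconvolution (f (/) g)(t) = sup over u>=0 of f(t+u) - g(u),
    truncated to the finite horizon."""
    H = len(f) - 1
    return [max(f[t + u] - g[u] for u in range(H - t + 1)) for t in range(H + 1)]

def remaining_service(bl, au):
    """beta'_l = (beta_l - alpha_u) max-plus-convolved with the zero
    function: a running maximum of bl - au, clipped at zero since the
    remaining service cannot be negative."""
    out, best = [], 0.0
    for d in range(len(bl)):
        best = max(best, bl[d] - au[d])
        out.append(max(best, 0.0))
    return out

def delay_backlog(au, bl):
    """Maximal horizontal / vertical distance between the arrival curve au
    and the (accumulated) lower service curve bl (cf. Figure 15.12)."""
    H = len(au) - 1
    backlog = max(au[d] - bl[d] for d in range(H + 1))
    delay = 0
    for d in range(H + 1):
        tau = 0
        while d + tau <= H and bl[d + tau] < au[d]:
            tau += 1
        delay = max(delay, tau)
    return delay, backlog
```

In a fixed-priority chain such as P1/P2 in Figure 15.6, the output of `remaining_service` for P1 would be fed to P2 as its lower service curve, and the accumulated service of several components is obtained with `conv`.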
15.3.4 Concluding Remarks
Because of the modularity of the performance network, one can easily analyze a large number of different
mapping and resource sharing strategies for design space exploration. Applications can be extended by
adding tasks and performance modules. Moreover, different subsystems can use different kinds of resource
sharing without sacrificing the performance analysis.
Of particular interest is the possibility to build a performance component for a combined hardware/software
system that describes the performance properties of a whole subsystem. This way, a subcontractor
can deliver a hardware/software/operating system module that already contains part of the application.
The system house can then integrate the performance components of the subsystems in order to validate
the performance of the whole system. To this end, it does not need to know the details of the subsystem
implementations. In addition, a system house can also add an application to the subsystems. Using
the resource interfaces that characterize the remaining service available from the subsystems, its timing
correctness can easily be verified.
On the one hand, the performance network approach is correct in the sense that it yields upper and lower
bounds on quantities such as end-to-end delay and buffer space. On the other hand, it is a worst-case
approach that covers all possible corner cases independent of their probability. Even though the deviations
from simulation results can be small, see, for example, Reference 25, in many cases one is also interested in
the average-case behavior of distributed embedded systems. Therefore, performance analysis methods such as
those described in this chapter can be considered complementary to existing simulation-based
validation methods.
Furthermore, any automated or semiautomated exploration of different design alternatives (design
space exploration) could be separated into multiple stages, each having a different level of abstraction.
It would then be appropriate to use an analytical performance evaluation framework, such as those
described in this chapter, during the initial stages and resort to simulation only when a relatively small set
of potential architectures has been identified.
References
[1] K. Richter, D. Ziegenbein, M. Jersak, and R. Ernst. Model composition for scheduling analysis
in platform design. In Proceedings of the 39th Design Automation Conference (DAC). ACM Press,
New Orleans, LA, June 2002.
[2] Lothar Thiele, Simon Künzli, and Eckart Zitzler. A modular design space exploration framework
for embedded systems. IEE Proceedings: Computers and Digital Techniques, Special Issue on
Embedded Microelectronic Systems, 2004.
[3] SystemC homepage. http://www.systemc.org.
[4] T. Grötker, S. Liao, G. Martin, and S. Swan. System Design with SystemC. Kluwer Academic
Publishers, Boston, May 2002.
[5] Doug Burger and Todd M. Austin. The SimpleScalar tool set, version 2.0. SIGARCH Computer
Architecture News, 25, 1997, 13–25.
[6] K. Lahiri, A. Raghunathan, and S. Dey. System-level performance analysis for designing on-chip
communication architectures. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems, 20, 2001, 768–783.
[7] G.C. Buttazzo. Hard Real-Time Computing Systems: Predictable Scheduling Algorithms and
Applications. Kluwer Academic Publishers, Boston, 1997.
[8] K. Tindell and J. Clark. Holistic schedulability analysis for distributed hard real-time systems.
Microprocessing and Microprogramming (Euromicro Journal), Special Issue on Parallel Embedded
Real-Time Systems, 40, 1994, 117–134.
[9] T. Yen and W. Wolf. Performance estimation for real-time distributed embedded systems. IEEE
Transactions on Parallel and Distributed Systems, 9, 1998, 1125–1136.
[10] K. Tindell, A. Burns, and A.J. Wellings. Calculating controller area network (CAN) message
response times. Control Engineering Practice, 3, 1995, 1163–1169.
[11] T. Pop, P. Eles, and Z. Peng. Holistic scheduling and analysis of mixed time/event triggered distributed
embedded systems. In Proceedings of the International Symposium on Hardware/Software
Codesign (CODES). ACM Press, May 2002, pp. 187–192.
2006 by Taylor & Francis Group, LLC
15-18 Embedded Systems Handbook
[12] L. Thiele, S. Chakraborty, M. Gries, A. Maxiaguine, and J. Greutert. Embedded software in
network processors: models and algorithms. In Proceedings of the 1st Workshop on Embedded
Software (EMSOFT), Lake Tahoe, CA, USA, Vol. 2211 of Lecture Notes in Computer Science.
Springer-Verlag, Heidelberg, 2001, pp. 416–434.
[13] L. Thiele, S. Chakraborty, M. Gries, and S. Künzli. A framework for evaluating design tradeoffs in
packet processing architectures. In Proceedings of the 39th Design Automation Conference (DAC).
ACM Press, New Orleans, LA, June 2002, pp. 880–885.
[14] Kai Richter, Marek Jersak, and Rolf Ernst. A formal approach to MPSoC performance verification.
IEEE Computer, 36, 2003, 60–67.
[15] K. Richter and R. Ernst. Model interfaces for heterogeneous system analysis. In Proceedings of
the 6th Design, Automation and Test in Europe (DATE). IEEE, Munich, Germany, March 2002,
pp. 506–513.
[16] J.A. Stankovic, M. Spuri, K. Ramamritham, and G.C. Buttazzo. Deadline scheduling for real-time
systems: EDF and related algorithms. In Kluwer International Series in Engineering and Computer
Science, Vol. 460. Kluwer Academic Publishers, Dordrecht, 1998.
[17] J.-Y. Le Boudec and P. Thiran. Network Calculus: A Theory of Deterministic Queuing Systems for
the Internet, Vol. 2050 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2001.
[18] R.L. Cruz. A calculus for network delay, part I: network elements in isolation. IEEE Transactions
on Information Theory, 37, 1991, 114–131.
[19] L. Thiele, S. Chakraborty, and M. Naedele. Real-time calculus for scheduling hard real-time
systems. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS),
Vol. 4. IEEE, 2000, pp. 101–104.
[20] S. Chakraborty, S. Künzli, L. Thiele, A. Herkersdorf, and P. Sagmeister. Performance evaluation
of network processor architectures: combining simulation with analytical estimation. Computer
Networks, 41, 2003, 641–665.
[21] Alexander Maxiaguine, Simon Künzli, and Lothar Thiele. Workload characterization model for
tasks with variable execution demand. In Proceedings of Design Automation and Test in Europe
(DATE). IEEE Press, Paris, France, February 2004, pp. 1040–1045.
[22] Ernesto Wandeler, Alexander Maxiaguine, and Lothar Thiele. Quantitative characterization of
event streams in analysis of hard real-time applications. In Proceedings of the 10th IEEE Real-Time
and Embedded Technology and Applications Symposium (RTAS). IEEE Computer Society, May 2004,
pp. 450–459.
[23] Ernesto Wandeler and Lothar Thiele. Abstracting functionality for modular performance analysis
of hard real-time systems. In Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE,
January 2005.
[24] F. Baccelli, G. Cohen, G. Olsder, and J.-P. Quadrat. Synchronization and Linearity. John Wiley &
Sons, New York, 1992.
[25] S. Chakraborty, S. Künzli, and L. Thiele. A general framework for analysing system properties in
platform-based embedded system designs. In Proceedings of the 6th Design, Automation and Test
in Europe (DATE). Munich, Germany, March 2003.
Power Aware Computing

16 Power Aware Embedded Computing
Margarida F. Jacome and Anand Ramachandran
16
Power Aware Embedded Computing

Margarida F. Jacome and Anand Ramachandran
University of Texas
16.1 Introduction 16-1
16.2 Energy and Power Modeling 16-3
    Instruction- and Function-Level Models • Micro-Architectural Models • Memory and Bus Models • Battery Models
16.3 System/Application Level Optimizations 16-7
16.4 Energy Efficient Processing Subsystems 16-8
    Voltage and Frequency Scaling • Dynamic Resource Scaling • Processor Core Selection
16.5 Energy Efficient Memory Subsystems 16-11
    Cache Hierarchy Tuning • Novel Horizontal and Vertical Cache Partitioning Schemes • Dynamic Scaling of Memory Elements • Software-Controlled Memories, Scratch-Pad Memories • Improving Access Patterns to Off-Chip Memory • Special Purpose Memory Subsystems for Media Streaming • Code Compression • Interconnect Optimizations
16.6 Summary 16-17
References 16-17
16.1 Introduction
Embedded systems are pervasive in modern life. State-of-the-art embedded technology drives the ongoing
revolution in consumer and communication electronics, and is at the basis of substantial innovation
in many other domains, including medical instrumentation, process control, etc. [1]. The impact of
embedded systems in well-established traditional industrial sectors, for example, the automotive industry,
is also increasing at a fast pace [1,2].
Unfortunately, as Complementary Metal-Oxide Semiconductor (CMOS) technology rapidly scales,
enabling the fabrication of ever faster and denser Integrated Circuits (ICs), the challenges that must be
overcome to deliver each new generation of electronic products multiply. In the last few years, power
dissipation has emerged as a major concern. In fact, projections on power density increases owing to
CMOS scaling clearly indicate that this is one of the fundamental problems that will ultimately preclude
further scaling [3,4]. Although the power challenge is indeed considerable, much can be done to mitigate
the deleterious effects of power dissipation, thus enabling performance and device density to be taken to
truly unprecedented levels by the semiconductor industry throughout the next 10 to 15 years.
Power density has a direct impact on packaging and cooling costs, and can also affect system reliability,
owing to electromigration and hot-electron degradation effects. Thus, the ability to decrease power
density, while offering similar performance and functionality, critically enhances the competitiveness of
a product. Moreover, for battery operated portable systems, maximizing battery lifetime translates into
maximizing duration of service, an objective of paramount importance for this class of products. Power
is thus a primary figure of merit in contemporary embedded system design.
Digital CMOS circuits have two main types of power dissipation: dynamic and static. Dynamic power
is dissipated when the circuit performs the function(s) it was designed for, for example, logic and arithmetic
operations (computation), data retrieval, storage, transport, etc. Ultimately, all of this activity
translates into switching of the logic states held on circuit nodes. Dynamic power dissipation is thus
proportional to $C \cdot V_{DD}^2 \cdot f \cdot r$, where $C$ denotes the total circuit capacitance, $V_{DD}$ and $f$ denote the
circuit supply voltage and clock frequency, respectively, and $r$ denotes the fraction of transistors expected
to switch at each clock cycle [5,6]. In other words, dynamic power dissipation is impacted to first order
by circuit size/complexity, speed/rate, and switching activity. In contrast, static power dissipation is associated
with preserving the logic state of circuit nodes between such switching activity, and is caused by
subthreshold leakage mechanisms. Unfortunately, as device sizes shrink, the severity of leakage power is
increasing at an alarming pace [3].
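To illustrate the first-order model, the following sketch (with invented, merely illustrative parameter values) shows why combined voltage and frequency scaling is so attractive: voltage enters quadratically and frequency linearly, so halving both cuts dynamic power by roughly a factor of eight.

```python
def dynamic_power(c, vdd, f, r):
    """First-order dynamic power: P ~ C * Vdd^2 * f * r, with total
    capacitance, supply voltage, clock frequency, and activity factor."""
    return c * vdd ** 2 * f * r

# Two hypothetical operating points of a made-up core:
p_fast = dynamic_power(c=1e-9, vdd=1.2, f=600e6, r=0.2)   # about 0.17 W
p_slow = dynamic_power(c=1e-9, vdd=0.6, f=300e6, r=0.2)   # about 0.02 W
# p_fast / p_slow is roughly 8: a cubic payoff for scaling Vdd and f together.
```

Note that the energy per operation falls only quadratically (the slower run takes twice as long), which is exactly the trade-off exploited by the voltage and frequency scaling techniques discussed later in this chapter.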
Clearly, the power problem must be addressed at all levels of the design hierarchy, from system to
circuit, as well as through innovations in CMOS device technology [5,7,8]. In this survey we provide
a snapshot of the state of the art in system- and architecture-level design techniques and methodologies
aimed at reducing both static and dynamic power dissipation. Since such techniques focus on the
highest level of the design hierarchy, their potential benefits are immense. In particular, at this high level
of abstraction, the specifics of each particular class of embedded applications can be considered as a
whole and, as will be shown in our survey, such an ability is critical to designing power/energy efficient
systems, that is, systems that spend energy strictly when and where it is needed. Broadly speaking, this
requires a proper design and allocation of system resources, geared toward addressing critical performance
bottlenecks in a power efficient way. Substantial power/energy savings can also be achieved through
the implementation of adequate dynamic power management policies, for example, tracking instantaneous
workloads (or levels of resource utilization) and shutting down idling/unused resources, so as to
reduce leakage power, or slowing down under-utilized resources, so as to decrease dynamic power dissipation.
These are clearly system level decisions/policies, in that their implementation typically impacts
several architectural subsystems. Moreover, different decisions/policies may interfere or conflict with
each other and, thus, assessing their overall effectiveness requires a system level (i.e., global) view of the
problem.
A typical embedded system architecture consists of a processing subsystem (including one or more processor
cores, hardware accelerators, etc.), a memory subsystem, peripherals, and global and local interconnect
structures (buses, bridges, crossbars, etc.). Figure 16.1 shows an abstract view of two such architecture
instances. Broadly speaking, system level design consists of defining the specific embedded system architecture
to be used for a particular product, as well as defining how the target embedded application
(implementing the required functionality/services) is to be mapped onto that architecture.
Embedded systems come in many varieties and with many distinct design optimization goals and
requirements. Even when two products provide the same basic functionality, say, video encoding/decoding,
they may have fundamentally different characteristics, namely, different performance and quality-of-service
requirements, one may be battery operated and the other not, etc. The implications of such product
differentiation are of paramount importance when power/energy are considered. Clearly, the higher
the system's required performance/speed (defined by metrics such as throughput, latency, bandwidth,
response time, etc.), the higher will be its power dissipation. The key objective is thus to minimize
the power dissipated to deliver the required level of performance [5,9]. The trade-offs, techniques, and
optimizations required to develop such power aware or power efficient designs vary widely across the vast
spectrum of embedded systems available in today's market, encompassing many complex decisions driven
by system requirements as well as intrinsic characteristics of the target applications [10].
[Figure 16.1 depicts two architectures built from components such as a DSP core, a master control unit, an ASIP core with memory controller, a sound codec, a modem, A/D and D/A converters, I/O ports, RAM and Flash/ROM, a host interface, VLIW cores, an embedded processor core, and a hardware accelerator (FFT, DCT, ...).]
FIGURE 16.1 Illustrative examples of a simple and a more complex embedded system architecture.
Consider, for example, the design task of deciding on the number and type of processing elements
to be instantiated on an embedded architecture, that is, defining its processing subsystem. Power, performance,
cost, and time-to-market considerations dictate whether one should rely entirely on readily available
processors (i.e., off-the-shelf microcontrollers, Digital Signal Processors (DSPs), and/or general-purpose
RISC cores), or should also consider custom execution engines, namely, Application Specific Instruction
set Processors (ASIPs), possibly reconfigurable, and/or hardware accelerators (see Figure 16.1).
Hardware/software partitioning is a critical step in this process [1]. It consists of deciding which of
an application's segments/functions should be implemented in software (i.e., run on a processor core) and
which (if any) should be implemented in hardware (i.e., execute on high performance, highly power efficient
custom hardware accelerators). Naturally, hardware/software partitioning decisions should reflect
the power/performance criticality of each such segment/function. Clearly, this is a complex multiobjective
optimization problem defined on a huge design space that encompasses both hardware and software
related decisions. To compound the problem, the performance and energy efficiency of an architecture's
processing subsystem cannot be evaluated in isolation, since its effectiveness can be substantially impacted
by the memory subsystem (i.e., the adopted memory hierarchy/organization) and the interconnect structures
supporting communication/data transfers between processing components and to/from the environment
in which the system is embedded. Thus, decisions with respect to these other subsystems and components
must be made concurrently and assessed jointly.
Targeting up front a specific embedded system platform, that is, an architectural subspace relevant to
a particular class of products/applications, can considerably reduce the design effort [11,12]. Still, the
design space typically remains so complex that a substantial design space exploration may be needed
in order to identify power/energy efficient solutions for the specified performance levels. Since time to
market is critical, methodologies to efficiently drive such an exploration, as well as fast simulators and
low complexity (yet good fidelity) performance, power, and energy estimation models, are critical to
aggressively exploiting effective power/energy driven optimizations within a reasonable time frame.
Our survey starts by providing an overview of state-of-the-art models and tools used to evaluate
the goodness of individual system design points. We then discuss power management techniques and
optimizations aimed at aggressively improving the power/energy efficiency of the various subsystems of
an embedded system.
16.2 Energy and Power Modeling
This section discusses high-level modeling and power estimation techniques aimed at assisting system-
and architecture-level design. It would be unrealistic to expect a high degree of accuracy in power estimates
produced during such an early design phase, since accurate power modeling requires detailed physical
level information that may not yet be available. Moreover, highly accurate estimation tools (working with
detailed circuit/layout-level information) would be too time consuming to allow for any reasonable degree
of design space exploration [1,5,13].
Thus, practically speaking, power estimation during early design space exploration should aim at
ensuring a high degree of fidelity rather than necessarily accuracy. Specifically, the primary objective
during this critical exploration phase is to assess the relative power efficiency of different candidate system
architectures (populated with different hardware and/or software components), the relative effectiveness
of alternative software implementations (of the same functionality), the relative effectiveness of different
power management techniques, etc. Estimates that correctly expose such relative power trends across
the design space region being explored provide the designer with the necessary information to guide the
exploration process.
16.2.1 Instruction- and Function-Level Models
Instruction-level power models are used to assess the relative power/energy efficiency of different processors
executing a given target embedded application, possibly with alternative memory subsystem configurations.
Such models are thus instrumental during the definition of the main subsystems of an embedded
architecture, as well as during hardware/software partitioning. Moreover, instruction-level power models
can also be used to evaluate the relative effectiveness of different software implementations of the same
embedded application in the context of a specific embedded architecture/platform.
In their most basic form, instruction-level power models simply assign a power cost to each assembly
instruction (or class of assembly instructions) of a programmable processor. The overall energy consumed
by a program running on the target processor is estimated by summing up the instruction costs for
a dynamic execution trace that is representative of the application [14–17].
Instruction-level power models were first developed by experimentally measuring the current drawn by
a processor while executing different instruction sequences [14]. During this first modeling effort, it was
observed that the power cost of an instruction may actually depend on previous instructions. Accordingly,
the instruction-level power models developed in Reference 14 include several inter-instruction effects.
Later studies observed that, for certain processors, the power dissipation incurred by the hardware responsible
for fetching, decoding, analyzing, and issuing instructions, and then routing and reordering results,
was so high that a simpler model that only differentiates between instructions that access on-chip resources
and those that go off-chip would suffice for such processors [16].
Unfortunately, power estimation based on instruction-level models can still be prohibitively time-consuming
during early design space exploration, since it requires collecting and analyzing large instruction
traces and, for many processors, considering a quadratically large number of inter-instruction effects.
In order to accelerate estimation, coarser, processor-specific function-level power models were later
developed [18]. Such approaches are faster because they rely on the use of macromodels characterizing
the average energy consumption of a library of functions/subroutines executing on a target processor [18].
The key challenge in this case is to devise macromodels that can properly quantify the power consumed by
each subroutine of interest, as a function of easily observable parameters. Thus, for example, a quadratic
power model of the form an² + bn + c could first be tentatively selected for an insertion sort routine, where
n denotes the number of elements to be sorted. Actual power dissipation then needs to be measured for
a large number of experiments, run with different values of n. Finally, the values of the macromodel's
coefficients a, b, and c are derived using regression analysis, and the overall accuracy of the resulting
macromodel is assessed [18].
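The regression step can be sketched as below: the routine fits the coefficients a, b, and c of the quadratic macromodel to measured (n, energy) pairs by ordinary least squares, solving the 3×3 normal equations directly. The data points passed in would come from the measurement experiments described above; here they are synthetic:

```python
def fit_quadratic(ns, energies):
    """Least-squares fit of E(n) = a*n**2 + b*n + c, returning (a, b, c)."""
    # Power sums S0..S4 and moment sums for the normal equations.
    s = [sum(n ** k for n in ns) for k in range(5)]
    t = [sum(e * n ** k for n, e in zip(ns, energies)) for k in range(3)]
    A = [[s[4], s[3], s[2]],
         [s[3], s[2], s[1]],
         [s[2], s[1], s[0]]]
    r = [t[2], t[1], t[0]]
    # Gaussian elimination with partial pivoting on the 3x3 system.
    for i in range(3):
        p = max(range(i, 3), key=lambda k: abs(A[k][i]))
        A[i], A[p] = A[p], A[i]
        r[i], r[p] = r[p], r[i]
        for k in range(i + 1, 3):
            f = A[k][i] / A[i][i]
            for j in range(i, 3):
                A[k][j] -= f * A[i][j]
            r[k] -= f * r[i]
    x = [0.0, 0.0, 0.0]
    for i in range(2, -1, -1):
        x[i] = (r[i] - sum(A[i][j] * x[j] for j in range(i + 1, 3))) / A[i][i]
    return tuple(x)
```

Fitting data generated from 2n² + 3n + 1 recovers a ≈ 2, b ≈ 3, c ≈ 1, as expected.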
The high-level instruction- and function-level power models discussed so far allow designers to
quickly assess a large number of candidate system architectures and alternative software implementations,
so as to narrow the design space to a few promising alternatives. Once this initial broad exploration is
concluded, power models for each of the architecture's main subsystems and components are needed, in
order to support the detailed architectural design phase that follows.
2006 by Taylor & Francis Group, LLC
Power Aware Embedded Computing 16-5
16.2.2 Micro-Architectural Models
Micro-architectural power models are critical to evaluating the impact of different processing subsystem
choices on power consumption, as well as the effectiveness of different (micro-architecture level)
power management techniques implemented on the various subsystems.
In the late 1980s and early 1990s, cycle-accurate (or, more precisely, cycle-by-cycle) simulators, such
as Simplescalar [19], were developed to study the effect of architectural choices on the performance of
general-purpose processors. Such simulators are in general very flexible, allowing designers/architects
to explore the complex design space of contemporaneous processors. Namely, they include built-in
parameters that can be used to specify the number and mix of functional units to be instantiated in the
processor's datapath, the issue width of the machine, the size and associativity of the L1 and L2 caches, etc.
By varying such parameters, designers can study the performance of different machine configurations for
representative applications/benchmarks. As power consumption became more important, simulators to
estimate dynamic power dissipation (e.g., Wattch [20], the Cai-Lim model [21], and SimplePower [22])
were later incorporated into these existing frameworks. Such an integration was performed seamlessly,
by directly augmenting the cycle-oriented performance models for the various micro-architectural
components with corresponding power models.
Naturally, the overall accuracy of these simulation-based power estimation techniques is determined
by the level of detail of the power models used for the micro-architecture's constituent components. For
out-of-order RISC cores, for example, the power consumed in finding independent instructions to issue
is a function of the number of instructions currently in the instruction queue and of the actual dependen-
cies between such instructions. Unfortunately, the use of detailed power models accurately capturing the
impact of input and state data on the power dissipated by each component would prohibitively increase the
already long micro-architectural simulation runtimes. Thus, most state-of-the-art simulators use very
simple/straightforward empirical power models for datapath and control logic, and slightly more soph-
isticated models for regular structures such as caches [20]. In their simplest form, such models capture
typical or average power dissipation for each individual micro-architectural component. Specifically,
each time a given component is accessed/used during a simulation run, it is assumed that it dissipates
its corresponding average power. Slightly more sophisticated power macromodels for datapath com-
ponents have been proposed in References 2328, and shown to improve accuracy with a relatively small
impact on simulation time.
So far, we have discussed power modeling of micro-architectural components, yet a substantial percent-
age of the overall power budget of a processor is actually spent on the global clock (up to 40 to 45% [5]).
Thus, global clock power models must also be incorporated in these frameworks. The power dissipated
on global clock distribution is impacted to first order by the number of pipeline registers (and thus by
a processor's pipeline depth) and by global and local wiring capacitances (and thus by a processor's core
area) [5]. Accordingly, different processor cores and/or different configurations of the same core may
dissipate substantially different clock distribution power. Power estimates incorporating such numbers
are thus critical during processor core selection and conguration.
The component-level and clock distribution models discussed so far are used to estimate the dynamic
power dissipation of a target micro-architecture. Yet, as mentioned earlier, static/leakage power dissipation
is becoming a major concern, and thus, micro-architectural techniques aimed at reducing leakage power
are increasingly relevant. Models to support early estimation of static power dissipation emerged along the
same lines as those used for dynamic power dissipation. The Butts-Sohi model, which is one of the most
influential static power models developed so far, quantifies static energy in CMOS circuits/components
using a lumped parameter model that maps technology and design effects into corresponding characterizing
parameters [29]. Specifically, static power dissipation is modeled as V_DD · N · k_design · I_leak,
where V_DD is the supply voltage and N denotes the number of transistors in the circuit. k_design is the
design-dependent parameter: it captures circuit-style-related characteristics of a component, including
average transistor aspect ratio, average number of transistors switched off during normal/typical
component operation, etc. Finally, I_leak is the technology-dependent parameter. It accounts for the impact
of threshold voltage, temperature, and other key parameters, on leakage current, for a specic fabrication
process.
From a system designer's perspective, static power can be reduced by lowering supply voltage (V_DD),
and/or by power supply gating or V_DD gating (as opposed to clock gating) unused/idling devices (N).
Integrating models for estimating static power dissipation into cycle-by-cycle simulators thus enables
embedded system designers to analyze critical static power versus performance trade-offs enabled by
power aware features available in contemporaneous processors, such as dynamic voltage scaling and
selective datapath (re)configuration. An improved version of the Butts-Sohi model, providing the
ability to dynamically recalculate leakage currents (as temperature and voltage change owing to operating
conditions and/or dynamic voltage scaling), has been integrated into the Simplescalar simulation
framework as HotLeakage, enabling such high-level trade-offs to be explored by embedded system
designers [30].
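The lumped Butts-Sohi estimate reduces to a one-line computation. The second routine below is only a hypothetical exponential temperature dependence, with an arbitrary coefficient, added to illustrate why a HotLeakage-style framework must recalculate I_leak at runtime; it is not the model's actual equation:

```python
import math

def static_power(vdd, n_transistors, k_design, i_leak):
    """Butts-Sohi lumped model: P_static = V_DD * N * k_design * I_leak [29]."""
    return vdd * n_transistors * k_design * i_leak

def i_leak_scaled(i_leak_ref, temp_k, temp_ref_k=353.0, beta=0.02):
    """Hypothetical exponential temperature scaling of the leakage
    parameter (leakage grows roughly exponentially with temperature)."""
    return i_leak_ref * math.exp(beta * (temp_k - temp_ref_k))
```

For example, a 50-million-transistor component at V_DD = 1.2 V with k_design = 0.1 and I_leak = 20 nA per transistor dissipates 0.12 W of static power under this model.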
16.2.3 Memory and Bus Models
Storage elements such as caches, register files, queues, buffers, and tables constitute a substantial part of
the power budget of contemporaneous embedded systems [31]. Fortunately, the high regularity of some
such memory structures (e.g., caches) permits the use of simple, yet reasonably accurate power estimation
techniques, relying on automatically synthesized structural designs for such components.
The Cache Access and Cycle Time (CACTI) framework implements this synthesis-driven power estimation
paradigm. Specifically, given a specific cache hierarchy configuration (defined by parameters such
as cache size, associativity, and line size), as well as information on the minimum feature size of the target
technology [32], it internally generates a coarse structural design for such a cache configuration. It then
derives delay and power estimates for that particular design, using parameterized built-in C models for
the various constituent elements, namely, SRAM cells, row and column decoders, word and bit lines,
precharge circuitry, etc. [33,34].
CACTI's synthesis algorithms used to generate the structural design of the memory hierarchy (which
include defining the aspect ratio of memory blocks, the number of instantiated sub-banks, etc.) have
been shown to consistently deliver reasonably good designs across a large range of cache hierarchy
parameters [34]. CACTI can thus be used to quickly generate power estimates (starting from high-level
architectural parameters) exhibiting reasonably good fidelity over a large region of the design
space. During design space exploration, the designer may thus consider a number of alternative L1 and
L2 cache configurations and use CACTI to obtain access-based power dissipation estimates for each
such configuration with good fidelity. Naturally, the memory access traces used by CACTI should be
generated by a micro-architecture simulator (e.g., Simplescalar) working with a memory simulator (e.g.,
Dinero [35]), so that they reflect the bandwidth requirements of the embedded application of interest.
Buses are also a significant contributor to dynamic power dissipation [5,36]. The dynamic power
dissipation on a bus is proportional to C · V_DD² · fa, where C denotes the total capacitance of the bus
(including metal wires and buffers), V_DD denotes the supply voltage, and fa denotes the average switching
frequency of the bus [36]. In this high-level model, the average switching frequency of the bus (fa) is
defined by the product of two terms, namely, the average number of bus transitions per word and the
bus frequency (given in bus words per second). The average number of bus transitions per word can be
estimated by simulating sample programs and collecting the corresponding transition traces. Although
this model is coarse, it may suffice during the early design phases under consideration.
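This bus power model can be sketched directly: count bit transitions over a word trace, then apply C · V_DD² · fa. The trace values and electrical parameters used in the test are made up for illustration:

```python
def avg_transitions_per_word(trace):
    """Average number of bit flips per transmitted word, over a bus trace
    of successive word values."""
    flips = sum(bin(prev ^ cur).count("1")
                for prev, cur in zip(trace, trace[1:]))
    return flips / (len(trace) - 1)

def bus_power(c_total, vdd, words_per_sec, trace):
    """P = C * V_DD^2 * fa, where fa = (avg transitions/word) * (words/sec)."""
    fa = avg_transitions_per_word(trace) * words_per_sec
    return c_total * vdd ** 2 * fa
```

For instance, a trace alternating between 0b0000 and 0b1111 toggles four lines per word; at 10 pF total capacitance, 1.0 V, and one million words per second, the model gives 40 microwatts.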
16.2.4 Battery Models
The capacity of a battery is a nonlinear function of the current drawn from it; that is, if one increases
the average current drawn from a battery by a factor of two, the remaining deliverable battery capacity,
and thus its lifetime, decreases by more than half. Peukert's formula models such nonlinear behavior by
defining the capacity of a battery as k/I^α, where I is the average discharge current and k and α are
experimentally determined, battery-specific constants.
[Figure: y-axis shows energy-delay product, in mJ · cycles (×100,000).]
FIGURE 16.3 Design space exploration: energy-delay product for various L1 and L2 D-cache configurations for a
JPEG application running on an XScale-like processor core.6
6 Specifically, the energy term accounts for accesses to the L1 and L2 on-chip D-caches, and to main memory.
a high degree of associativity, clearly indicating that no substantial performance gains would be achieved
(for this particular media application) by using such aggressively dimensioned memory subsystems.
For many embedded systems/applications, the power efficiency of the memory subsystems can be
improved even more aggressively, yet this requires the use of novel (nonstandard) memory system designs,
as discussed in the sections that follow.
16.5.2 Novel Horizontal and Vertical Cache Partitioning Schemes
In recent years, several novel cache designs have been proposed to aggressively reduce the average dynamic
energy consumption incurred by memory accesses. Energy efficiency is improved in these designs by
taking direct advantage of specific characteristics of target classes of applications. The memory footprint
of instructions and data in media applications, for example, tends to be very small, thus creating unique
opportunities for energy savings [75]. Since streaming media applications are pervasive in today's portable
electronics market, they have been a preferred application domain for validating the effectiveness of such
novel cache designs.
Vertical partition schemes [80–83], as the name suggests, introduce additional small buffers/caches
before the first level of the traditional memory hierarchy. For applications with small working sets, this
strategy can lead to considerable dynamic power savings.
A concrete example of a vertical partition scheme is the filter cache [84], which is a very small cache
placed in front of the standard L1 data cache. If the filter cache is properly dimensioned, dynamic energy
consumption in the memory hierarchy can be substantially reduced, not only by accessing most of the
data from the filter cache, but also by powering down (clock gating) the L1 D-cache to a STANDBY
mode during periods of inactivity [84]. Although switching the L1 D-cache to STANDBY mode results in
delay/energy penalties when there is a miss in the filter cache, it was observed that for media applications,
the energy-delay product did improve quite significantly when the two techniques were combined.
Predecoded instruction buffers [85] and loop buffers [86] are variants of the vertical partitioning scheme
discussed above, yet applied to instruction caches (I-caches). The key idea of the first partitioning
scheme mentioned is to store recently used instructions in an instruction buffer, in a decoded form,
so as to reduce the average dynamic power spent on fetching and decoding instructions. The second
partitioning scheme allows one to hold time-critical loop bodies (identified a priori by the compiler or by
the programmer) in small, and thus energy-efficient, dedicated loop buffers.
Horizontal partition schemes refer to the placement of additional (small) buffers or caches at the
same level as the L1 cache. For each memory reference, the appropriate (level one) cache to be accessed
is determined by dedicated decoding circuitry residing between the processor core and the memory
hierarchy. Naturally, the method used to partition data across the set of rst level caches should ensure
that the cache selection logic is simple, and thus cache access times are not signicantly affected.
Region-based caches implement one such horizontal partitioning scheme, by adding two small 2 KB L1
D-caches to the first level of the memory hierarchy, one for stack and one for global data. This arrangement
has also been shown to achieve substantial gains in dynamic energy consumption for streaming media
applications with a negligible impact on performance [87].
16.5.3 Dynamic Scaling of Memory Elements
With an increasing number of on-chip transistors being devoted to storage elements in modern processors,
of which only a very small set is active at any point in time, static power dissipation is expected to soon
become a key contributor to a processor's power budget. State-of-the-art techniques to reduce static
power consumption in on-chip memories are based on the simple observation that, in general, data or
instructions fetched into a given cache line receive an immediate flurry of accesses during a small interval
of time, followed by a relatively long period of time during which they are not used, before eventually being
evicted to make way for new data/instructions [88,89]. If one can guess when that period starts, it is
possible to switch off (i.e., V_DD gate) the corresponding cache lines without introducing extra cache
misses, thereby saving static energy consumption with no impact on performance [90,91].
Cache Decay was one of the earliest attempts to exploit such a "generational" memory usage behavior to
decrease leakage power [90]. The original Cache Decay implementation used a simple policy that turned
off cache lines after a fixed number of cycles (the decay interval) since the last access. Note that if the selected
decay interval happens to be too small, cache lines are switched off prematurely, causing extra cache misses,
and if it is too large, opportunities for saving leakage energy are missed. Thus, when such a simple scheme
is used, it is critical to tune the fixed decay interval very carefully, so that it adequately matches the access
patterns of the embedded application of interest.
so as to dynamically adjust it to the changing access patterns, have been proposed more recently, so as to
enable the use of the cache decay principle across a wider range of applications [90,91]. Similar leakage
energy reduction techniques have also been proposed for issue queues [59,60,92] and branch prediction
tables [93].
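A minimal sketch of the fixed-interval decay policy described above: the bookkeeping only tracks which lines would be gated off and which accesses would incur decay-induced misses, and is not tied to any particular cache organization:

```python
class DecayCache:
    """Fixed decay-interval policy (in the spirit of Cache Decay [90]):
    a line idle for more than decay_interval cycles is assumed to be
    VDD-gated, i.e., switched off to save leakage energy."""

    def __init__(self, decay_interval):
        self.decay_interval = decay_interval
        self.last_access = {}   # line index -> cycle of last access
        self.decay_misses = 0   # accesses that found their line gated off

    def access(self, line, cycle):
        t = self.last_access.get(line)
        if t is not None and cycle - t > self.decay_interval:
            self.decay_misses += 1  # line had decayed: extra miss incurred
        self.last_access[line] = cycle

    def lines_off(self, cycle):
        """Number of lines currently gated off (saving leakage)."""
        return sum(1 for t in self.last_access.values()
                   if cycle - t > self.decay_interval)
```

Running a trace against two different decay intervals makes the tuning trade-off concrete: a short interval inflates decay_misses, while a long one keeps lines_off near zero.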
Naturally, leakage energy reduction techniques for instruction/program caches are also very critical [94].
A technique has recently been proposed that monitors the performance of the instruction cache over time,
and dynamically scales (via V_DD gating) its size, so as to closely match the size of the working set of the
application [94].
16.5.4 Software-Controlled Memories, Scratch-Pad Memories
Most of the novel designs and/or techniques discussed so far require an application-driven tuning of several
architecturally visible parameters. However, similar to more traditional cache hierarchies, the memory
subsystem interface implemented on these novel designs still exposes a flat view of the memory hierarchy to
the compiler/software. That is, the underlying details of the memory subsystem architecture are essentially
transparent to both.
Dynamic power dissipation incurred by accesses to basic memory modules occurs owing to switching
activity in bit lines, word lines, and input and output lines. Traditional caches have additional switching
overheads, owing to the circuitry (comparators, multiplexers, tags, etc.) needed to provide the flat
memory interface alluded to above. Since the hardware assists necessary to support such a transparent view
of the memory hierarchy are quite power hungry, additional energy saving opportunities can be created
by relying more on the compiler (and less on dedicated hardware) to manage the memory subsystem.
The use of software-controlled (rather than hardware-controlled) memory components is thus becoming
increasingly prevalent in power aware embedded system design.
Scratch-Pads are an example of such novel, software-controlled memories [95–100]. Scratch-Pads
are essentially on-chip partitions of main memory directly managed by the compiler. Namely, decisions
concerning data/instruction placement in on-chip Scratch-Pads are made statically by the compiler, rather
than dynamically using dedicated hardware circuitry. Therefore, these memories are much less complex,
and thus less power hungry, than traditional caches. As one would expect, the ability to aggressively improve
energy-delay efficiency through the use of Scratch-Pads is predicated on the quality of the decisions
made by the compiler on the subset of data/instructions that are to be assigned to that limited memory
space [95,101]. Several compiler-driven techniques have been proposed to identify the data/instructions
that can be most profitably assigned to the Scratch-Pad, with frequency of use being one of the key
selection criteria [102–104].
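As an illustration of the frequency-of-use criterion, the sketch below greedily places the objects with the highest access count per byte into a fixed-capacity Scratch-Pad. This is only a simple heuristic; the published techniques [102–104] use more sophisticated (e.g., knapsack-based) formulations:

```python
def allocate_scratchpad(objects, capacity_bytes):
    """Greedy compile-time Scratch-Pad allocation.

    objects: list of (name, size_bytes, access_count) tuples, e.g., arrays
    or basic blocks profiled by the compiler. Objects are ranked by
    accesses per byte; those that fit within the capacity are selected.
    """
    chosen, used = [], 0
    for name, size, accesses in sorted(
            objects, key=lambda o: o[2] / o[1], reverse=True):
        if used + size <= capacity_bytes:
            chosen.append(name)
            used += size
    return chosen
```

Every access redirected from a cache to the Scratch-Pad avoids the tag-compare and multiplexing energy of a cache lookup, which is where the savings quantified in [95,101] come from.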
The Cool Cache architecture [105], also proposed for media applications, is a good example of a novel,
power aware memory subsystem that relies on the use of software-controlled memories. It uses a small
Scratch-Pad and a software-controlled cache, each of which is implemented on a different on-chip
SRAM. The program's scalars are mapped to the small (2 KB) Scratch-Pad [100].7 Nonscalar data is
mapped to the software-controlled cache, and the compiler is responsible for translating virtual addresses
to SRAM lines, using a small register lookup area. Even though cache misses are handled in software,
7 This size was found to be sufficient for most media applications.
thereby incurring substantial latency/energy penalties, the overall architecture has been shown to yield
substantial energy-delay product improvements for media applications, when compared to traditional
cache hierarchies [105].
The effectiveness of techniques such as the above is so pronounced that several embedded processors
currently offer a variety of software-controlled memory blocks, including configurable Scratch-Pads (TI's
320C6x [106]), lockable caches (Intel's XScale [49] and Trimedia [107]), and stream buffers (Intel's
StrongARM [49]).
16.5.5 Improving Access Patterns to Off-Chip Memory
During the last few decades, there has been a substantial effort in the compiler domain aimed at min-
imizing the number of off-chip memory accesses incurred by optimized code, as well as enabling the
implementation of aggressive prefetching strategies. This includes devising compiler techniques to restructure,
reorganize, and lay out data in off-chip memory, as well as techniques to properly reorder a program's
memory access patterns [108–112].
Prefetching techniques have received considerable attention lately, particularly in the domain of embedded
streaming media applications. Instruction and data prefetching techniques can be hardware- or
software-driven [113–120]. Hardware-based data prefetching techniques try to dynamically predict when
a given piece of data will be needed, so as to load it into cache (or into some dedicated on-chip buffer)
before it is actually referenced by the application (i.e., explicitly required by a demand access) [114–116].
In contrast, software-based data prefetching techniques work by inserting prefetch instructions for selected
data references at carefully chosen points in the program; such explicit prefetch instructions are
executed by the processor to move data into cache [117–120].
It has been extensively demonstrated that, when properly used, prefetching techniques can substantially
improve average memory access latencies [113–120]. Moreover, techniques that prefetch substantial
chunks of data (rather than, say, a single cache line), possibly to a dedicated buffer, can also simultaneously
decrease dynamic power dissipation [121]. Namely, when data is brought from off-chip memory
in large bursts, energy-efficient burst/page access modes can be more effectively exploited. Moreover,
by prefetching large quantities of instructions/data, the average length of DRAM idle times is expected
to increase, thus creating more profitable opportunities for the DRAM to be switched to a lower power
mode [122–124]. Naturally, it is important to ensure that the overhead associated with the prefetching
mechanism itself, as well as potential increases in static energy consumption owing to additional storage
requirements, do not outweigh the benefits achieved from enabling more energy-efficient off-chip
accesses [124].
16.5.6 Special Purpose Memory Subsystems for Media Streaming
As alluded to before, streaming media applications have been a preferred application domain for validating
the effectiveness of many novel, power aware memory designs. Although the compiler is consistently
given a more preeminent role in the management of these novel memory subsystems, they require no
fundamental changes to the adopted programming paradigm. Additional opportunities for energy savings
can be unlocked by adopting a programming paradigm that directly exposes those elements of an application
that should be considered by an optimizing compiler during performance versus power trade-off
exploration. The two special purpose memory subsystems discussed below do precisely that, in the context
of streaming media applications.
Xtream-Fit is a special purpose data memory subsystem targeted to generic uni-processor embedded
system platforms executing media applications [124]. Xtream-Fit's on-chip memory consists of a Scratch-Pad,
to hold constants and scalars, and a novel software-controlled streaming memory, partitioned into
regions, each of which holds one of the input or output streams used/produced by the target application.
The use of software-controlled memories by Xtream-Fit ensures that dynamic energy consumption is low,
while the region-based organization of the streaming memory enables the implementation of very simple
and yet effective shutdown policies to turn off different memory regions as the data they hold becomes
dead. Xtream-Fit's programming model is actually quite simple, requiring only a minor reprogramming
effort. It simply requires organizing/partitioning the application code into a small set of processing
and data transfer tasks. Data transfer tasks prefetch streaming media data (the amount required by the
next set of processing tasks) into the streaming memory. The amount of prefetched data is explicitly
exposed via a single customization parameter. By varying this single customization parameter, the compiler
can thus aggressively minimize the energy-delay product, by considering both dynamic and leakage
power dissipated in on-chip and in off-chip memories [124].
While Xtream-Fit provides sufficient memory bandwidth for generic uni-processor embedded media
architectures, it cannot support the very high bandwidth requirements of high-performance media accelerators.
For example, Imagine, the multicluster media accelerator alluded to previously, uses its own specialized
memory hierarchy, consisting of a streaming memory, a 128 KB stream register file, and stream buffers
and register files local to each of its eight clusters. Imagine's memory subsystem delivers a very high bandwidth
(2.1 GB/sec) with very high energy efficiency, yet it requires the use of a specialized programming
paradigm. Namely, data transfers to/from the host are controlled by a stream controller, and between the
stream register file and the functional units by a microcontroller, both of which have to be programmed
separately, using Imagine's own stream-oriented programming style [125].
Systems that demand still higher performance and/or energy efficiency may require memory architectures
fully customized to the target application. Comprehensive methodologies for designing
high-performance memory architectures for custom hardware accelerators are discussed in detail in
References 36 and 126.
16.5.7 Code Compression
Code size affects both program storage requirements and off-chip memory bandwidth requirements,
and can thus have a first-order impact on the overall power consumed by an embedded system. Instruction
compression schemes decrease both such requirements by storing frequently fetched/executed instruction
sequences in main memory (i.e., off-chip) in an encoded/compressed form [127–129]. Naturally,
when one such scheme is adopted, it is important to factor in the overhead incurred by the on-chip
decoding circuitry, so that it does not outweigh the gains achieved on storage and interconnect elements.
Furthermore, different approaches have been considered for storing such selected instruction sequences
on-chip, in either compressed or decompressed form. On-chip storage of instructions in compressed form
saves on-chip storage, yet instructions must be decoded every time they are executed, adding
latency/power overheads.
Instruction subsetting is an alternative instruction compression scheme, in which instructions that are not
commonly used are discarded from the instruction set, thus enabling the reduced instruction set to be
encoded using fewer bits [130]. The Thumb instruction set is a classic example of a compressed instruction
set, featuring the most commonly used 32-bit ARM instructions compressed into a 16-bit-wide format.
Thumb instructions are decompressed transparently into full 32-bit ARM instructions in real time, with
no performance loss.
16.5.8 Interconnect Optimizations
Power dissipation in on- and off-chip interconnect structures is also a significant contributor to an
embedded system's power budget [131]. A shared bus is a commonly used interconnect structure, as
it offers a good trade-off between generality/simplicity and performance. Power consumption on the
bus can be reduced by decreasing its supply voltage, capacitance, and/or switching activity. Bus splitting,
for example, reduces bus capacitance by splitting long bus lines into smaller sections, with one section
relaying the data to the next [132]. Power consumption in this approach is reduced at the expense of
a small penalty in latency, incurred at each relay point. Bus switching activity, and thus dynamic power
dissipation, can also be substantially reduced by using an appropriate bus encoding scheme [133–138].
Bus invert coding [133], for example, is a simple, yet widely used coding scheme. The first step of
bus invert coding is to compute the Hamming distance between the current bus value and the previous
bus value. If this distance is greater than half the total number of bits, the data value is transmitted in
inverted form, with an additional invert bit used to interpret the data at the other end. Several other encoding
schemes have been proposed, achieving lower switching activity at the expense of higher encoding and
decoding complexity [134–138].
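The encode/decode pair for bus invert coding follows directly from this description; bus width is a parameter here (8 bits in the examples):

```python
def bus_invert_encode(value, prev_bus, width):
    """Transmit the value inverted (with the invert bit set) whenever
    more than half of the bus lines would otherwise toggle [133]."""
    mask = (1 << width) - 1
    if bin((value ^ prev_bus) & mask).count("1") > width // 2:
        return (~value) & mask, 1
    return value & mask, 0

def bus_invert_decode(bus_value, invert_bit, width):
    """Recover the original value at the receiving end of the bus."""
    mask = (1 << width) - 1
    return (~bus_value) & mask if invert_bit else bus_value & mask
```

With this scheme, at most width/2 data lines (plus the invert line) ever toggle per transfer, which bounds the worst-case switching activity at roughly half that of the unencoded bus.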
With the increasing adoption of System-on-Chip design methodologies for embedded systems, devising
energy-delay-efficient interconnect architectures for such large-scale systems is becoming increasingly
critical and is still undergoing intensive research [5].
16.6 Summary
Design methodologies for today's embedded systems must necessarily treat power consumption as a
primary figure of merit. At the system and architecture levels of design abstraction, power aware embedded
system design requires the availability of high-fidelity power estimation and simulation frameworks. Such
frameworks are essential to enable designers to explore and evaluate, in reasonable time, the complex
energy-delay trade-offs realized by different candidate architectures, subsystem realizations, and power
management techniques, and thus quickly identify promising solutions for the target application of
interest. The detailed system and architecture level design phases that follow should adequately combine
coarse, system level dynamic power management strategies with fine-grained, self-monitoring techniques,
exploiting voltage and frequency scaling, as well as advanced dynamic resource scaling and power-driven
reconfiguration techniques.
References
[1] G. De Micheli, R. Ernst, and W. Wolf, Eds. Readings in Hardware/Software Co-Design. Morgan
Kaufmann Publishers, Norwell, MA, 2002.
[2] F. Balarin, P. Giusto, A. Jurecska, C. Passerone, E. Sentovich, B. Tabbara, M. Chiodo, H. Hsieh,
L. Lavagno, A.L. Sangiovanni-Vincentelli, and K. Suzuki. Hardware-Software Co-Design of
Embedded Systems: The POLIS Approach. Kluwer Academic Publishers, Dordrecht, 1997.
[3] S. Borkar. Design Challenges of Technology Scaling. IEEE Micro, 19: 2329, 1999.
[4] http://public.itrs.net/
[5] M. Pedram and J.M. Rabaey. Power Aware Design Methodologies. Kluwer Academic Publishers,
Dordrecht, 2002.
[6] R. Gonzalez, B. Gordon, and M. Horowitz. Supply and Threshold Voltage Scaling for Low Power
CMOS. IEEE Journal of Solid-State Circuits, 32(8): 12101216, 1997.
[7] A.P. Chandrakasan, S. Sheng, and R. W. Brodersen. Low-Power CMOS Digital Design. IEEE
Journal of Solid-State Circuits, 27(4): 473484, 1992.
[8] A.A. Jerraya, S. Yoo, N. Wehn, and D. Verkest, Eds. Embedded Software for SoC. Kluwer Academic
Publishers, Dordrecht, 2003.
[9] A.P. Chandrakasan and R.W. Brodersen. Low Power Digital CMOS Design. Kluwer Academic
Publishers, Dordrecht, 1995.
[10] T.L. Martin, D.P. Siewiorek, A. Smailagic, M. Bosworth, M. Ettus, and J. Warren. A Case Study of a
System-Level Approach to Power-Aware Computing. ACM Transactions on Embedded Computing
Systems, Special Issue on Power-Aware Embedded Computing, 2(3): 255276, 2003.
[11] A.S. Vincentelli and G. Martin. A Vision for Embedded Systems: Platform-Based Design and
Software Methodology. IEEE Design and Test of Computers, 18(6): 2333, 2001.
[12] J.M. Rabaey and A.S. Vincentelli. System-on-a-Chip A Platform Perspective. In Keynote
Presentation, Korean Semiconductor Conference, 2002. Available at http://bwrc.eecs.berkeley.edu/
People/Faculty/jan/presentations/platformdesign.pdf
[13] J.T. Buck, S. Ha, E.A. Lee, and D.G. Messerschmitt. Ptolemy: A Framework for Simulating and
Prototyping Heterogeneous Systems. International Journal of Computer Simulation, Special Issue
on Simulation Software Development, 4: 155–182, 1994.
[14] V. Tiwari, S. Malik, and A. Wolfe. Power Analysis of Embedded Software: A First Step Towards
Software Power Minimization. IEEE Transactions on Very Large Scale Integration Systems, 2(4):
437–445, 1994.
[15] P.M. Chau and S.R. Powell. Power Dissipation of VLSI Array Processing Systems. Journal of VLSI
Signal Processing, 4(2–3): 199–212, 1992.
[16] J. Russell and M. Jacome. Software Power Estimation and Optimization for High-Performance
32-bit Embedded Processors. In Proceedings of the International Conference on Computer Design,
1998, pp. 328–333.
[17] C. Brandolese, W. Fornaciari, F. Salice, and D. Sciuto. An Instruction-Level Functionality-Based
Energy Estimation Model for 32-bits Microprocessors. In Proceedings of the Design Automation
Conference, 2000, pp. 346–351.
[18] G. Qu, N. Kawabe, K. Usami, and M. Potkonjak. Function-Level Power Estimation Methodology
for Microprocessors. In Proceedings of the Design Automation Conference, 2000, pp. 810–813.
[19] D.C. Burger and T.M. Austin. The SimpleScalar Tool Set, Version 2.0. Computer Architecture
News, 25(3): 13–25, 1997.
[20] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A Framework for Architectural Level Power
Analysis and Optimizations. In Proceedings of the International Symposium on Computer
Architecture, 2000, pp. 83–94.
[21] G. Cai and C.H. Lim. Architectural Level Power/Performance Optimization and Dynamic Power
Estimation. In Cool Chips Tutorial, International Symposium on Microarchitecture, 1999.
[22] W. Ye, N. Vijaykrishnan, M. Kandemir, and M.J. Irwin. The Design and Use of SimplePower:
A Cycle-Accurate Energy Estimation Tool. In Proceedings of the Design Automation Conference,
2000, pp. 340–345.
[23] G. Jochens, L. Kruse, E. Schmidt, and W. Nebel. A New Parameterizable Power Macro-Model for
Datapath Components. In Proceedings of the Design Automation and Test in Europe, 1999.
[24] A. Bogliolo, L. Benini, and G.D. Micheli. Regression-Based RTL Power Modeling. ACM
Transactions on Design Automation of Electronic Systems, 5(3): 337–372, 2000.
[25] S.A. Theoharis, C.E. Goutis, G. Theodoridis, and D. Soudris. Accurate Data Path Models for
RT-Level Power Estimation. In Proceedings of the International Workshop on Power and Timing
Modeling, Optimization and Simulation, 1998, pp. 213–222.
[26] M. Khellah and M.I. Elmasry. Effective Capacitance Macro-Modelling for Architectural-Level
Power Estimation. In Proceedings of the Eighth Great Lakes Symposium on VLSI, 1998, pp. 414–419.
[27] Z. Chen, K. Roy, and E.K. Chong. Estimation of Power Dissipation Using a Novel Power
Macromodeling Technique. IEEE Transactions on Computer Aided Design of Integrated Circuits
and Systems, 19(11): 1363–1369, 2000.
[28] R. Melhem and R. Graybill, Eds. Challenges for Architectural Level Power Modeling. In Power
Aware Computing. Kluwer Academic Publishers, Dordrecht, 2001.
[29] J.A. Butts and G.S. Sohi. A Static Power Model for Architects. In Proceedings of the International
Symposium on Microarchitecture, 2000, pp. 191–201.
[30] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan. HotLeakage: A Temperature-
Aware Model of Subthreshold and Gate Leakage for Architects. Technical report, Department of
Computer Science, University of Virginia, 2003.
[31] D. Patterson, T. Anderson, N. Cardwell, R. Fromm, K. Keeton, C. Kozyrakis, R. Thomas, and
K. Yelick. A Case for Intelligent RAM. IEEE Micro, 17(2): 33–44, 1997.
[32] S.J.E. Wilton and N.M. Jouppi. CACTI: An Enhanced Cache Access and Cycle Time Model.
Technical report, Digital Equipment Corporation, Western Research Lab, 1996.
[33] M. Kamble and K. Ghose. Analytical Energy Dissipation Models for Low Power Caches.
In Proceedings of the International Symposium on Low Power Electronics and Design, 1997,
pp. 143–148.
Power Aware Embedded Computing 16-19
[34] G. Reinman and N.M. Jouppi. CACTI 2.0: An Integrated Cache Timing and Power Model.
Technical report, Compaq Computer Corporation, Western Research Lab, 2001.
[35] J. Edler and M.D. Hill. Dinero IV Trace-Driven Uniprocessor Cache Simulator, 1998.
http://www.cs.wisc.edu/~markhill/DineroIV/
[36] F. Catthoor, S. Wuytack, E. DeGreef, F. Balasa, L. Nachtergaele, and A. Vandecappelle.
Custom Memory Management Methodology: Exploration of Memory Organization for Embedded
Multimedia System Design. Kluwer Academic Publishers, Dordrecht, 1998.
[37] T. Martin and D. Siewiorek. A Power Metric for Mobile Systems. In International Symposium on
Low Power Electronics and Design, 1996, pp. 37–42.
[38] M. Pedram and Q. Wu. Battery-Powered Digital CMOS Design. IEEE Transactions on Very Large
Scale Integration Systems, 10: 601–607, 2002.
[39] P. Rong and M. Pedram. An Analytical Model for Predicting the Remaining Battery Capacity
of Lithium-Ion Batteries. In Proceedings of the Design Automation and Test in Europe, 2003,
pp. 11148–11149.
[40] M. Srivastava, A. Chandrakasan, and R. Brodersen. Predictive System Shutdown and Other Archi-
tectural Techniques for Energy Efficient Programmable Computation. IEEE Transactions on Very
Large Scale Integration Systems, 4(1): 42–55, 1996.
[41] Q. Qiu and M. Pedram. Dynamic Power Management Based on Continuous-Time Markov
Decision Processes. In Proceedings of the Design Automation Conference, 1999, pp. 555–561.
[42] T. Simunic, L. Benini, P. Glynn, and G. De Micheli. Dynamic Power Management of Portable
Systems. In Proceedings of the International Conference on Mobile Computing and Networking,
2000, pp. 11–19.
[43] J. Liu, P. Chou, N. Bagherzadeh, and F. Kurdahi. A Constraint-Based Application Model and
Scheduling Techniques for Power-Aware Systems. In Proceedings of the International Conference
on Hardware/Software Codesign, 2001, pp. 153–158.
[44] http://www.acpi.info/
[45] T. Simunic, L. Benini, A. Acquaviva, P. Glynn, and G. De Micheli. Dynamic Voltage Scaling for
Portable Systems. In Proceedings of the Design Automation Conference, 2001, pp. 524–529.
[46] R. Gonzalez and M. Horowitz. Energy Dissipation in General Purpose Microprocessors. IEEE
Journal of Solid-State Circuits, 31(9): 1277–1284, 1996.
[47] T.D. Burd and R.W. Brodersen. Processor Design for Portable Systems. Journal of VLSI Signal
Processing, 13(2–3): 203–221, 1996.
[48] T. Ishihara and H. Yasuura. Voltage Scheduling Problem for Dynamically Variable Voltage
Processors. In Proceedings of the International Symposium on Low Power Electronics and
Design, 1998, pp. 197–202.
[49] http://www.intel.com/
[50] http://www.ibm.com/
[51] http://www.transmeta.com/
[52] M. Weiser, B. Welch, A.J. Demers, and S. Shenker. Scheduling for Reduced CPU Energy.
In Proceedings of the Symposium on Operating Systems Design and Implementation, 1994,
pp. 13–23.
[53] K. Govil, E. Chan, and H. Wasserman. Comparing Algorithms for Dynamic Speed-Setting of
a Low-Power CPU. In Proceedings of the International Conference on Mobile Computing and
Networking, 1995, pp. 13–25.
[54] T. Pering, T. Burd, and R. Brodersen. The Simulation and Evaluation of Dynamic Voltage Scal-
ing Algorithms. In Proceedings of the International Symposium on Low Power Electronics and
Design, 1998, pp. 76–81.
[55] T. Pering, T. Burd, and R. Brodersen. Voltage Scheduling in the lpARM Microprocessor
System. In Proceedings of the International Symposium on Low Power Electronics and Design,
2000, pp. 96–101.
[56] K. Flautner, S. Reinhardt, and T. Mudge. Automatic Performance Setting for Dynamic Voltage
Scaling. ACM Journal of Wireless Networks, 8(5): 507–520, 2002.
[57] D. Brooks and M. Martonosi. Value-Based Clock Gating and Operation Packing: Dynamic
Strategies for Improving Processor Power and Performance. ACM Transactions on Computer
Systems, 18(2): 89–126, 2000.
[58] S. Dropsho, V. Kursun, D.H. Albonesi, S. Dwarkadas, and E.G. Friedman. Managing Static Leakage
Energy in Microprocessor Functional Units. In Proceedings of the International Symposium on
Microarchitecture, 2002, pp. 321–332.
[59] D. Ponomarev, G. Kucuk, and K. Ghose. Reducing Power Requirements of Instruction Scheduling
Through Dynamic Allocation of Multiple Datapath Resources. In Proceedings of the International
Symposium on Microarchitecture, 2001, pp. 90–101.
[60] A. Buyuktosunoglu, D. Albonesi, P. Bose, P. Cook, and S. Schuster. Tradeoffs in Power-Efficient
Issue Queue Design. In Proceedings of the International Symposium on Low Power Electronics and
Design, 2002, pp. 184–189.
[61] C.J. Hughes, J. Srinivasan, and S.V. Adve. Saving Energy with Architectural and Frequency
Adaptations for Multimedia Applications. In Proceedings of the International Symposium on
Microarchitecture, 2001, pp. 250–261.
[62] M.F. Jacome and G. de Veciana. Design Challenges for New Application Specific Processors. IEEE
Design and Test of Computers, Special Issue on System Design of Embedded Systems, 17(2): 50–60,
2000.
[63] R.P. Colwell, R.P. Nix, J.J. O'Donnell, D.B. Papworth, and P.K. Rodman. A VLIW
Architecture for a Trace Scheduling Compiler. IEEE Transactions on Computers, 37(8): 967–979,
1988.
[64] G.R. Beck, D.W.L. Yen, and T.L. Anderson. The Cydra 5 Mini-Supercomputer: Architecture and
Implementation. The Journal of Supercomputing, 7(1/2): 143–180, 1993.
[65] M.S. Schlansker and B.R. Rau. EPIC: An Architecture for Instruction-Level Parallel Processors.
Technical report HPL-99-111, Hewlett-Packard Laboratories, 2000.
[66] W.W. Hwu, R.E. Hank, D.M. Gallagher, S.A. Mahlke, D.M. Lavery, G.E. Haab, J.C. Gyllenhaal,
and D.I. August. Compiler Technology for Future Microprocessors. Proceedings of the IEEE,
83(12): 1625–1640, 1995.
[67] J.R. Ellis. Bulldog: A Compiler for VLIW Architectures. MIT Press, Cambridge, MA, 1985.
[68] J. Dehnert and R. Towle. Compiling for the Cydra-5. Journal of Supercomputing, 7(1/2): 181–227,
1993.
[69] C. Dulong, R. Krishnaiyer, D. Kulkarni, D. Lavery, W. Li, J. Ng, and D. Sehr. An Overview of the
Intel IA-64 Compiler. Intel Technology Journal, Q4, 1999, pp. 1–15.
[70] M.F. Jacome, G. de Veciana, and V. Lapinskii. Exploring Performance Tradeoffs for Clustered
VLIW ASIPs. In Proceedings of the International Conference on Computer-Aided Design, 2000,
pp. 504–510.
[71] V. Lapinskii, M.F. Jacome, and G. de Veciana. Application-Specific Clustered VLIW Datapaths:
Early Exploration on a Parameterized Design Space. IEEE Transactions on Computer Aided Design
of Integrated Circuits and Systems, 21(8): 889–903, 2002.
[72] S. Pillai and M.F. Jacome. Compiler-Directed ILP Extraction for Clustered VLIW/EPIC Machines:
Predication, Speculation and Modulo Scheduling. In Proceedings of the Design Automation and
Test in Europe, 2003, p. 10422.
[73] P. Marwedel and G. Goossens, Eds. Code Generation for Embedded Processors. Kluwer Academic
Publishers, Dordrecht, 1995.
[74] C. Liem. Retargetable Compilers for Embedded Core Processors. Kluwer Academic Publishers,
Dordrecht, 1997.
[75] J. Fritts, W. Wolf, and B. Liu. Understanding Multimedia Application Characteristics for
Designing Programmable Media Processors. In SPIE Photonics West, Media Processors, 1999,
pp. 2–13.
[76] B. Khailany, W.J. Dally, S. Rixner, U.J. Kapasi, P. Mattson, J. Namkoong, J.D. Owens, B. Towles,
and A. Chang. Imagine: Media Processing with Streams. IEEE Micro, 21: 35–46, 2001.
[77] J. Rabaey, H. De Man, J. Vanhoof, G. Goossens, and F. Catthoor. CATHEDRAL-II: A Syn-
thesis System for Multiprocessor DSP Systems. In Silicon Compilation. Addison-Wesley, Reading,
MA, 1987.
[78] J. Montanaro, R.T. Witek, K. Anne, A.J. Black, E.M. Cooper, D.W. Dobberpuhl, P.M. Donahue,
J. Eno, A. Farell, G.W. Hoeppner, D. Kruckemyer, T.H. Lee, P. Lin, L. Madden,
D. Murray, M. Pearce, S. Santhanam, K.J. Snyder, R. Stephany, and S.C. Thierauf.
A 160 MHz 32b 0.5 W CMOS RISC Microprocessor. In Proceedings of the Interna-
tional Solid-State Circuits Conference, Digest of Technical Papers, 31(11): 1703–1714,
1996.
[79] P. Hicks, M. Walnock, and R.M. Owens. Analysis of Power Consumption in Memory Hierarch-
ies. In Proceedings of the International Symposium on Low Power Electronics and Design, 1997,
pp. 239–242.
[80] K. Ghose and M.B. Kamble. Reducing Power in Superscalar Processor Caches Using Subbanking,
Multiple Line Buffers and Bit-Line Segmentation. In Proceedings of the International Symposium
on Low Power Electronics and Design, 1999, pp. 70–75.
[81] C.-L. Su and A.M. Despain. Cache Design Trade-Offs for Power and Performance Optimization:
A Case Study. In Proceedings of the International Symposium on Low Power Electronics and Design,
1995, pp. 63–68.
[82] J. Kin, M. Gupta, and W.H. Mangione-Smith. Filtering Memory References to Increase Energy
Efficiency. IEEE Transactions on Computers, 49(1): 1–15, 2000.
[83] A.H. Farrahi, G.E. Téllez, and M. Sarrafzadeh. Memory Segmentation to Exploit Sleep Mode
Operation. In Proceedings of the Design Automation Conference, 1995, pp. 36–41.
[84] J. Kin, M. Gupta, and W.H. Mangione-Smith. The Filter Cache: An Energy Efficient
Memory Structure. In Proceedings of the International Symposium on Microarchitecture, 1997,
pp. 184–193.
[85] R.S. Bajwa, M. Hiraki, H. Kojima, D.J. Gorny, K. Nitta, A. Shridhar, K. Seki, and K. Sasaki.
Instruction Buffering to Reduce Power in Processors for Signal Processing. IEEE Transactions on
Very Large Scale Integration Systems, 5(4): 417–424, 1997.
[86] L. Lee, B. Moyer, and J. Arends. Instruction Fetch Energy Reduction Using Loop Caches for
Embedded Applications with Small Tight Loops. In Proceedings of the International Symposium
on Low Power Electronics and Design, 1999, pp. 267–269.
[87] H.-H. Lee and G. Tyson. Region-Based Caching: An Energy-Delay Efficient Memory Archi-
tecture for Embedded Processors. In Proceedings of the International Conference on Compilers,
Architectures and Synthesis for Embedded Systems, 2000, pp. 120–127.
[88] D.A. Wood, M.D. Hill, and R.E. Kessler. A Model for Estimating Trace-Sample Miss Ratios. In Pro-
ceedings of the SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1991,
pp. 79–89.
[89] D.C. Burger, J.R. Goodman, and A. Kagi. The Declining Effectiveness of Dynamic Caching
for General-Purpose Microprocessors. University of Wisconsin-Madison Computer Sciences
Technical report 1261, 1995.
[90] S. Kaxiras, Z. Hu, and M. Martonosi. Cache Decay: Exploiting Generational Behavior to
Reduce Cache Leakage Power. In Proceedings of the International Symposium on Computer
Architecture, 2001, pp. 240–251.
[91] H. Zhou, M.C. Toburen, E. Rotenberg, and T.M. Conte. Adaptive Mode Control: A Static-Power-
Efficient Cache Design. In Proceedings of the International Conference on Parallel Architectures and
Compilation Techniques, 2001, pp. 61–72.
[92] D. Folegnani and A. Gonzalez. Energy-Effective Issue Logic. In Proceedings of the International
Symposium on Computer Architecture, 2001, pp. 230–239.
[93] Z. Hu, P. Juang, K. Skadron, D. Clark, and M. Martonosi. Applying Decay Strategies to Branch
Predictors for Leakage Energy Savings. In Proceedings of the International Conference on Computer
Design, 2002, pp. 442–445.
[94] S.-H. Yang, M.D. Powell, B. Falsafi, K. Roy, and T.N. Vijaykumar. An Integrated Circuit/
Architecture Approach to Reducing Leakage in Deep-Submicron High-Performance I-Caches.
In Proceedings of the High-Performance Computer Architecture, 2001, pp. 147–158.
[95] P.R. Panda, N.D. Dutt, and A. Nicolau. Efficient Utilization of Scratch-Pad Memory in Embedded
Processor Applications. In Proceedings of the European Design and Test Conference, 1997,
pp. 7–11.
[96] D. Chiou, P. Jain, S. Devadas, and L. Rudolph. Application-Specific Memory Management for
Embedded Systems Using Software-Controlled Caches. In Proceedings of the Design Automation
Conference, 2000, pp. 416–419.
[97] L. Benini, A. Macii, and M. Poncino. A Recursive Algorithm for Low-Power Memory Partition-
ing. In Proceedings of the International Symposium on Low Power Electronics and Design, 2000,
pp. 78–83.
[98] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel. Scratchpad Memory:
A Design Alternative for Cache On-Chip Memory in Embedded Systems. In Proceedings of the
International Workshop on Hardware/Software Codesign, 2002, pp. 73–78.
[99] M. Kandemir, J. Ramanujam, M. Irwin, N. Vijaykrishnan, I. Kadayif, and A. Parikh.
Dynamic Management of Scratch-Pad Memory Space. In Proceedings of the Design Automation
Conference, 2001, pp. 690–695.
[100] O.S. Unsal, Z. Wang, I. Koren, C.M. Krishna, and C.A. Moritz. On Memory Behavior of Scalars
in Embedded Multimedia Systems. In Proceedings of the Workshop on Memory Performance Issues,
Göteborg, Sweden, 2001.
[101] P.R. Panda, F. Catthoor, N.D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandercappelle,
and P.G. Kjeldsberg. Data and Memory Optimization Techniques for Embedded Systems. ACM
Transactions on Design Automation of Electronic Systems, 6(2): 149–206, 2001.
[102] J. Sjödin, B. Fröderberg, and T. Lindgren. Allocation of Global Data Objects in On-Chip RAM.
In Proceedings of the Workshop on Compiler and Architectural Support for Embedded Computer
Systems, Washington DC, USA, 1998.
[103] T. Ishihara and H. Yasuura. A Power Reduction Technique with Object Code Merging for Applic-
ation Specific Embedded Processors. In Proceedings of the Design, Automation and Test in Europe,
2000, pp. 617–623.
[104] S. Steinke, L. Wehmeyer, B.-S. Lee, and P. Marwedel. Assigning Program and Data
Objects to Scratchpad for Energy Reduction. In Proceedings of the Design Automation and Test in
Europe, 2002, pp. 409–417.
[105] O.S. Unsal, R. Ashok, I. Koren, C.M. Krishna, and C.A. Moritz. Cool Cache: A Compiler-Enabled
Energy Efficient Data Caching Framework for Embedded/Multimedia Processors. ACM Transac-
tions on Embedded Computing Systems, Special Issue on Power-Aware Embedded Computing, 2(3):
373–392, 2003.
[106] http://www.ti.com/
[107] http://www.trimedia.com/
[108] M.E. Wolf and M. Lam. A Data Locality Optimizing Algorithm. In Proceedings of the Conference
on Programming Language Design and Implementation, 1991, pp. 30–44.
[109] S. Carr, K.S. McKinley, and C. Tseng. Compiler Optimizations for Improving Data Locality.
In Proceedings of the International Conference on Architectural Support for Programming Languages
and Operating Systems, 1994, pp. 252–262.
[110] S. Coleman and K.S. McKinley. Tile Size Selection Using Cache Organization and Data Layout.
In Proceedings of the Conference on Programming Language Design and Implementation, 1995.
[111] M.J. Wolfe. High Performance Compilers for Parallel Computing. Addison-Wesley Publishers,
Reading, MA, 1995, pp. 279–290.
[112] M. Kandemir, J. Ramanujam, A. Choudhary, and P. Banerjee. A Layout-Conscious Iteration Space
Transformation Technique. IEEE Transactions on Computers, 50(12): 1321–1335, 2001.
[113] N.P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-
Associative Cache and Prefetch Buffers. In Proceedings of the International Symposium on
Computer Architecture, 1990, pp. 364–373.
[114] T.F. Chen and J.L. Baer. Effective Hardware-Based Data Prefetching for High Performance
Processors. IEEE Transactions on Computers, 44(5): 609–623, 1995.
[115] J.W.C. Fu, J.H. Patel, and B.L. Janssens. Stride Directed Prefetching in Scalar Processors. In
Proceedings of the International Symposium on Microarchitecture, 1992, pp. 102–110.
[116] S.S. Pinter and A. Yoaz. A Hardware-Based Data Prefetching Technique for Superscalar Processors.
In Proceedings of the International Symposium on Microarchitecture, 1996, pp. 214–225.
[117] D. Callahan, K. Kennedy, and A. Porterfield. Software Prefetching. In Proceedings of the
International Conference on Architectural Support for Programming Languages and Operating
Systems, 1991, pp. 40–52.
[118] A.C. Klaiber and H.M. Levy. An Architecture for Software Controlled Data Prefetching. In
Proceedings of the International Symposium on Computer Architecture, 1991, pp. 43–53.
[119] T.C. Mowry, M.S. Lam, and A. Gupta. Design and Evaluation of a Compiler Algorithm
for Prefetching. In Proceedings of the International Conference on Architectural Support for
Programming Languages and Operating Systems, 1992, pp. 62–73.
[120] D.F. Zucker, R.B. Lee, and M.J. Flynn. Hardware and Software Cache Prefetching Techniques
for MPEG Benchmarks. IEEE Transactions on Circuits and Systems for Video Technology, 10(5):
782–796, 2000.
[121] Y. Choi and T. Kim. Memory Layout Technique for Variables Utilizing Efficient DRAM Access
Modes in Embedded System Design. In Proceedings of the Design Automation Conference, 2003,
pp. 881–886.
[122] X. Fan, C.S. Ellis, and A.R. Lebeck. Memory Controller Policies for DRAM Power Manage-
ment. In Proceedings of the International Symposium on Low Power Electronics and Design, 2001,
pp. 129–134.
[123] V. Delaluz, A. Sivasubramaniam, M. Kandemir, N. Vijaykrishnan, and M.J. Irwin. Scheduler-
Based DRAM Energy Management. In Proceedings of the Design Automation Conference, 2002,
pp. 697–702.
[124] A. Ramachandran and M. Jacome. Xtream-Fit: An Energy-Delay Efficient Data Memory Subsys-
tem for Embedded Media Processing. In Proceedings of the Design Automation Conference, 2003,
pp. 137–142.
[125] P. Mattson. A Programming System for the Imagine Media Processor. PhD thesis, Stanford
University, 2001.
[126] P. Grun, N. Dutt, and A. Nicolau. Memory Architecture Exploration for Programmable Embedded
Systems. Kluwer Academic Publishers, Dordrecht, 2003.
[127] A. Wolfe and A. Chanin. Executing Compressed Programs on an Embedded RISC Architecture.
In Proceedings of the International Symposium on Microarchitecture, 1992, pp. 81–91.
[128] C. Lefurgy, P. Bird, I-C. Chen, and T. Mudge. Improving Code Density Using Compres-
sion Techniques. In Proceedings of the International Symposium on Microarchitecture, 1997,
pp. 194–203.
[129] H. Lekatsas and W. Wolf. SAMC: A Code Compression Algorithm for Embedded Processors. IEEE
Transactions on Computer Aided Design of Integrated Circuits and Systems, 18(12): 1689–1701,
1999.
[130] W.E. Dougherty, D.J. Pursley, and D.E. Thomas. Instruction Subsetting: Trading Power
for Programmability. In Proceedings of the International Workshop on Hardware/Software
Codesign, 1998.
[131] D. Sylvester and K. Keutzer. A Global Wiring Paradigm for Deep Submicron Design. IEEE
Transactions on Computer Aided Design of Integrated Circuits and Systems, 19(2): 242–252, 2000.
[132] C.-T. Hsieh and M. Pedram. Architectural Power Optimization by Bus Splitting. In Proceedings
of the Conference on Design, Automation and Test in Europe, 2000, pp. 612–616.
[133] M.R. Stan and W.P. Burleson. Bus-Invert Coding for Low-Power I/O. IEEE Transactions on Very
Large Scale Integration Systems, 3(1): 49–58, 1995.
[134] H. Mehta, R.M. Owens, and M.J. Irwin. Some Issues in Gray Code Addressing. In Proceedings of
the Sixth Great Lakes Symposium on VLSI, 1996, pp. 178–181.
[135] L. Benini, G. De Micheli, E. Macii, M. Poncino, and S. Quez. System-Level Power Optimiza-
tion of Special Purpose Applications: The Beach Solution. In Proceedings of the International
Symposium on Low Power Electronics and Design, 1997, pp. 24–29.
[136] L. Benini, G. De Micheli, E. Macii, D. Sciuto, and C. Silvano. Address Bus Encoding Techniques
for System-Level Power Optimization. In Proceedings of the Design, Automation and Test in
Europe, 1998, pp. 861–867.
[137] P.R. Panda and N.D. Dutt. Low-Power Memory Mapping Through Reducing Address Bus Activity.
IEEE Transactions on Very Large Scale Integration Systems, 7(3): 309–320, 1999.
[138] N. Chang, K. Kim, and J. Cho. Bus Encoding for Low-Power High-Performance Memory Systems.
In Proceedings of the Design Automation Conference, 2000, pp. 800–805.
Security in Embedded
Systems
17 Design Issues in Secure Embedded Systems
A.G. Voyiatzis, A.G. Fragopoulos, and D.N. Serpanos
17
Design Issues in
Secure Embedded
Systems
A.G. Voyiatzis,
A.G. Fragopoulos, and
D.N. Serpanos
University of Patras
17.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-1
17.2 Security Parameters . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-2
Abilities of Attackers • Security Implementation Levels •
Implementation Technology and Operational Environment
17.3 Security Constraints in Embedded Systems Design . . . 17-4
Energy Considerations • Processing Power Limitations •
Flexibility and Availability Requirements • Cost of Implementation
17.4 Design of Secure Embedded Systems. . . . . . . . . . . . . . . . . . . 17-7
System Design Issues • Application Design Issues
17.5 Cryptography and Embedded Systems . . . . . . . . . . . . . . . . . 17-10
Physical Security • Side-Channel Cryptanalysis • Side-
Channel Implementations • Fault-Based Cryptanalysis •
Passive Side-Channel Cryptanalysis Countermeasures
17.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-20
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17-20
17.1 Introduction
A computing system is typically considered an embedded system when it is a programmable device with
limited resources (energy, memory, computation power, etc.) that serves one or a few applications and is
embedded in a larger system. Their limited resources make embedded systems ineffective as general-purpose
computing systems. However, they usually have to meet hard requirements, such as time deadlines and
other real-time processing requirements.
Embedded systems can be classified in two general categories: (1) standalone embedded systems, where
all hardware and software components of the system are physically close and incorporated into a single
device, for example, a Personal Digital Assistant (PDA) or a system in a washing machine or a fax, and
there is no attachment to a network, and (2) distributed (networked) embedded systems, where several
autonomous components (each one a standalone embedded system) communicate with each other
over a network in order to deliver services or support an application. Several architectural and design
parameters have led to the development of distributed embedded applications, such as the placement of
processing power at the physical point where an event takes place, data reduction, etc. [1].
The increasing capabilities of embedded systems combined with their decreasing cost have enabled their
adoption in a wide range of applications and services, from financial and personalized entertainment
services to automotive and military applications in the field. Importantly, in addition to the typical
requirements for responsiveness, reliability, availability, robustness, and extensibility, many conven-
tional embedded systems and applications have significant security requirements. However, security
is a resource-demanding function that needs special attention in embedded computing. Furthermore, the
wide deployment of small devices in critical applications has triggered the development
of new, strong attacks that exploit more systemic characteristics, in contrast to traditional attacks that
focused on algorithmic characteristics owing to the inability of attackers to experiment with the physical
devices used in secure applications. Thus, the design of secure embedded systems requires special attention.
In this chapter we provide an overview of security issues in embedded systems. Section 17.2 presents
the parameters of security systems, while Section 17.3 describes the effect of security in the resource-
constrained environment of embedded systems. Section 17.4 presents the main issues in the design of
secure embedded systems. Finally, Section 17.5 covers in detail attacks on, and countermeasures for, crypto-
graphic algorithm implementations in embedded systems, considering the critical role of cryptography
and the novel systemic attacks developed due to the wide availability of embedded computing systems.
17.2 Security Parameters
Security is a generic term used to indicate several different requirements in computing systems. Depending
on the system and its use, several security properties may be satisfied in each system and in each operational
environment. Overall, secure systems need to meet all or a subset of the following requirements [2,3]:
1. Confidentiality. Data stored in the system or transmitted from the system have to be protected from
disclosure; this is usually achieved through data encryption.
2. Integrity. A mechanism to ensure that data received in a data communication was indeed the data
transmitted.
3. Nonrepudiation. A mechanism to ensure that all entities (systems or applications) participating
in a transaction cannot deny their actions in the transaction.
4. Availability. The system's ability to perform its primary functions and serve its legitimate users
without any disruption, under all conditions, including possible malicious attacks that aim to disrupt
service, such as the well-known Denial of Service (DoS) attacks.
5. Authentication. The ability of the receiver of a message to identify the message sender.
6. Access control. The ability to ensure that only legal users may take part in a transaction and have access
to system resources. To be effective, access control is typically used in conjunction with authentication.
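As a minimal illustration of the integrity and authentication requirements listed above, a receiver can check that a message was neither altered in transit nor forged by a party without the shared key, using a message authentication code. The sketch below uses Python's standard hmac module; the function names and the hard-coded key handling are simplifications for illustration, not a recommended key-management scheme:

```python
import hashlib
import hmac


def tag_message(key: bytes, message: bytes) -> bytes:
    # Sender: attach an HMAC-SHA256 tag computed over the payload.
    return hmac.new(key, message, hashlib.sha256).digest()


def verify_message(key: bytes, message: bytes, tag: bytes) -> bool:
    # Receiver: recompute the tag and compare in constant time,
    # accepting only unmodified messages from a holder of the key.
    return hmac.compare_digest(tag_message(key, message), tag)
```

Note that a MAC addresses integrity and authentication only: confidentiality would still require encryption, and nonrepudiation would require a digital signature, since either party holding the shared MAC key could have produced the tag.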
These requirements are placed by different parties involved in the development and use of computing
systems, for example, vendors, application providers, and users. For example, vendors need to ensure the
protection of their Intellectual Property (IP) that is embedded in the system, while end users want to be
certain that the systemwill provide secure user identication (only authorized users may access the system
and its applications, even if the system gets in the hands of malicious users) and will have high availability,
that is, the system will be available under all circumstances; also, content providers are concerned for
the protection of their IP, for example, that the data delivered through an application are not copied.
Ravi et al. [3,4] have identified the parties participating in system and application development and use,
as well as their security requirements. This classification enables us to identify several possible malicious
users, depending on a party's view; for example, for the hardware manufacturer, even a legal end user
of a portable device (e.g., a PDA or a mobile phone) can be a possible malicious user.
Considering the security requirements and the interested parties above, the design of a secure system
requires identification and definition of the following parameters: (1) the abilities of the attackers, (2) the
level at which security should be implemented, and (3) the implementation technology and operational
environment.
2006 by Taylor & Francis Group, LLC
Design Issues in Secure Embedded Systems 17-3
17.2.1 Abilities of Attackers
Malicious users can be classified in several categories, depending on their knowledge, equipment, etc.
Abraham et al. [5] propose a classification into three categories, based on their knowledge, their
hardware and software equipment, and their funds:
1. Class I: clever outsiders. Very intelligent attackers, not well funded and with no sophisticated equipment.
They do not have specific knowledge of the attacked system; basically, they try to exploit
hardware vulnerabilities and software glitches.
2. Class II: knowledgeable insiders. Attackers with an outstanding technical background and education,
using highly sophisticated equipment and, often, with inside information about the system under attack;
such attackers include former employees who participated in the development cycle of the system.
3. Class III: funded organizations. Attackers who mostly work in teams and have excellent
technical skills and theoretical background. They are well funded, have access to very advanced tools,
and have the capability to analyze the system technically and theoretically, developing highly
sophisticated attacks. Such organizations could be well-organized education foundations, government
institutions, etc.
17.2.2 Security Implementation Levels
Security can be implemented at various system levels, ranging from protection of the physical system itself
to application and network security. Clearly, different mechanisms and implementation technologies
have to be used to implement security at different levels. In general, four levels of security are considered:
(1) physical, (2) hardware, (3) software, and (4) network and protocol security.
Physical security mechanisms aim to protect systems from unauthorized physical access to the system
itself. Protecting systems physically ensures data privacy and data and application integrity. According to
US Federal Standard 1027, physical security mechanisms are considered successful when they ensure that
a possible attack will have a low probability of success and a high probability of tracing the malicious attacker,
in reasonable time. The wide adoption of embedded computing systems in a variety of devices, such as
smartcards, mobile devices, and sensor networks, as well as the ability to network them, for example,
through the Internet or VPNs, has led to revision and reconsideration of physical security. Weingart [6]
surveys possible attacks and countermeasures concerning physical security issues, concluding that physical
security needs continuous improvement and revision in order to remain at the leading edge.
Hardware security may be considered a subset of physical security, referring to security issues concerning
the hardware parts of a computer system. Hardware-level attacks exploit circuit and technological
vulnerabilities and take advantage of possible hardware defects. These attacks do not necessarily require
very sophisticated and expensive equipment. Anderson and Kuhn [7] describe several ways to attack
smartcards and microcontrollers, through the use of unusual voltages and temperatures that affect the
behavior of specific hardware parts, or through microprobing a smartcard chip, such as the Subscriber
Identity Module (SIM) chip found in cellular phones. Reverse engineering attack techniques are equally
successful, as Blythe et al. [8] reported for a wide range of microprocessors. Their work concluded
that special hardware protection mechanisms are necessary to avoid these types of attacks; such
mechanisms include silicon coatings of the chip, increased complexity in the chip layout, etc.
One of the major goals in the design of secure systems is the development of secure software, which is
free of flaws and security vulnerabilities that may appear under certain conditions. Numerous software
security flaws have been identified in real systems, for example, by Landwehr et al. [9], and there have been
several cases where malicious intruders hacked into systems through exploitation of software defects [10].
Some methods for the prevention of such problems have been proposed by Tevis and Hamilton [11].
The use of the Internet, which is an unsafe interconnection for information transfer, as a backbone
network for communicating entities, and the wide deployment of wireless networks, demonstrate that
improvements have to be made in existing protocol architectures in order to provide new, secure
protocols [12,13]. Such protocols will ensure authentication between communicating entities, integrity of
communicated data, protection of the communicating parties, and nonrepudiation (the inability of an
entity to deny its participation in a communication transaction). Furthermore, special attention has to
be paid to the design of secure protocols for embedded systems, due to their physical constraints, that is,
limited battery power and limited processing and memory resources, as well as their cost and communication
requirements.
17.2.3 Implementation Technology and Operational Environment
In regard to implementation technology, systems can be classified by static versus programmable technology
and fixed versus extensible architecture. When static technology is used, the hardware-implemented
functions are fixed and inflexible, but they offer higher performance and can reduce cost. However, static
systems can be more vulnerable to attacks because, once a flaw is identified (for example, in the design
of the system), it is impossible to patch already deployed systems, especially in the case of large installations,
such as SIM cards for cellular telephony or pay-per-view TV. Static systems should be implemented
only once and correctly, which is an unattainable expectation in computing.
In contrast, programmable systems are not limited as static ones are, but they can prove flexible in the
hands of an attacker as well; system flexibility may allow an attacker to manipulate the system in ways not
expected or defined by the designer. Programmability is typically achieved through the use of specialized
software over a general-purpose processor or hardware.
Fixed architectures are composed of specific hardware components that cannot be altered. Typically,
it is almost impossible to add functionality at later stages, but they have a lower cost of implementation and
are, in general, less vulnerable because they offer limited choices to attackers. An extensible architecture
is like a general-purpose processor, capable of interfacing with several peripherals through standardized
connections. Peripherals can be changed or upgraded easily to increase security or to provide new
functionality. However, an attacker can connect malicious peripherals or interface with the system in untested
or unexpected ways. As testing is more difficult than for static systems, one cannot be too confident that
the system operates correctly under every possible input.
Field Programmable Gate Arrays (FPGAs) combine benefits of all types of systems and architectures,
because they combine hardware implementation performance and programmability, enabling system
reconfiguration. They are widely used to implement cryptographic primitives in various systems. Thus,
significant attention has to be paid to the security of FPGAs as independent systems. There exist research
efforts addressing this issue, where systematic approaches are developed and open problems in FPGA
security are addressed; for example, Wollinger et al. [14] provide such an approach and address several
open problems, including resistance to physical attacks.
17.3 Security Constraints in Embedded Systems Design
The design of secure systems requires special considerations, because security functions are resource
demanding, especially in terms of processing power and energy consumption. The limited resources
of embedded systems require novel design approaches in order to deal with trade-offs between
efficiency (speed and cost) and effectiveness (satisfaction of the functional and operational
requirements).
17.3.1 Energy Considerations
Embedded systems are often battery powered, that is, they are power constrained. Battery capacity constitutes
a major bottleneck to processing for security on embedded systems. Unfortunately, improvements in
battery capacity do not follow the improvements in performance, complexity, and functionality
of the systems they power. Gunther et al. [15], Buchmann [16], and Lahiri et al. [17] report the widening
battery gap, due to the exponential growth of power requirements and the linear growth in energy
density. Thus, the power subsystem of embedded systems is a weak point of system security. A malicious
attacker, for example, may mount a DoS attack by draining the system's battery more quickly than usual.
Martin et al. [18] describe three ways in which such an attack may take place: (1) service request power
attacks, (2) benign power attacks, and (3) malignant power attacks. In service request attacks, a malicious
user may repeatedly request the device to serve a power-hungry application, even if the application is
not supported by the device. In benign power attacks, the legitimate user is forced to execute an application
with high power requirements, while in malignant power attacks malicious users modify the executable
code of an existing application, in order to drain as much battery power as possible without changing the
application's functionality. They conclude that such attacks may reduce battery life by one to two orders of
magnitude.
Inclusion of security functions in an embedded system places extra requirements on power consumption
due to: (1) the extra processing power necessary to perform various security functions, such as
authentication, encryption, decryption, signing, and data verification, (2) the transmission of security-related
data between various entities, if the system is distributed, for example, a wireless sensor network, and (3) the
energy required to store security-related parameters.
Embedded systems are often used to deploy performance-critical functions, which require significant processing
power. Inclusion of the cryptographic algorithms that are used as building blocks in secure embedded
design may lead to heavy consumption of the system battery. The energy consumption of the cryptographic
algorithms used in security protocols has been analyzed well, for example, by Potlapally et al. [19]. They
present a general framework that shows asymmetric algorithms having the highest energy cost, symmetric
algorithms as the next power-hungry category, and hash algorithms at the bottom. The power required by
cryptographic algorithms is significant, as measurements indicate [20]. Importantly, in many applications
the power consumed by security functions is larger than that used for the applications themselves. For
example, Raghunathan et al. [21] present the battery gap for a sensor node with an embedded processor,
calculating the number of transactions that the node can serve working in secure or insecure mode until
the system battery runs out. Their results show that working in secure mode drains the battery in less than
half the time of working in insecure mode.
Many applications that involve embedded systems are implemented on distributed, networked
platforms, resulting in a power overhead due to communication between the various nodes of the
system [1]. Considering a wireless sensor network, which is a typical distributed embedded system,
one can easily see that significant energy is consumed in communication between the various nodes. Factors
such as modulation type, data rate, transmit power, and security overhead affect power consumption
significantly [22]. Savvides et al. [23] showed that radio communication between nodes consumes
most of the power, that is, 50 to 60% of the total power, when using the WINS (Wireless Integrated
Network Sensor) platform [24]. Furthermore, in a wireless sensor network, the security functions
consume energy due to extra internode exchange of cryptographic information (key exchange, authentication
information) and per-message security overhead, which is a function of both the number
and the size of messages [20]. It is important to identify the energy consumption of alternative security
mechanisms. Hodjat and Verbauwhede [25], for example, have measured the energy consumption
of two widely used algorithms for the exchange of key information between entities in a distributed
environment: (1) the Diffie-Hellman protocol [26] and (2) the basic Kerberos protocol [26]. Their results show
that Diffie-Hellman, implemented using elliptic curve public key cryptography, consumes 1213.7 mJ,
4296 mJ, and 9378.3 mJ for 128-bit, 192-bit, and 256-bit keys, respectively, while the Kerberos key
exchange protocol using symmetric cryptography consumes 139.62 mJ; this indicates that the Kerberos
protocol configuration consumes significantly less energy.
17.3.2 Processing Power Limitations
Security processing places significant additional requirements on the processing power of embedded
systems, since conventional architectures are quite limited. The term security processing is used to indicate
the portion of the system's computational effort that is dedicated to the implementation of the security
requirements. Since embedded systems have limited processing power, they cannot cope efficiently with
the execution of complex cryptographic algorithms, which are used in the secure design of an embedded
system. For example, the generation of a 512-bit key for the RSA public key algorithm requires 3.4 min
on the Palm IIIx PDA, while encryption using DES takes only 4.9 msec per block, leading to an encryption
rate of 13 Kbps [27]. The adoption of modern embedded systems in high-end systems (servers, firewalls,
and routers), with increasing data transmission rates and complex security protocols, such as SSL, makes
the security processing gap wider and demonstrates that existing embedded architectures need to be
improved, in order to keep up with the increasing computational requirements placed by security
processing.
The wide processing gap has been exposed by measurements, such as those by Ravi et al. [4], who measured
the security processing gap in the client-server model using the SSL protocol for various embedded
microprocessors. Specifically, considering a StrongARM (206 MHz SA-1110) processor, which may be
used in a low-end system such as a PDA or a mobile device, dedicating 100% of the processing power to
SSL processing can achieve data rates of up to 1.8 Mbps, while a 2.8 GHz Xeon achieves data rates of up to
29 Mbps. Considering that the data rates of low-end systems range between 128 Kbps and 2 Mbps, while
data rates of high-end systems range between 2 and 100 Mbps, it is clear that the processors mentioned
above cannot achieve data rates higher than their maximum, leading to a security processing gap.
17.3.3 Flexibility and Availability Requirements
The design and implementation of security in an embedded system does not mean that the system will
not change its operational security characteristics over time. Considering that security requirements
evolve and security protocols are continuously strengthened, embedded systems need to be flexible and
adaptable to changes in security requirements, without losing their performance and availability goals as
well as their primary security objectives.
Modern embedded systems are characterized by their ability to operate in different environments, under
various conditions. Such an embedded system must be able to achieve different security objectives in every
environment; thus, the system must be characterized by significant flexibility and efficient adaptation.
For example, consider a PDA with mobile telecommunication capabilities that may operate in a wireless
environment [28-30] or provide 3G cellular services [31]; different security objectives must be satisfied in
each case. Another issue that must be addressed is the implementation of different security requirements
at different layers of the protocol architecture. Consider, for example, a mobile PDA that must be able to
execute several security protocols, such as IPSec [13], SSL [12], and WEP [32], depending on its specific
application.
Importantly, availability is a significant requirement that needs special support, considering that it
should be provided in a world of evolving functionality and increasing system complexity.
Conventional embedded systems should aim to provide high availability not only in their
expected, attack-free environment but in an emerging hostile environment as well.
17.3.4 Cost of Implementation
Inclusion of security in embedded system design can increase system cost dramatically. The problem
originates from the strong resource limitations of embedded systems, under which the system is required
to exhibit great performance as well as a high level of security while retaining a low cost of implementation.
It is necessary to perform a careful, in-depth analysis of the designed system, in terms of the abilities of
possible adversaries, the environmental conditions under which the system will operate, etc., in order
to estimate cost realistically. Consider, for example, the incorporation of a tamper-resistant cryptographic
module in an embedded system. As described by Ravi et al. [4], according to the Federal Information
Processing Standard [33], a designer can distinguish four levels of security requirements for cryptographic
modules. The choice of security level influences design and implementation cost significantly; so, the
manufacturer faces a trade-off between the security requirements that will be implemented and the cost
of manufacturing.
17.4 Design of Secure Embedded Systems
Secure embedded systems must provide basic security properties, such as data integrity, as well as mechanisms
and support for more complex security functions, such as authentication and confidentiality.
Furthermore, they have to support the security requirements of applications, which are implemented,
in turn, using the security mechanisms offered by the system. In this section, we describe the main design
issues at both the system and the application level.
17.4.1 System Design Issues
Design of secure embedded systems needs to address several issues and parameters, ranging from the
employed hardware technology to software development methodologies. Although several techniques
used in general-purpose systems can be used effectively in embedded system development as well, there
are specific design issues that need to be addressed separately, because they are unique to, or weaker in,
embedded systems, due to the high volume of available low-cost systems that can be used by malicious
users to develop attacks. The major design issues are tamper-resistance properties, memory
protection, IP protection, management of processing power, communication security, and embedded
software design. These issues are covered in the following paragraphs.
Modern secure embedded systems must be able to operate in various environmental conditions, without
loss of performance or deviation from their primary goals. In many cases they must survive various
physical attacks, and thus must have tamper-resistance mechanisms. Tamper resistance is the property that enables
systems to prevent the distortion of their physical parts. In addition to tamper-resistance mechanisms, there
exist tamper-evidence mechanisms, which allow users or technical staff to identify tampering attacks
and take countermeasures. Computer systems are vulnerable to tampering attacks, where malicious users
intervene in hardware system parts and compromise them, in order to take advantage of them. The security of
many critical systems relies on the tamper resistance of smartcards and other embedded processors. Anderson
and Kuhn [7] describe various techniques and methods to attack tamper-resistant systems, concluding
that tamper-resistance mechanisms need to be extended or reevaluated.
Memory technology may be an additional weakness in system implementation. Typical embedded systems
have ROM, RAM, and EEPROM memory to store data. EEPROM memory constitutes the vulnerable
spot of such systems, because it can be erased by malicious users with the use of appropriate electrical
signaling [7].
Intellectual Property (IP) protection of manufacturers is an important issue addressed in secure embedded
systems. Complicated systems tend to be partitioned into smaller independent modules, leading to
module reusability and cost reduction. These modules embody the IP of the manufacturers, which needs
to be protected from third-party users, who might claim and use these modules. The illegal users of an
IP block do not necessarily need to have full, detailed knowledge of the IP component, since IP blocks are
independent modules which can very easily be incorporated and integrated with the rest of the system
components. Lach et al. [34] propose a fingerprinting technique for IP blocks implemented using FPGAs,
through a unique marker embedded in the IP hardware that identifies both the origin and the recipient
of the IP block. They also state that the removal of such a mark is extremely difficult, with a
probability of less than one in a million.
Implementation of security techniques for tamper resistance, tamper prevention, and IP protection
may require additional processing power, which is limited in embedded systems. The processing gap
between the computational requirements of security and the available processing power of embedded
processors requires special consideration. A variety of architectures and enhancements in security protocols
have been proposed in order to bridge that gap. Burke et al. [35] propose enhancements to
the Instruction Set Architecture (ISA) of embedded processors, in order to efficiently calculate various
cryptographic primitives, such as permutations, bit rotations, fast substitutions, and modular arithmetic.
Another approach is to build dedicated cryptographic embedded coprocessors with their own
ISA; the CryptoManiac coprocessor [36] is an example of this approach. Several vendors, for example,
Infineon [37] and ARM [38], have manufactured microcontrollers with embedded coprocessors
dedicated to serving cryptographic functions. Intel [39] announced a new generation of 64-bit embedded
processors with features that can speed up processing-hungry algorithms, such as cryptographic
ones; these features include larger register sets, parallel execution of computations, improvements in
large-integer multiplication, etc. A third approach exploits software optimizations. Potlapally
et al. [40] have conducted extensive research into the improvement of public-key algorithms, studying
various algorithmic optimizations and identifying an algorithm design space where performance is improved
significantly. Also, SmartMIPS [41] provides system flexibility and adaptation to changes in security
requirements through high-performance software-based enhancements of its cryptographic modules,
while it supports various cryptographic algorithms.
Even if the processing gap is bridged and security functions are provided, embedded systems are
required to support secure communications as well, considering that embedded applications are often
implemented in a distributed environment where communicating systems may exchange (possibly) sensitive
data over an untrusted network (wired, wireless, or mobile), such as the Internet, a Virtual Private Network,
the public telephone network, etc. In order to fulfill the basic security requirements for secure communications,
embedded systems must be able to use strong cryptographic algorithms and to support various
protocols. One of the fundamental requirements for secure protocols is interoperability, leading to
the requirement for system flexibility and adaptability. Since an embedded system can operate in several
environments, for example, a mobile phone may provide 3G cellular services or connect to a wireless
LAN, it is necessary for the system to operate securely in all environments without loss of performance.
Furthermore, as security protocols are developed for various layers of the OSI reference model, embedded
systems must be adaptable to different security requirements at each layer of the architecture. Finally,
the continuous evolution of security protocols requires system flexibility, as new standards are developed,
requirements are reevaluated, and new cryptographic techniques are added to the overall architecture.
A comprehensive presentation of the evolution of security protocols in wireless communications, such as
WTLS [42], MET [43], and IPSec [13], is provided by Raghunathan et al. [21]. An important consideration
in the development of flexible secure communication subsystems for embedded systems is the limitation
of energy, processing, and memory resources. The performance/cost trade-off leads to special attention
to the placement of protocol functions in hardware (for high performance) or software (for
cost reduction).
Embedded software, such as the operating system or application-specific code, constitutes a crucial
factor in secure embedded system design. Kocher and co-workers [3] identify three basic factors that make
embedded software development a challenging area of security: (1) complexity of the system, (2) system
extensibility, and (3) connectivity. Embedded systems serve critical, complex, and hard-to-implement
applications, with many parameters that need to be considered, which, in turn, leads to buggy and
vulnerable software. Furthermore, the required extensibility of conventional embedded systems makes
the exploitation of vulnerabilities relatively easy. Finally, as modern embedded systems are designed with
network connectivity, the higher the connectivity degree of the system, the higher the risk for a software
breach to expand as time goes by. Many attacks can be implemented by malicious users who exploit software
glitches and lead to system unavailability, which can have a disastrous impact, for example, a DoS attack
on a military embedded system. Landwehr et al. [9] present a survey of common software security faults,
helping designers to learn from past faults. Tevis and Hamilton [11] propose some methods to detect and
prevent software vulnerabilities, focusing on weaknesses that have to be avoided in order to prevent
buffer overflow attacks, heap overflow attacks, array indexing attacks, etc. They also provide some coded
security programs that help designers analyze the security of their software. Buffer overflow attacks
constitute the most widely used type of attack leading to unavailability of the attacked system; with
these attacks, malicious users exploit system vulnerabilities and are able to execute malicious code, which
can cause several problems, such as a system crash preventing legitimate users from using the system,
loss of sensitive data, etc. Shao et al. [44] propose a technique, called Hardware/Software Defender, which
aims to protect an embedded system from buffer overflow attacks; their proposal is to design a secure
instruction set, extending the instruction set of existing microprocessors, and to require outside
software developers to call secure functions from that set. The limited memory resources of embedded
systems, specifically the lack of disk space and virtual memory, make the system vulnerable to
memory-hungry applications: applications that require an excessive amount of memory do not have a swap
file to grow into and can very easily cause an out-of-memory unavailability of the system. Given the significance
of this potential problem and attack, Biswas et al. [45] propose mechanisms to protect an embedded system
from such a memory overflow, thus providing reliability and availability of the system: (1) use of software
runtime checks, in order to detect possible out-of-memory conditions, (2) allowing out-of-memory data
segments to be placed in free system space, and (3) compressing already used and unnecessary data.
17.4.2 Application Design Issues
Embedded system applications present significant challenges to system designers aiming to achieve
efficient and secure systems.
A key issue in secure embedded design is user identification and access control. User identification
includes the necessary mechanisms that guarantee that only legitimate users have access to system resources,
and that can also verify, whenever requested, the identity of the user who has access to the system. The explosive
growth of mobile devices and their use in critical, sensitive transactions, such as bank transactions,
e-commerce, etc., demands secure systems with high performance and low cost. This demand has become
urgent and crucial considering the successful attacks on these systems, such as the recent hardware hacking
attacks on PIN (Personal Identification Number)-based bank ATMs (Automatic Teller Machines), which have
led to significant loss of money and decreased the credibility of financial organizations in the eyes of the public.
A solution to this problem may come from an emerging new technology for user identification that is
based on biometric recognition, for both user identification and verification. Biometrics are based on
pattern recognition in biological data acquired from a user who wants to gain access to a system,
that is, palm prints [46], fingerprints [47], iris scans, etc., and on comparing them with the data stored
in databases identifying the legitimate users of the system [48]. Moon et al. [49] propose
a secure smartcard that uses biometric capabilities, claiming that such a system is
less vulnerable to attacks when compared with software-based solutions, and that the combination of
smartcard and fingerprint recognition is much more robust than PIN-based identification. Implementation
of such systems is realistic, as Tang et al. [50] illustrated with the implementation of a fingerprint
recognition system with high reliability and high speed; they achieved an average computational time
per fingerprint image of less than 1 sec, using a fixed-point arithmetic StrongARM 200 MHz embedded
processor.
As mentioned previously, an embedded system must store information that enables it to identify and
validate the users that have access to the system. But how does an embedded system store this information?
Embedded systems use several types of memory to store different types of data: (1) ROM/EPROM to store
programming data used to serve generic applications, (2) RAM to store temporary data, and (3) EEPROM
and FLASH memories to store mobile downloadable code [20]. In an embedded device such as a PDA
or a mobile phone, several pieces of sensitive information, such as PINs, credit card numbers, personal
data, keys, and certificates for authorization purposes, may be permanently stored in secondary storage
media. The requirement to protect this information, as well as the rapid growth of the communications
capabilities of embedded devices, for example, mobile Internet access, which makes embedded systems
vulnerable to network attacks as well, leads to increasing demands for secure storage space. The use
of hard cryptographic algorithms to ensure data integrity and condentiality is not feasible in most
embedded systems, mainly due to their limited computational resources. Benini et al. [51] present a
survey of architectures and techniques used to implement memory for embedded systems, taking into
consideration energy limitations of embedded systems. Rosenthal [52] presents an effective way to ensure
that data cannot be erased or destroyed by hiding memory from the processor through use of a serial
EEPROM, which is the same as standard EEPROM with the only difference that a serial link binds the
memory with the processor reading/writing data, using a strict protocol. Actel [53] describes security
issues and design considerations for the implementation of embedded memories using FPGAs claiming
2006 by Taylor & Francis Group, LLC
17-10 Embedded Systems Handbook
that SRAM FPGAs are vulnerable to Level I attacks [5], while it is preferable to use nonvolatile Flash-
and antifuse-based FPGA memories, which provide higher levels of security relative to SRAM FPGAs.
Another key issue in secure embedded system design is to ensure that any digital content already stored
in or downloaded to the embedded system will be used according to the terms and conditions the content
provider has set and in accordance with the agreements between user and provider; such content includes
software for a specific application or a hardware component embedded in the system by a third-party
vendor. It is essential that conventional embedded devices, mobile or not, be enhanced with Digital Rights
Management (DRM) mechanisms, in order to protect the digital IP of manufacturers and vendors. Trusted
computing platforms constitute one approach to resolving this problem. Such platforms are significant, in
general, as indicated by the Trusted Computing Platform Alliance (TCPA) [54], which tries to standardize
the methods to build trusted platforms. For embedded systems, IP protection can be implemented in
various ways. A method to produce a trusted computing platform based on a trusted, secure hardware
component, called a spy, can lead to systems executing one or more applications securely [55,56]. Ways to
transform a 3G mobile device into a trusted one, capable of protecting content against analysis and
probing of the various components in a trusted system, have been investigated by Messerges and Dabbish [57];
for example, the operating system of the embedded system is enhanced with a DRM security
hardware/software component, which transforms the system into a trusted one. Alternatively, Thekkath et al. [58]
propose a method to prevent unauthorized reading, modification, and copying of proprietary software
code, using an eXecute Only Memory (XOM) system that permits only code execution. The concept is that
code stored in a device can be marked as execute-only, and content-sensitive applications can be stored
in independent compartments [59]. If an application tries to access data outside its compartment, it
is stopped.
Significant attention has to be paid to protection against possible attacks through malicious downloadable
software, such as viruses, Trojans, logic bombs, etc. [60]. The wide deployment of distributed embedded
systems and the Internet has created the requirement that portable embedded systems, for
example, mobile phones and PDAs, be able to download and execute various software applications. This ability
may be new to the world of portable, highly constrained embedded systems, but it is not new in the world of
general-purpose systems, which have long been able to download and execute Java applets and executable
files from the Internet or from other network resources. One major problem with this service
is that users cannot be sure about the content of the software that is downloaded and executed on their
system(s), who its creator is, and what its origin is. Kingpin and Mudge [61] provide a comprehensive
presentation of security issues in personal digital assistants, analyzing in detail what malicious software
is (i.e., viruses, Trojans, backdoors, etc.), where it resides, and how it spreads, giving future users
of such devices a deeper understanding of the extra security risks that arise with the use of mobile
downloadable code. An additional important consideration is the robustness of the downloadable code:
once the mobile code is considered secure, downloaded, and executed, it must not affect preinstalled
system software. Various techniques have been proposed to protect remote hosts from malicious mobile
code. The sandbox technique, proposed by Rubin and Geer [62], is based on the idea that the mobile code
cannot execute system functions, that is, it cannot affect the file system or open network connections.
Instead of disabling mobile code from execution, one can empower it under enhanced security policies, as
Venkatakrishnan et al. [63] propose. Necula [64] suggests the use of proof-carrying code: the producer
of the mobile code, a possibly untrusted source, must embed some type of proof that can be tested by the
remote host in order to establish the validity of the mobile code.
17.5 Cryptography and Embedded Systems
Secure embedded systems should support the basic security functions of (1) confidentiality, (2) integrity,
and (3) authentication. Cryptography provides mechanisms that ensure these three requirements
are met. However, implementation of cryptography in embedded systems can be a challenging
task. The requirement of high performance has to be achieved in a resource-limited environment; this
Design Issues in Secure Embedded Systems 17-11
task is even more challenging when low-power constraints exist. Performance usually dictates an increased
cost, which is not always desirable or possible. Cryptography can protect digital assets provided that the
secret keys of the algorithms are stored and accessed in a secure manner. For this reason, the use of specialized
hardware devices to store the secret keys and to implement cryptographic algorithms is preferred over the
use of general-purpose computers. However, this also increases the implementation cost and results in
reduced flexibility. On the other hand, flexibility is required, because modern cryptographic protocols do
not rely on a specific cryptographic algorithm but rather allow the use of a wide range of algorithms for
increased security and adaptability to advances in cryptanalysis. For example, both the SSL and IPSec network
protocols support numerous cryptographic algorithms that perform the same function, for example,
encryption. The protocol enables negotiation of the algorithms to be used, in order to ensure that both
parties use the desired level of protection dictated by their security policies.
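Such negotiation can be sketched as follows. The algorithm names, the preference ordering, and the `negotiate` helper are illustrative assumptions, not the actual SSL/IPSec handshake logic; the point is only the first-mutually-acceptable-choice pattern.

```python
# Hypothetical sketch of cipher-suite negotiation in the spirit of
# SSL/TLS or IPSec: the client offers its supported algorithms in
# preference order and the server picks the first mutually acceptable one.

def negotiate(client_offer, server_supported):
    """Return the first client-preferred algorithm the server accepts."""
    for algorithm in client_offer:
        if algorithm in server_supported:
            return algorithm
    raise ValueError("no common cipher suite; connection must be refused")

client_offer = ["AES-256-GCM", "AES-128-CBC", "3DES-CBC"]
server_supported = {"AES-128-CBC", "3DES-CBC"}

print(negotiate(client_offer, server_supported))  # AES-128-CBC
```

If the intersection is empty, the connection is refused rather than silently downgraded, which is what the security policies of both parties require.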
Apart from the performance issue, a correct cryptographic implementation requires expertise that is
not always available or affordable during the lifecycle of a system. Insecure implementations of theoretically
secure algorithms have made their way to headline news quite often in the past. An excellent
survey of cryptography implementation faults is provided in [65], while Anderson [66] focuses on the
causes of cryptographic system failures in banking applications. A common misunderstanding is the use
of random numbers. Pure Linear Feedback Shift Registers (LFSRs) and other pseudorandom number
generators produce random-looking sequences that may be sufficient for scientific experiments but can
be disastrous for cryptographic algorithms, which require unpredictable random input. On the other
hand, the cryptographic community has focused on proving the theoretical security of various cryptographic
algorithms and has paid little attention to actual implementations on specific hardware platforms.
In fact, many algorithms are designed with portability in mind, and efficient implementation on a specific
platform meeting specific requirements can be quite tricky. This communication gap between vendors
and cryptographers intensifies in the case of embedded systems, which can have many design choices and
constraints that are not easily comprehensible.
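The LFSR danger can be demonstrated concretely. The sketch below uses the classic 16-bit maximal-length register (taps chosen for the polynomial x^16 + x^14 + x^13 + x^11 + 1); the seed value and tap choice are the standard textbook example, and the point is that 16 observed output bits fully determine the internal state, so every later bit is predictable.

```python
# A 16-bit Fibonacci LFSR. Its output looks random, but an eavesdropper
# who sees 16 consecutive output bits can clone the register and predict
# every future bit - fatal for cryptographic use.

def step(state):
    # feedback taps at bit positions 0, 2, 3, 5
    bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
    out = state & 1                       # bit shifted out this cycle
    return (state >> 1) | (bit << 15), out

def stream(state, n):
    """Return n output bits of the LFSR for a nonzero 16-bit seed."""
    bits = []
    for _ in range(n):
        state, out = step(state)
        bits.append(out)
    return bits

observed = stream(0xACE1, 64)
# The first 16 output bits are exactly the initial register contents
# (least significant bit first), so the "secret" seed is recovered and
# the remaining stream is reproduced perfectly.
recovered = sum(b << i for i, b in enumerate(observed[:16]))
assert recovered == 0xACE1
assert stream(recovered, 64) == observed
```

A cryptographically secure generator must not allow state recovery from its output; pseudorandomness that merely passes statistical tests is not enough.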
In the late 1990s, Side-Channel Attacks (SCAs) were introduced. SCAs are a method of cryptanalysis that
focuses on the implementation characteristics of a cryptographic algorithm in order to derive its secret
keys. This advancement bridged the gap between embedded systems, a common target of such attacks, and
cryptographers. Vendors became aware of and concerned by this new form of attack, while cryptographers
focused on the specifics of the implementations, in order to advance their cryptanalysis techniques.
In this section, we present side-channel cryptanalysis. First, we introduce the concept of tamper
resistance, the realization of side channels, and the information leakage through them from otherwise
secure devices; then, we demonstrate how this information can be exploited to recover the secret keys of a
cryptographic algorithm, presenting case studies of attacks on the RSA algorithm.
17.5.1 Physical Security
Secrecy is always a desirable property. In the case of cryptographic algorithms, the secret keys of the
algorithm must be stored, accessed, used, and destroyed in a secure manner, in order to provide the
required security functions. This statement is often overlooked, and design or implementation flaws result
in insecure cryptographic implementations. It is well known that general-purpose computing systems and
operating systems cannot provide sufficient protection mechanisms for cryptographic keys. For example,
SSL certificates for web servers are stored unprotected on servers' disks and rely on file system permissions
for protection. This is necessary because web servers must offer secure services unattended. The alternative,
a human providing the password to access the certificate for each connection, would not be an
efficient choice in the era of e-commerce, where thousands of transactions are made every day. On the
other hand, any software bug in the operating system, in a high-privileged application, or in the web server
software itself may expose this certificate to malicious users.
Embedded systems are commonly used for implementing security functions. Since they are complete
systems, they can perform the necessary cryptographic operations in a sealed and controlled environment
[67-69]. Tamper resistance refers to the ability of a system to resist tampering attacks, that is,
attempts to bypass its attack-prevention mechanisms. The IBM PCI Cryptographic Coprocessor [70] is
such a system, having achieved FIPS 140-2 Level 4 certification [33]. The advancement of DRM technology into
consumer devices and general-purpose computers drives the use of embedded systems for cryptographic
protection of IP. Smartcards are a well-known example of tamper-resistant embedded systems that are
used for financial transactions and subscription-based service provision.
In many cases, embedded systems used for security-critical operations do not implement any tamper-resistance
mechanisms. Rather, a thin layer of obscurity is preferred, for reasons of both simplicity and performance.
However, as users become more interested in bypassing the security mechanisms of the system,
the thin layer of obscurity is easily broken and the cryptographic keys are publicly exposed. The Adobe
eBook software encryption [71], the Microsoft Xbox case [72], USB hardware token devices [73], and
the DVD CSS copy protection scheme [74] are examples of systems that implemented security by
obscurity and were easily broken.
Finally, an often neglected issue is lifecycle-wide management of cryptographic systems. While a
device may be withdrawn from operation, the data it has stored or processed over time may still need to
be protected. Key security that relies on the fact that only authorized personnel have access to the
system may not be sufficient for a recycled device. Garfinkel and Shelat [75], Skorobogatov [76], and
Gutmann [77] present methods for recovering data from devices using noninvasive techniques.
17.5.2 Side-Channel Cryptanalysis
Until the mid-1990s, academic research on cryptography focused on the mathematical properties of
cryptographic algorithms. Paul Kocher was the first to present cryptanalysis attacks on implementations
of cryptographic algorithms, attacks based on the implementation properties of a system.
Kocher observed that a cryptographic implementation of the RSA algorithm required varying amounts
of time to encrypt a block of data depending on the secret key used. Careful analysis of the timing
differences allowed him to derive the secret key, and he extended this method to other algorithms as
well [78]. This result came as a surprise, since the RSA algorithm had withstood years of mathematical
cryptanalysis and was considered secure [79]. A short time later, Boneh et al. presented theoretical attacks
showing how to derive the secret keys from implementations of the RSA algorithm and the Fiat-Shamir and
Schnorr identification schemes [80], revised in Reference 81, while similar results were presented by
Bao et al. [82].
These findings revealed a new class of attacks on cryptographic algorithms. The term side-channel
attacks (SCAs), which first appeared in Reference 83, has been widely used to refer to this type of cryptanalysis,
while the terms fault-based cryptanalysis, implementation cryptanalysis, active/passive hardware attacks,
leakage attacks, and others have also been used. Cryptographic algorithms acquired a new security
dimension: that of their exact implementation. Cryptographers had previously focused on understanding the
underlying mathematical problems, to prove or conjecture the security of a cryptographic algorithm
in terms of abstract mathematical symbols. Now, regardless of the hardness of the underlying mathematical
problems, an implementation may be vulnerable and allow the extraction of secret keys or other sensitive
material. Implementation vulnerabilities are of course not a new security concept. In the previous
section, we presented some impressive attacks on security that were based on implementation faults. The
new insight of SCA is that even cryptographic algorithms that are otherwise considered secure can be
vulnerable to such faults. This observation is of significant importance, since cryptography is widely
used as a major building block for security; if the cryptographic algorithms can be rendered insecure, the whole
construction collapses.
Embedded systems, and especially smartcards, are a popular target for SCA. To understand this, recall that
such systems are usually owned by a service provider, such as a mobile phone operator, a TV broadcaster,
or a bank, and possessed by service clients. The service provider relies on the security of the embedded
system in order to prove service usage by the clients, such as phone calls, movie viewing, or a purchase, and
to charge the client accordingly. On the other hand, consumers have an incentive to bypass these mechanisms
in order to enjoy free services. Given that SCAs are implementation specific and rely, as we will present later,
on the ability to interfere, passively or actively, with the device implementing a cryptographic algorithm,
embedded systems are a further attractive target, given their resource limitations, which make attack
efforts easier.
In the following, we present the classes of SCA and the countermeasures that have been developed. The
technical field remains highly active, since ingenious channels continuously appear in the literature.
Embedded system vendors must study the attacks carefully, evaluate the associated risks for their environment,
and ensure that appropriate countermeasures are implemented in their systems; furthermore, they
must be prepared to adapt promptly to new techniques for deriving secrets from their systems.
17.5.3 Side-Channel Implementations
A side channel is any physical channel that can carry information from the operation of a device while
it implements a cryptographic operation; such channels are not captured by the existing abstract mathematical
models. The definition is quite broad, and the inventiveness of attackers is remarkable. Timing
differences, power consumption, electromagnetic emissions, acoustic noise, and faults have all been
exploited for leaking information out of cryptographic systems.
The channel realizations can be categorized in three broad classes: physical or probing attacks, fault-induction
or glitch attacks, and emission attacks, such as TEMPEST. We briefly review the first two classes;
readers interested in TEMPEST attacks are referred to Reference 84.
The side channels may seem unavoidable and a frightening threat. However, it should be strongly
emphasized that in most cases, reported attacks, both theoretical and practical, rely for their success on
detailed knowledge of the platform under attack and of the specific implementation of the cryptographic
algorithm. For example, power analysis is successful in most cases because cryptographic algorithms
tend to use only a small subset of a processor's instruction set, and especially simple instructions, such
as LOAD, STORE, XOR, AND, and SHIFT, in order to yield elegant, portable, and high-performance
implementations. This allows an attacker to minimize the power profiles he or she must construct
and simplifies the distinction between the different instructions being executed.
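Why data-dependent power draw helps the attacker can be illustrated with a toy Hamming-weight power model, a standard first-order assumption in the power analysis literature. The trace values below are simulated, not measured, and the helper names are illustrative.

```python
# Illustrative first-order power model (a common assumption, not a
# measurement): in CMOS devices, instantaneous power draw often
# correlates with the Hamming weight of the data being manipulated.
# A toy "trace" for XORing a secret key byte into data makes the
# key dependence visible.

def hamming_weight(x):
    return bin(x).count("1")

def leaky_xor_trace(data_bytes, key_byte):
    """Return (results, simulated power samples) for data XOR key."""
    out, samples = [], []
    for d in data_bytes:
        v = d ^ key_byte
        out.append(v)
        samples.append(hamming_weight(v))  # leakage: HW of the intermediate
    return out, samples

_, trace_a = leaky_xor_trace([0x00, 0x01, 0x03], key_byte=0x3C)
_, trace_b = leaky_xor_trace([0x00, 0x01, 0x03], key_byte=0xFF)
assert trace_a == [4, 5, 6] and trace_b == [8, 7, 6]  # distinguishable
```

An attacker who can choose or observe the data and record many such traces can test key-byte hypotheses by correlation, which is exactly the principle behind differential power analysis.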
17.5.3.1 Fault-Induction Techniques
Devices are always susceptible to erroneous computations or other kinds of faults, for several reasons.
Faulty computations are a known issue in space systems because, in deep space, devices are exposed to
radiation that can cause temporary or permanent bit flips, gate destruction, or other problems. Incomplete
testing during manufacturing may allow imperfect designs to reach the market, as in the case of the
Intel Pentium FDIV bug [85], or devices may be operated in conditions outside their specifications [86].
Careful manipulation of the power supply or the clock oscillator can also cause glitches in code
execution by tricking the processor, for example, into executing unknown instructions or bypassing a control
statement [87].
Some researchers have questioned the feasibility of fault-injection attacks on real systems [88]. While
fault injection may seem an approach that requires expensive and specialized equipment, there have
been reports that fault injection can be achieved with low-cost and readily available equipment. Anderson
and Kuhn [89] and Anderson [66] present low-cost attacks on tamper-resistant devices, which achieve
extraction of secret information from smartcards and similar devices. Kömmerling and Kuhn [87] present
noninvasive fault-injection techniques, for example, manipulation of the power supply. Anderson [90] supports
the view that the underground community has been using such techniques for quite a long time to
break the security of the smartcards of pay-TV systems. Furthermore, Weingart [6] and Aumüller et al. [91]
present attacks performed in a controlled lab environment, proving that fault-injection attacks are feasible.
Skorobogatov and Anderson [140] introduce low-cost light flashes, such as a camera flash, as a means
to introduce errors, while eddy-current attacks are introduced in Reference 92. A complete presentation
of fault-injection methods is given in Reference 93, along with experimental evidence of the
applicability of the methods to industrial systems and anecdotal information.
The combined time/space isolation problem [94] is of significant importance in fault-induction attacks.
The space isolation problem refers to isolating the appropriate space (area) of the chip in which to
introduce the fault. The space isolation problem has four parameters:
1. Macroscopic. The part of the chip where the fault can be injected. Possible answers can be one or
more of the following: main memory, address bus, system bus, register file.
2. Bandwidth. The number of bits that can be affected. It may be possible to change just one bit or
multiple bits at once. The exact number of changed bits can be controllable (e.g., one) or follow a random
distribution.
3. Granularity. The area where the error can occur. The attacker may direct the fault-injection position
at bit level or at a wider area, such as a byte or a multibyte area. The fault-injected area can be covered by
a single error or by multiple errors. How are these errors distributed with respect to the area? They may
cluster around the target or be evenly distributed.
4. Lifetime. The time duration of the fault. It may be a transient fault or a permanent fault. For
example, a power glitch may cause a transient fault at a memory location, since the next time the location
is written, a new value will be correctly stored. In contrast, cell or gate destruction will result in
a permanent error, since the output bit will be stuck at 0 or 1, independently of the input.
The time isolation problem refers to the time at which a fault is injected. An attacker may be able to
synchronize exactly with the clock of the chip or may introduce the error in a random fashion. This
granularity is the only parameter of the time isolation problem. Clearly, the ability to inject a fault with
clock-period granularity is desirable, but impractical in real-world applications.
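The lifetime parameter above can be pictured with a toy software model (an assumed illustration, not a description of real hardware): a transient fault flips bits once and disappears on the next write, while a permanent stuck-at fault survives rewrites.

```python
# Toy model of the fault-lifetime parameter: a transient bit flip is
# erased by the next write, whereas a permanent stuck-at-1 fault
# corrupts every subsequent write.

class FaultyCell:
    def __init__(self, value):
        self.value = value
        self.stuck_mask = 0          # bits permanently stuck at 1

    def write(self, value):
        self.value = value | self.stuck_mask

    def transient_flip(self, mask):
        self.value ^= mask           # one-shot bit flip

    def permanent_stuck_at_one(self, mask):
        self.stuck_mask |= mask
        self.value |= mask

cell = FaultyCell(0b1010)
cell.transient_flip(0b0001)
assert cell.value == 0b1011          # fault visible now...
cell.write(0b1010)
assert cell.value == 0b1010          # ...but erased by the rewrite
cell.permanent_stuck_at_one(0b0100)
cell.write(0b0000)
assert cell.value == 0b0100          # stuck bit survives writing zero
```

The bandwidth and granularity parameters correspond to how wide and how precisely placed the `mask` can be made by the attacker.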
17.5.3.2 Passive Side Channels
Passive side channels are not a new concept in cryptography and security. The information available from
the now partially declassified TEMPEST project reveals helpful insights into how electromagnetic emissions
occur and can be used to reconstruct signals for surveillance purposes. A good review of the subject
is provided in chapter 15 of Reference 90. Kuhn [84,95,96] presents innovative use of electromagnetic
emissions to reconstruct information from CRT and LCD displays, while Loughry and Umphress [97]
reconstruct information flowing through network devices using the emissions of their LEDs.
The new concept in this area is the fact that such emissions can also be used to derive secret information
from an otherwise secure device. Probably the first such attack took place in 1956 [98]. MI5, the British
intelligence service, used a microphone to capture the sound of the rotor clicks of a Hagelin machine in order
to deduce the core position of some of its rotors. This reduced the problem of calculating
the initial setup of the machine to within the range of their then-available resources, allowing them to eavesdrop
on the encrypted communications for quite a long time. While this so-called acoustic cryptanalysis may seem
outdated, researchers have recently provided a fresh look at the topic by monitoring low-frequency (kHz)
sounds and correlating them with operations performed by a high-frequency (GHz) processor [99].
Researchers have been quite creative and have used many types of emissions or other physical interactions
of the device with the environment in which it operates. Kocher [78] introduced the idea of monitoring the
execution time of a cryptographic algorithm in order to identify the secret keys used. The key concept
in this approach is that an implementation of an algorithm may contain branches and other conditional
execution, or the implementation may follow different execution paths. If these variances depend on the
bit values of a secret key, then statistical analysis can reveal the secret key bit by bit. Coron et al. [100]
explain the power dissipation sources and causes, while Kocher et al. [101] present how power consumption
can also be correlated with key bits. Rao and Rohatgi [102] and Quisquater and Samyde [86]
introduce electromagnetic analysis. Probing attacks can also be applied to reveal the Hamming weight of
data transferred across a bus or stored in memory; this approach is also heavily dependent on the exact
hardware platform [103,104].
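The timing dependence Kocher exploited is visible in textbook left-to-right square-and-multiply exponentiation. In the sketch below, operation counts stand in for wall-clock time; the toy modulus and exponents are illustrative only.

```python
# Textbook left-to-right square-and-multiply for m^d mod n. Each 1 bit
# in the exponent costs an extra multiplication, so total running time
# leaks the Hamming weight (and, with finer analysis, individual bits)
# of the secret exponent d.

def modexp_counting(m, d, n):
    """Return (m**d % n, number of modular multiplications performed)."""
    result, ops = 1, 0
    for bit in bin(d)[2:]:
        result = (result * result) % n    # square: always performed
        ops += 1
        if bit == "1":
            result = (result * m) % n     # multiply: only for 1 bits
            ops += 1
    return result, ops

r1, ops_low = modexp_counting(5, 0b1000001, 2357)   # few 1 bits: cheap
r2, ops_high = modexp_counting(5, 0b1111111, 2357)  # many 1 bits: costly
assert ops_low < ops_high
assert r1 == pow(5, 0b1000001, 2357)                # result is correct
```

Constant-time implementations defeat this particular leak by performing the same operations regardless of key bits, for example with a Montgomery ladder.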
While passive side channels are usually considered in the context of embedded systems and other resource-limited
environments, complex computing systems may also have passive side channels. Page [105]
explores the theoretical use of timing variations due to the processor cache in order to extract secret keys.
Song et al. [106] take advantage of a timing channel in the secure communication protocol SSH to recover
user passwords, while Felten and Schneider [107] present timing attacks on web privacy. A malicious web
server can inject client-side code that fetches specific pages transparently on behalf of the user; the
server would like to know if the user has visited these pages before. The time difference between fetching
a web page from the remote server and accessing it from the user's cache is sufficient to identify whether the
user has visited the page before. A more impressive result, directly related to cryptography, is presented in
Reference 108, where remote timing attacks on web servers implementing the SSL protocol are shown to
be practical and the malicious user can extract the server's certificate private key by measuring its response
times.
17.5.4 Fault-Based Cryptanalysis
The first theoretical active attacks were presented in References 80 and 82. The attacks in the former paper
focused on RSA, when implemented with the Chinese Remainder Theorem (CRT) and the Montgomery
multiplication method, and on the Fiat-Shamir and Schnorr identification schemes. The latter work
focuses on cryptosystems whose security is based on the Discrete Logarithm Problem and presents attacks
on the ElGamal signature scheme, the Schnorr signature scheme, and DSA. The attack on the Schnorr
signature scheme is extended, with some modification, to the identification scheme as well. Furthermore,
the second paper independently reports an attack on RSA with Montgomery multiplication. Since then, this area has been
quite active, both in developing attacks based on fault induction and in developing countermeasures. The attacks have
succeeded against most of the popular and widely used algorithms. In the following, we give a brief review of
the literature.
The attacks on RSA with Montgomery multiplication have been extended by attacking the signing key instead of
the message [109]. Furthermore, similar attacks are presented for the LUC and KMOV (elliptic-curve-based)
cryptosystems. In Reference 110, the attacks are generalized to any RSA-type cryptosystem, with
the LUC and Demytko cryptosystems as examples. Faults can be used to expose the private key of the RSA-KEM
scheme [111], and transient faults can be used to derive the RSA and DSA secret keys from applications
compatible with the OpenPGP format [112]. The Bellcore attack on the Fiat-Shamir scheme is shown to be
incomplete in Reference 94; the Precautious Fiat-Shamir scheme is introduced there, which defends against it.
A new attack that succeeds against both the classical and the Precautious Fiat-Shamir schemes is presented
in Reference 113.
Beginning with Biham and Shamir [114], fault-based cryptanalysis has also focused on symmetric-key cryptosystems.
DES is shown to be vulnerable to so-called Differential Fault Analysis (DFA), using only 50 to 200
faulty ciphertexts. The method also extends to unknown cryptosystems, and an example of an attack on
the once-classified algorithm SkipJack is presented. Another variant of the attack on DES takes advantage
of permanent instead of transient faults. The same ideas are also explored and extended for completely
unknown cryptosystems in Reference 115, while Jacob et al. [116] use faults to attack obfuscated ciphers
in software and extract secret material while avoiding de-obfuscation of the code.
For some time it was believed that fault-induction attacks could only succeed against cryptographic schemes
based on algebraic hard mathematical problems, such as number factoring and discrete logarithm
computation. Elliptic Curve Cryptosystems (ECCs) are a preferable choice for implementing cryptography,
since they offer security equivalent to that of algebraic public key algorithms while requiring only about
a tenth of the key bits. Biehl et al. [117] extend DFA to ECC and, especially, to schemes whose
security is based on the discrete logarithm problem over elliptic curve fields. Furthermore, Zheng and
Matsumoto [118] use transient and permanent faults to attack random number generators, a crucial
building block for cryptographic protocols, and the ElGamal signature scheme.
Rijndael [119] was nominated as the AES algorithm [120], the replacement of DES. The case of the
AES algorithm is quite interesting, considering that it was submitted after the introduction of SCA; thus,
its authors took all the appropriate countermeasures to ensure that the algorithm resisted all known
cryptanalysis techniques applicable to its design. The original proposal [119] even noted timing attacks
and how they could be prevented. Koeune and Quisquater [121] describe how a careless implementation
of the AES algorithm can admit a timing attack that derives the secret key used. The experiments carried out
show that the key can be derived from 3000 samples per key byte, with minimal cost and high probability.
The proposal of the algorithm is aware of this issue and immune against such a simple attack. However,
DFA has proved successful against AES. Although DFA was designed for attacking algorithms with a
Feistel structure, such as DES, Dusart et al. [122] show that it can be applied to AES, which does not have
such a structure. Four different fault-injection models are presented, and the attacks succeed for all key
sizes (128, 192, and 256 bits). Their experiments show that with ten pairs of faulty/correct messages in
hand, a 128-bit AES key can be extracted in a few minutes. Blömer et al. [123] present additional fault-based
attacks on AES. The attack assumes multiple kinds of fault models. The strictest model, requiring
exact synchronization in space and time for the error injection, succeeds in deriving a 128-bit secret key
after collecting 128 faulty ciphertexts, while the least strict model derives the 128-bit key after collecting
256 faulty ciphertexts.
17.5.4.1 Case Study: RSA-Chinese Remainder Theorem
The RSA cryptosystem remains a viable and preferable public key cryptosystem, having withstood years of
cryptanalysis [79]. The security of the RSA public key algorithm relies on the hardness of the problem of
factoring large numbers into prime factors. The elements of the algorithm are N = pq, the product of two
large prime numbers; e and d, the public and secret exponents, respectively; and the modular exponentiation
operation m^k mod N. To sign a message m, the sender computes s = m^d mod N, using his or her private
key. The receiver computes m = s^e mod N to verify the signature of the received message.
The modular exponentiation operation is computationally intensive for large primes, and it is the major
computational bottleneck in an RSA implementation. The CRT allows fast modular exponentiation. Using
RSA with CRT, the sender computes s1 = m^d mod p and s2 = m^d mod q and combines the two results, based
on the CRT, to compute S = a*s1 + b*s2 mod N for some predefined values a and b. The CRT method
is quite popular, especially for embedded systems, since it allows four times faster execution and smaller
memory storage for intermediate results (for this, observe that typically p and q have half the size of N).
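With toy parameters (far too small for real use), the CRT recombination can be checked against direct exponentiation. In practice the two half-size exponentiations also reduce the exponent to d mod (p - 1) and d mod (q - 1), which is where the fourfold speedup comes from; the sketch below assumes Python 3.8+ for the modular-inverse form of `pow`.

```python
# Toy-parameter check of RSA signing with the CRT speedup described
# above (p and q are illustrative only). The constants a and b depend
# only on p and q and are precomputed once.

p, q = 61, 53
N = p * q                            # 3233
e = 17
d = pow(e, -1, (p - 1) * (q - 1))    # private exponent

a = q * pow(q, -1, p)                # a = 1 (mod p), a = 0 (mod q)
b = p * pow(p, -1, q)                # b = 0 (mod p), b = 1 (mod q)

def sign_crt(m):
    s1 = pow(m, d % (p - 1), p)      # two half-size exponentiations
    s2 = pow(m, d % (q - 1), q)
    return (a * s1 + b * s2) % N

m = 65
assert sign_crt(m) == pow(m, d, N)   # CRT result matches direct signing
assert pow(sign_crt(m), e, N) == m   # and verifies with the public key
```

The recombination works because a is congruent to 1 mod p and 0 mod q (and b the reverse), so S agrees with s1 modulo p and with s2 modulo q, which by the CRT pins down S modulo N.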
The Bellcore attack [80,81], as it is commonly referred to, is quite simple and powerful against RSA
with CRT. It suffices to have one correct signature S for a message m and one faulty signature S', which is
caused by an incorrect computation of one of the two intermediate results s1 and s2. It does not matter
whether the error occurred in the first or the second intermediate result, or how many bits were affected
by the error. Assuming that an error indeed occurred, it suffices to compute gcd(S - S', N) to reveal one of the two prime factors.
Boneh et al. [80] propose double computations as a means to detect such erroneous computations.
However, this is not always efficient, especially in the case of resource-limited environments or where
performance is an important issue. Also, this approach is of no help in the case a permanent error has occurred.
Kaliski and Robshaw [125] propose signature verification, by checking the equality S^e mod N = m. Since
the public exponent may be quite large, this check can be rather time consuming for a resource-limited
system.
Shamir [126] describes a software method for protecting RSA with CRT from fault and timing attacks.
The idea is to use a random integer t and perform a blinded CRT by computing S_pt = m^d mod p*t
and S_qt = m^d mod q*t. If the equality S_pt = S_qt mod t holds, then the initial computation is considered
error-free and the result of the CRT can be released from the device. Yen et al. [127] further improve this
countermeasure for efficient implementation without performance penalties, but Blömer et al. [128] show
that this improvement in fact renders RSA with CRT totally insecure. Aumüller et al. [91] provide another
software implementation countermeasure for faulty RSA CRT computations. However, Yen et al. [129],
using a weak fault model, show that both these countermeasures [91,126] are still vulnerable, if the attacker
focuses on the modular reduction operation s_p = s_p mod p of the countermeasures. The attacks are valid
for both transient and permanent errors and again, appropriate countermeasures are proposed.
© 2006 by Taylor & Francis Group, LLC
Design Issues in Secure Embedded Systems 17-17
As we now show, the implementation of error-checking functions using the final or intermediate results of
RSA computations can create an additional side meta-channel, even though faulty computations never leave
a sealed device. Assume that an attacker knows that a bit in a register holding part of the key was invasively
set to zero during the computation and that the device checks the correctness of the output by double
computation. If the device outputs a signed message, then no error was detected and thus, the respective
bit of the key is zero. If the device does not output a signed message or outputs an error message, then
the respective bit of the key is one. Such a safe-error attack is presented in Reference 130, focusing on
RSA when implemented with Montgomery multiplication. Yen et al. [131] extend the idea of safe-error
attacks from memory faults to computational faults and present such an attack on RSA with Montgomery
multiplication, which can also be applied to scalar multiplication on elliptic curves.
An even simpler attack is to target both an intermediate computation and the condition check.
A condition check can be a single point of failure and an attacker can easily mount an attack against it,
provided that he or she has the means to introduce errors in computations [128]. Indeed, in most cases,
a condition check is implemented as a bit comparison with a zero flag. Blömer et al. [128] extend the ideas
of checking vulnerable points of computation by exhaustively testing every computation performed for
an RSA CRT signing, including the CRT combination. The proposed solution seems the most promising
at the moment, allowing only attacks by powerful adversaries that can solve precisely the time-space
isolation problem. However, it should already be clear that advancements in this area of cryptanalysis are
continuous, and designers should always be prepared to adapt to new attacks.
17.5.5 Passive Side-Channel Cryptanalysis
Passive side-channel cryptanalysis has received a lot of attention since its introduction in 1996 by Paul
Kocher [78]. Passive attacks are considered harder to defend against, and many people are concerned due
to their noninvasive nature. Fault-induction attacks require some form of manipulating the device and
thus, sensors or other similar means can be used to detect such actions and shut down or even zero out
the device. In the case of passive attacks, the physical characteristics of the device are just monitored,
usually with readily available probes and other hardware. So, it is not an easy task to detect the presence of
a malicious user, especially in the case where only a few measurements are required or abnormal operation
(such as continuous requests for encryptions/decryptions) cannot be identified.
The first results are by Kocher [78]. Timing variations in the execution of a cryptographic algorithm
such as Diffie-Hellman key exchange, RSA, and DSS are used to derive bit-by-bit the secret keys of these
algorithms. Although mentioned before, we should emphasize that timing attacks and other forms of
passive SCA require knowledge of the exact implementation of the cryptographic algorithm under attack.
Dhem et al. [132] describe a timing attack against the RSA signature algorithm. The attack derives a 512-bit
secret key with 200,000 to 300,000 timing measurements. Schindler et al. [133] improve the timing attacks
on RSA modular exponentiation by a factor of 50, allowing extraction of a 512-bit key using as few as 5,000
timing measurements. The approach uses an error-correction (estimator) function, which can detect
erroneous bit detections as the key extraction process evolves. Hevia and Kiwi [134] introduce a timing attack
against DES, which reveals the Hamming weight of the key, by exploiting the fact that a conditional bit
wrap-around function results in variable execution time of the software implementing the algorithm.
They succeed in recovering the Hamming weight of the key and 3.95 key bits (out of a 56-bit key). The
most threatening issue is that keys with low or high Hamming weight are sparse; so, if the attack reveals that
the key has such a weight, the key space that must be searched reduces dramatically. The RC5 algorithm
has also been subjected to timing attacks, due to conditional statement execution in its code [135].
Kocher et al. [101] extend the attacker's arsenal further by introducing the vulnerability of DES to power
analysis attacks, more specifically to Differential Power Analysis (DPA), a technique that combines
differential cryptanalysis and careful engineering, and to Simple Power Analysis (SPA). SPA refers to
power analysis attacks that can be performed by monitoring only a single or a few power traces, probably
with the same encryption key. SPA succeeds in revealing the operations performed by the device, such as
permutations, comparisons, and multiplications. Practically, any algorithm implementation that executes
INPUT: M, N, d = (d_{n-1} d_{n-2} ... d_1 d_0)_2
OUTPUT: S = M^d mod N

S = 1;
for (i = n-1; i >= 0; i--) {
    S = S * S mod N;
    if (d_i == 1) {
        S = S * M mod N;
    }
}
return S;

FIGURE 17.1 Left-to-right repeated square-and-multiply algorithm.
some statements conditionally, based on data or key material, is at least susceptible to power analysis
attacks. This holds for public key, secret key, and ECC algorithms. DPA has been successfully applied at least to block
ciphers, such as IDEA, RC5, and DES [83].
Electromagnetic Attacks (EMAs) have contributed some impressive results on what information can
be reconstructed. Gandolfi et al. [136] report results from cryptanalysis of real-world cryptosystems, such
as DES and RSA. Furthermore, they demonstrate that electromagnetic emissions may be preferable to
power analysis, in the sense that fewer traces are needed to mount an attack and these traces carry richer
information to derive the secret keys. However, the full power of EMA attacks has not been utilized yet
and we should expect more results on real-world cryptanalysis of popular algorithms.
17.5.5.1 Case Study: RSA with Montgomery Multiplication
Previously, we explained the importance of a fast modular exponentiation primitive for the RSA
cryptosystem. Montgomery multiplication is a fast implementation of this primitive function [137].
The left-to-right repeated square-and-multiply method is depicted in Figure 17.1, in C pseudocode.
The timing attack of Kocher [78] exploits the timing variation caused by the conditional statement
in the loop. If the respective bit of the secret exponent is 1, then both a square and a multiply
operation are executed, while if the bit is 0 only a square operation is performed. In summary, the exact
time of executing the loop n times depends only on the exact values of the bits of the secret exponent.
An attacker proceeds as follows. Assume that the first m bits of the secret exponent are known. The
attacker has a device identical to the one containing the secret exponent and can control the key used for
each encryption. The attacker collects from the attacked device the total execution time T_1, T_2, ..., T_k of
each signature operation on some known messages, M_1, M_2, ..., M_k. He also performs the same operation
on the controlled device, so as to collect another set of measurements, t_1, t_2, ..., t_k, where he fixes the m first
bits of the key, targeting the (m+1)th bit. Kocher's key observation is that, if the unknown bit d_{m+1} = 1,
then the two sets of measurements are correlated. If d_{m+1} = 0, then the two sets behave like independent
random variables. This differentiation allows the attacker to extract the secret exponent bit-by-bit.
Depending on the implementation, a simpler form of the attack can be mounted. SPA does
not require lengthy statistical computations but rather relies on power traces of execution profiles of
a cryptographic algorithm. For this example, Schindler et al. [133] explain how the power profiles can
be used. When a key bit is 1, the above code performs an additional multiplication. Even if the spikes
in power consumption of the squaring and multiplication operations are indistinguishable, the multiplication
requires additional load operations and thus, power spikes will be wider than in the case where
only squaring is performed.
17.5.6 Countermeasures
In the previous sections we provided a review of SCA, both fault-based and passive. In this section, we
review the countermeasures that have been proposed. The list is not exhaustive, since new results appear
continuously and countermeasures are steadily improving.
The proposed countermeasures can be classified into two main classes: hardware protection mechanisms
and mathematical protection mechanisms. A first layer of protection against SCA consists of hardware
protection layers, such as passivation layers that do not allow direct access between a (malicious) user and the system
implementing the cryptographic algorithm, or memory address bus obfuscation. Various sensors can also
be embodied in the device, in order to detect and react to abnormal environmental conditions, such as
extreme temperatures, power, and clock variations. Such mechanisms are widely employed in smartcards
for financial transactions and other high-risk applications. Such protection layers can be effective against
fault-injection attacks, since they shield the device against external manipulation. However, they cannot
protect the device from attacks based on external observation, such as power analysis techniques.
The previous countermeasures do not alter the current designs of the circuits, but rather add protection
layers on top of them. A second approach is the design of a new generation of chips to implement
cryptographic algorithms and to process sensitive information. Such circuits have asynchronous/
self-clocking/dual-rail logic; each part of the circuit may be clocked independently [138]. Fault attacks
that rely on external clock manipulation (such as glitch attacks) are not feasible in this case. Furthermore,
timing or power analysis attacks become harder for the attacker, since there is no global clock that correlates
the input data and the emitted power. Such countermeasures have the potential to become common
practice. Their application, however, must be carefully evaluated, since they may occupy a large area of the
circuit; such expansions are usually justified by manufacturers in order to increase the system's available
memory and not to implement another security feature. Furthermore, such mechanisms require changes
in the production line, which is not always feasible.
A third approach aims to implement the cryptographic algorithms so that no key information leaks.
Proposed approaches include modifying the algorithm to run in constant time, adding random delays
to the execution of the algorithm, randomizing the exact sequence of operations without affecting the
final result, and adding dummy operations to the execution of the algorithm. These countermeasures
can defeat timing attacks, but careful design must be employed to defeat power analysis attacks too. For
example, dummy operations or random delays are easily distinguishable in a power trace, since they tend
to consume less power than ordinary cryptographic operations. Furthermore, differences in power traces
between profiles of known operations can also reveal permutation of operations. For example, a modular
multiplication is known to consume more power than a simple addition, so if the execution order is
interchanged, the operations will still be identifiable.
In more resource-rich systems, where high-level programming languages are used, compiler or human
optimizations can remove these artifacts from the program or change the implementation, resulting in
vulnerability to SCA. The same holds if memory caches are used and the algorithm is implemented
so that the latency between cache and main memory can be detected, either by timing or power traces.
Insertion of random delays or other forms of noise should also be considered carefully, because a large
mean delay translates directly to reduced performance, which is not always acceptable.
The second class of countermeasures focuses on the mathematical strengthening of the algorithms
against such attacks. The RSA blinding technique by Shamir [126] is such an example; the proposed
method guards the system from leaking meaningful information, because the leaked information is
related to the random number used for blinding instead of the key; thus, even if the attacker manages
to reveal a number, this will be the random number and not the key. It should be noted, however, that
a different random number is used for each signing or encryption operation. Thus, the faults injected
into the system will be applied to a different, random number every time and the collected information is
useless.
At a crossline between mathematical and implementation protection, it has been proposed to check
cryptographic operations for correctness, in case of fault-injection attacks. However, these checks can also be
exploited as side channels of information or can degrade performance significantly. For example, double
computations and comparison of the results halve the throughput an implementation can achieve;
furthermore, in the absence of other countermeasures, the comparison function can be bypassed (e.g., by
a clock glitch or a fault injection in the comparison function) or used as a side channel as well. If multiple
checks are employed, measuring the rejection time can reveal in what stage of the algorithm the error
occurred; if the checks are independent, this can be utilized to extract the secret key, even when the
implementation does not output the faulty computation [111,139].
17.6 Conclusions
Security constitutes a significant requirement in modern embedded computing systems. Their widespread
use in services that involve sensitive information, in conjunction with their resource limitations, has led to
a significant number of innovative attacks that exploit system characteristics and result in loss of critical
information. Development of secure embedded systems is an emerging field in computer engineering
requiring skills from cryptography, communications, hardware, and software.
In this chapter, we surveyed the security requirements of embedded computing systems and described
the technologies that are more critical to them, relative to general-purpose computing systems. Considering
the innovative system (side-channel) attacks that were developed with the motivation to break secure
embedded systems, we presented in detail the known SCA and described the technologies for countermeasures
against the known attacks. Clearly, the technical area of secure embedded systems is far from
mature. Innovative attacks and successful countermeasures are continuously emerging, promising an
attractive and rich technical area for research and development.
References
[1] W. Wolf, Computers as Components: Principles of Embedded Computing Systems Design. Elsevier,
Amsterdam, 2000.
[2] W. Freeman and E. Miller, An experimental analysis of cryptographic overhead in
performance critical systems. In Proceedings of the Seventh International Symposium on
Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 1999, p. 348.
[3] S. Ravi, P. Kocher, R. Lee, G. McGraw, and A. Raghunathan, Security as a new dimension in
embedded system design. In Proceedings of the 41st Annual Conference on Design Automation,
2004, pp. 753-760.
[4] S. Ravi, A. Raghunathan, P. Kocher, and S. Hattangady, Security in embedded systems: design
challenges. Transactions on Embedded Computing Systems, 3, 461-491, 2004.
[5] D.G. Abraham, G.M. Dolan, G.P. Double, and J.V. Stevens, Transaction security system. IBM
Systems Journal, 30, 206-229, 1991.
[6] S.H. Weingart, Physical security devices for computer subsystems: a survey of attacks and defenses.
In Cryptographic Hardware and Embedded Systems - CHES 2000: Second International Workshop,
2000, p. 302.
[7] R. Anderson and M. Kuhn, Tamper resistance - a cautionary note. In Proceedings of the Second
Usenix Workshop on Electronic Commerce, 1996, pp. 1-11.
[8] S. Blythe, B. Fraboni, S. Lall, H. Ahmed, and U. de Riu, Layout reconstruction of complex silicon
chips. IEEE Journal of Solid-State Circuits, 28, 138-145, 1993.
[9] C.E. Landwehr, A.R. Bull, J.P. McDermott, and W.S. Choi, A taxonomy of computer program
security flaws. ACM Computing Surveys, 26, 211-254, 1994.
[10] G. Hoglund and G. McGraw, Exploiting Software: How to Break Code. Addison-Wesley Professional,
Reading, MA, 2004.
[11] J.J. Tevis and J.A. Hamilton, Methods for the prevention, detection and removal of software
security vulnerabilities. In Proceedings of the 42nd Annual Southeast Regional Conference, 2004,
pp. 197-202.
[12] P. Kocher, SSL 3.0 specification. http://wp.netscape.com/eng/ssl3/
[13] IETF, IPSec working group. http://www.ietf.org/html.charters/ipsec-charter.html
[14] T. Wollinger, J. Guajardo, and C. Paar, Security on FPGAs: state-of-the-art implementations and
attacks. Transactions on Embedded Computing Systems, 3, 534-574, 2004.
[15] S.H. Gunther, F. Binns, D.M. Carmean, and J.C. Hall, Managing the impact of
increasing microprocessor power consumption. Intel Technology Journal, Q1: 9, 2001.
http://developer.intel.com/technology/itj/q12001/articles/art_4.htm
[16] I. Buchmann, Batteries in a Portable World, 2nd ed. Cadex Electronics Inc, May 2001.
[17] K. Lahiri, S. Dey, D. Panigrahi, and A. Raghunathan, Battery-driven system design: a new frontier
in low power design. In Proceedings of the 2002 Conference on Asia South Pacific Design
Automation/VLSI Design, 2002, p. 261.
[18] T. Martin, M. Hsiao, D. Ha, and J. Krishnaswami, Denial-of-service attacks on battery-powered
mobile computers. In Proceedings of the Second IEEE International Conference on Pervasive
Computing and Communications (PerCom'04), 2004, p. 309.
[19] N.R. Potlapally, S. Ravi, A. Raghunathan, and N.K. Jha, Analyzing the energy consumption of
security protocols. In Proceedings of the 2003 International Symposium on Low Power Electronics
and Design, 2003, pp. 30-35.
[20] D.W. Carman, P.S. Kruus, and B.J. Matt, Constraints and approaches for distributed
sensor network security. NAI Labs, Technical report 00-110, 2000. Available
at: http://www.cs.umbc.edu/courses/graduate/CMSC691A/Spring04/papers/nailabs_report_00-
010_final.pdf
[21] A. Raghunathan, S. Ravi, S. Hattangady, and J. Quisquater, Securing mobile appliances: new
challenges for the system designer. In Design, Automation and Test in Europe Conference and
Exhibition (DATE'03). IEEE, 2003, p. 10176.
[22] V. Raghunathan, C. Schurgers, S. Park, and M. Srivastava, Energy aware wireless microsensor
networks. IEEE Signal Processing Magazine, 19, 40-50, 2002.
[23] A. Savvides, S. Park, and M.B. Srivastava, On modeling networks of wireless microsensors.
In Proceedings of the 2001 ACM SIGMETRICS International Conference on Measurement and
Modeling of Computer Systems, 2001, pp. 318-319.
[24] Rockwell Scientific, Wireless integrated networks systems. http://wins.rsc.rockwell.com
[25] A. Hodjat and I. Verbauwhede, The energy cost of secrets in ad-hoc networks (Short paper).
http://citeseer.ist.psu.edu/hodjat02energy.html
[26] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C. John Wiley &
Sons, New York, 1995.
[27] N. Daswani and D. Boneh, Experimenting with electronic commerce on the PalmPilot.
In Proceedings of the Third International Conference on Financial Cryptography, 1999,
pp. 1-16.
[28] A. Perrig, J. Stankovic, and D. Wagner, Security in wireless sensor networks. Communications of
the ACM, 47, 53-57, 2004.
[29] S. Ravi, A. Raghunathan, and N. Potlapally, Securing wireless data: system architecture challenges.
In Proceedings of the 15th International Symposium on System Synthesis, 2002, pp. 195-200.
[30] IEEE 802.11 Working Group, IEEE 802.11 wireless LAN standards. http://grouper.ieee.org/
groups/802/11/
[31] 3GPP, 3G Security; Security Architecture. 3GPP Organization, TS 33.102, 30-09-2003,
Rel-6, 2003.
[32] Intel Corporation, VPN and WEP, wireless 802.11b security in a corporate environment.
http://www.intel.com/business/bss/infrastructure/security/vpn_wep.htm
[33] NIST, FIPS PUB 140-2 security requirements for cryptographic modules. Available at
http://csrc.nist.gov/cryptval/140-2.htm
[34] J. Lach, W.H. Mangione-Smith, and M. Potkonjak, Fingerprinting digital circuits on programmable
hardware. In Information Hiding: Second International Workshop, IH'98, Vol. 1525 of
Lecture Notes in Computer Science, Springer-Verlag, 1998, pp. 16-31.
[35] J. Burke, J. McDonald, and T. Austin, Architectural support for fast symmetric-key cryptography.
In Proceedings of the Ninth International Conference on Architectural Support for Programming
Languages and Operating Systems, 2000, pp. 178-189.
[36] L. Wu, C. Weaver, and T. Austin, CryptoManiac: a fast flexible architecture for secure communication.
In Proceedings of the 28th Annual International Symposium on Computer Architecture,
2001, pp. 110-119.
[37] Infineon, SLE 88 Family Products. http://www.infineon.com/
[38] ARM, ARM SecurCore Family, Vol. 2004. http://www.arm.com/products/CPUs/securcore.html
[39] S. Moore, Enhancing Security Performance Through IA-64 Architecture, 2000. Intel Corp.,
http://www.intel.com/cd/ids/developer/asmo-na/eng/microprocessors/itanium/index.htm
[40] N. Potlapally, S. Ravi, A. Raghunathan, and G. Lakshminarayana, Optimizing public-key encryp-
tion for wireless clients. In Proceedings of the IEEE International Conference on Communications,
May 2002.
[41] MIPS Inc., SmartMIPS Architecture, Vol. 2004. http://www.mips.com/ProductCatalog/
P_SmartMIPSASE/productBrief
[42] Open Mobile Alliance, http://www.wapforum.org/what/technical.htm
[43] Mobile Electronic Transactions, http://www.mobiletransaction.org/
[44] Z. Shao, C. Xue, Q. Zhuge, E.H. Sha, and B. Xiao, Security protection and checking in embedded
system integration against buffer overflow attacks. In Proceedings of the International Conference
on Information Technology: Coding and Computing (ITCC'04), Vol. 2, 2004, p. 409.
[45] S. Biswas, M. Simpson, and R. Barua, Memory overflow protection for embedded systems using
run-time checks, reuse and compression. In Proceedings of the 2004 International Conference on
Compilers, Architecture, and Synthesis for Embedded Systems, 2004, pp. 280-291.
[46] J. You, Wai-Kin Kong, D. Zhang, and King Hong Cheung, On hierarchical palmprint coding with
multiple features for personal identification in large databases. IEEE Transactions on Circuits and
Systems for Video Technology, 14, 234-243, 2004.
[47] K.C. Chan, Y.S. Moon, and P.S. Cheng, Fast fingerprint verification using subregions of
fingerprint images. IEEE Transactions on Circuits and Systems for Video Technology, 14,
95-101, 2004.
[48] A.K. Jain, A. Ross, and S. Prabhakar, An introduction to biometric recognition. IEEE Transactions
on Circuits and Systems for Video Technology, 14, 4-20, 2004.
[49] Y.S. Moon, H.C. Ho, and K.L. Ng, A secure smart card system with biometrics capability.
In Proceedings of the IEEE 1999 Canadian Conference on Electrical and Computer Engineering,
1999, pp. 261-266.
[50] T.Y. Tang, Y.S. Moon, and K.C. Chan, Efficient implementation of fingerprint verification
for mobile embedded systems using fixed-point arithmetic. In Proceedings of the 2004 ACM
Symposium on Applied Computing, 2004, pp. 821-825.
[51] L. Benini, A. Macii, and M. Poncino, Energy-aware design of embedded memories: a survey of
technologies, architectures, and optimization techniques. Transactions on Embedded Computing
Systems, 2, 5-32, 2003.
[52] Scott Rosenthal, Serial EEPROMs provide secure data storage for embedded systems. SLTF
Consulting, http://www.sltf.com/articles/pein/pein9101.htm
[53] Actel Corporation, Design security in nonvolatile flash and antifuse FPGAs. Technical report
5172163-0/11.01, 2001.
[54] Trusted Computing Group: Home. TCG , https://www.trustedcomputinggroup.org/home
[55] D.N. Serpanos and R.J. Lipton, Defense against man-in-the-middle attack in client-server systems
with secure servers. In Proceedings of IEEE ISCC 2001. Hammammet, Tunisia, July 3-5, 2001,
pp. 9-14.
[56] R.J. Lipton, S. Rajagopalan, and D.N. Serpanos, Spy: a method to secure clients for network
services. Proceedings of the 22nd International Conference on Distributed Computing Systems
Workshops (Workshop ADSN 2002). Vienna, Austria, July 2-5, 2002, pp. 23-28.
[57] T.S. Messerges and E.A. Dabbish, Digital rights management in a 3G mobile phone and
beyond. In Proceedings of the 2003 ACM Workshop on Digital Rights Management, 2003,
pp. 27-38.
[58] D.L.C. Thekkath, M. Mitchell, P. Lincoln, D. Boneh, J. Mitchell, and M. Horowitz, Architectural
support for copy and tamper resistant software. In Proceedings of the Ninth International
Conference on Architectural Support for Programming Languages and Operating Systems, 2000,
pp. 168-177.
[59] J.H. Saltzer and M.D. Schroeder, The protection of information in computer systems.
Proceedings of the IEEE, 63, 1278-1308, 1975.
[60] T. King, Security+ Training Guide. Que Certification, 2003.
[61] Kingpin and Mudge, Security analysis of the Palm operating system and its weaknesses
against malicious code threats. In Proceedings of the 10th Usenix Security Symposium, 2001,
pp. 135-152.
[62] A.D. Rubin and D.E. Geer Jr., Mobile code security. IEEE Internet Computing, 2,
30-34, 1998.
[63] V.N. Venkatakrishnan, R. Peri, and R. Sekar, Empowering mobile code using expressive security
policies. In Proceedings of the 2002 Workshop on New Security Paradigms, 2002, pp. 61-68.
[64] G.C. Necula, Proof-carrying code. In Proceedings of the 24th ACM SIGPLAN-SIGACT Symposium
on Principles of Programming Languages (POPL '97), 1997, pp. 106-119.
[65] P. Gutmann, Lessons learned in implementing and deploying crypto software. In Proceedings of
the 11th USENIX Security Symposium, 2002, pp. 315-325.
[66] R.J. Anderson, Why cryptosystems fail. In Proceedings of ACM CCS '93, ACM Press, pp. 215-217,
November 1993.
[67] Andrew J. Clark, Physical protection of cryptographic devices. In Proceedings of Eurocrypt '87,
1987, pp. 83-93.
[68] D. Chaum, Design concepts for tamper-responding systems. In Advances in Cryptology: Proceedings
of Crypto 83, 1983, pp. 387-392.
[69] S.H. Weingart, S.R. White, W.C. Arnold, and G.P. Double, An evaluation system for the physical
security of computing systems. In Proceedings of the Sixth Annual Computer Security Applications
Conference, 1990, pp. 232-243.
[70] IBM Corporation, IBM PCI Cryptographic Coprocessor, September, 2004. Available at http://
www-3.ibm.com/security/cryptocards/html/pcicc.shtml
[71] EFF, U.S. v. ElcomSoft and Sklyarov FAQ, September, 2004. Available at http://www.eff.org/IP/
DMCA/US_v_Elcomsoft/us_v_sklyarov_faq.html
[72] A. Huang, Keeping secrets in hardware: the Microsoft Xbox case study. In Revised Papers from
the Fourth International Workshop on Cryptographic Hardware and Embedded Systems, 2003,
pp. 213-227.
[73] Kingpin, Attacks on and countermeasures for USB hardware token devices. In Proceedings of the
Fifth Nordic Workshop on Secure IT Systems Encouraging Co-operation, 2000, pp. 135-151.
[74] D.S. Touretzky, Gallery of CSS Descramblers, September 2004. Available at http://www.cs.
cmu.edu/dst/DeCSS/Gallery
[75] S.L. Garfinkel and A. Shelat, Remembrance of data passed: a study of disk sanitization practices.
IEEE Security and Privacy Magazine, 1, 17-27, 2003.
[76] S. Skorobogatov, Low temperature data remanence in static RAM. Technical report UCAM-CL-
TR-536, University of Cambridge, 2002.
[77] P. Gutmann, Data remanence in semiconductor devices. In Proceedings of the 10th USENIX Security
Symposium, 2001.
[78] P.C. Kocher, Timing attacks on implementations of Diffie-Hellman, RSA, DSS, and other systems.
In Proceedings of CRYPTO '96, Lecture Notes in Computer Science, 1996, pp. 104-113.
[79] D. Boneh, Twenty years of attacks on the RSA cryptosystem. Notices of the American Mathematical
Society (AMS), 46, 203-213, 1999.
[80] Dan Boneh, Richard A. DeMillo, and Richard J. Lipton, On the importance of checking cryptographic
protocols for faults. In Proceedings of Eurocrypt '97, Vol. 1233 of Lecture Notes in
Computer Science, 1997, pp. 37-51.
[81] Dan Boneh, Richard A. DeMillo, and Richard J. Lipton, On the importance of eliminating errors
in cryptographic computations. Journal of Cryptology: The Journal of the International Association
for Cryptologic Research, 14, 101-119, 2001.
[82] F. Bao, R.H. Deng, Y. Han, A.B. Jeng, A.D. Narasimhalu, and T. Ngair, Breaking public key
cryptosystems on tamper resistant devices in the presence of transient faults. In Proceedings of
the Fifth International Workshop on Security Protocols, 1998, pp. 115-124.
[83] John Kelsey, Bruce Schneier, David Wagner, and Chris Hall, Side channel cryptanalysis of product
ciphers. In Proceedings of ESORICS 1998, 1998, pp. 97-110.
[84] Markus G. Kuhn, Compromising emanations: eavesdropping risks of computer displays.
Technical report UCAM-CL-TR-577, University of Cambridge, December 2003.
[85] Intel Corporation, Analysis of the floating point flaw in the Pentium processor.
November 1994. Available at http://support.intel.com/support/processors/pentium/fdiv/wp/
(September 2004).
[86] Jean-Jacques Quisquater and David Samyde, ElectroMagnetic analysis (EMA): measures and
countermeasures for smart cards. In Proceedings of the International Conference on Research in
Smart Cards, E-Smart 2001, Lecture Notes in Computer Science, 2001, pp. 200-210.
[87] Oliver Kmmerling and Markus G. Kuhn, Design principles for tamper-resistant smartcard
processors. In Proceedings of the USENIX Workshop on Smartcard Technology (Smartcard 99).
USENIX Association, Chicago, IL, May 1011, 1999, pp. 920.
[88] D.P. Maher, Fault induction attacks, tamper resistance, and hostile reverse engineering in per-
spective. In Proceedings of the First International Conference on Financial Cryptography, 1997,
pp. 109122.
[89] Ross J. Anderson and Markus G. Kuhn, Low cost attacks on tamper resistant devices. InProceedings
of the Fifth International Security Protocols Conference, Vol. 1361 of Lecture Notes on Computer
Science. M. Lomas et al. Ed. Springer-Verlag, Paris, France, April 79, 1997, pp. 125136.
[90] Ross J. Anderson, Security Engineering: A Guide to Building Dependable Distributed Systems. John
Wiley & Sons, New York, 2001.
[91] C. Aumller, P. Bier, W. Fischer, P. Hofreiter, and J. Seifert, Fault attacks on RSA with CRT:
concrete results and practical countermeasures. In Revised Papers from the Fourth International
Workshop on Cryptographic Hardware and Embedded Systems, Springer-Verlag, 2003, pp. 260275.
[92] David Samyde, Sergei Skorobogatov, Ross Anderson, and Jean-Jacques Quisquater, On a new
way to read data from memory. In Proceedings of CHES 2002, Lecture Notes in Computer
Science, 2003.
[93] Hagai Bar-El, Hamid Choukri, David Naccache, Michael Tunstall, and Claire Whelan, The
Sorcerers apprentice guide to fault attacks. In Workshop on Fault Diagnosis and Tolerance in
Cryptography, 2004.
[94] Artemios G. Voyiatzis and Dimitrios N. Serpanos, Active hardware attacks and proactive
countermeasures. In Proceedings of IEEE ISCC 2002, 2002.
[95] Markus G. Kuhn, Optical time-domain eavesdropping risks of CRT displays. In Proceedings of
the IEEE Symposium on Security and Privacy, 2002, pp. 318.
[96] Markus G. Kuhn, Electromagnetic eavesdropping risks of at-panel displays. Presented at the
Fourth Workshop on Privacy Enhancing Technologies, May 2628, 2004, Toronto, Canada.
[97] J. Loughry and D.A. Umphress, Information leakage from optical emanations. ACM Transactions
on Information and SystemSecurity, 5, 262289, 2002.
[98] P. Wright, Spycatcher: The Candid Autobiography of a Senior Intelligence Ofcer. Viking, NY, 1987.
[99] Adi Shamir and Eran Tromer, Acoustic cryptanalysis on noisy people and noisy
machines. In Eurocrypt 2004 Rump Session Presentation, September, 2004. Available at
http://www.wisdom.weizmann.ac.il/tromer/acoustic/
[100] J. Coron, D. Naccache, and P. Kocher, Statistics and secret leakage. ACMTransactions on Embedded
Computing Systems, 3, 492508, 2004.
2006 by Taylor & Francis Group, LLC
Design Issues in Secure Embedded Systems 17-25
[101] P. Kocher, J. Jaffe, and B. Jun, Differential power analysis. In Proceedings of the CRYPTO 99, IACR,
1999, pp. 388397.
[102] Josyula R. Rao, and Pankaj Rohatgi, Empowering side-channel attacks. IACR Crypto-
graphy ePrint Archive: report 2001/037, September, 2004. Available at http://eprint.iacr.org/
2001/037/
[103] Mehdi-Laurent Akkar, Rgis Bevan, Paul Dischamp, and Didier Moyar, Power analysis, what is
now possible. In Advances in Cryptology ASIACRYPT 2000: 6th International, Springer-Verlag,
2000, pp. 489502.
[104] Thomas S. Messerges, Ezzy A. Dabbish, and Robert H. Sloan, Investigation of power ana-
lysis attacks on smartcards. In Proceedings of USENIX Workshop on Electronic Commerce, 1999,
pp. 151161.
[105] D. Page, Theoretical use of cache memory as a cryptanalytic side-channel. Technical report
CSTR-02-003, Computer Science Department, University of Bristol, Bristol, 2002.
[106] Dawn Xiaodong Song, David Wagner, and Xuqing Tian, Timing analysis of keystrokes and timing
attacks on SSH. In Proceedings of the 10th USENIX Security Symposium, USENIX Association,
2001.
[107] E.W. Felten and M.A. Schneider, Timing attacks on web privacy. In Proceedings of the Seventh
ACM Conference on Computer and Communications Security, ACM Press, 2000, pp. 2532.
[108] David Brumley and Dan Boneh, Remote timing attacks are practical. In Proceedings of the 12th
USENIX Security Symposium, 2003.
[109] J. Marc and Q. Jean-Jacques, Faulty RSA encryption. Technical report CG-1997/8, UCL Crypto
Group, 1997.
[110] Marc Joye and Jean-Jacques Quisquater, Attacks on systems using Chinese remaindering.
Technical report CG1996/9, UCL Crypto Group, Belgium, 1996.
[111] Vlastimil Klma and Tom Rosa, Further results and considerations on side channel attacks
on RSA. IACR Cryptography ePrint Archive: report 2002/071, September 2004. Available at
http://eprint.iacr.org/2002/071/
[112] Vlastimil and Tom Rosa, Attack on private signature keys of the OpenPGP format, PGP(TM)
programs and other applications compatible with OpenPGP. IACR Cryptology ePrint Archive
report 2002/073, IACR, September 2004. Available at http://eprint.iacr.org/2002/076.pdf
[113] A.G. Voyiatzis and D.N. Serpanos, A fault-injection attack on Fiat-Shamir cryptosystems.
In Proceedings of the 24th International Conference on Distributed Computing Systems Workshops
(ICDCS 2004 Workshops), 2004, pp. 618621.
[114] Eli Biham and Adi Shamir, Differential fault analysis of secret key cryptosystems. Lecture Notes
in Computer Science. Springer-Verlag, 1294, 513525, 1997.
[115] P. Paillier, Evaluating differential fault analysis of unknown cryptosystems. In Proceedings of
the Second International Workshop on Practice and Theory in Public Key Cryptography, 1999,
pp. 235244.
[116] M. Jacob, D. Boneh, and E. Felten, Attacking an obfuscated cipher by injecting faults. In
Proceedings of the 2002 ACMWorkshop on Digital Rights Management, 2002.
[117] Ingrid Biehl, Bernd Meyer, andVoker Mller, Differential fault attacks on elliptic curve cryptosys-
tems. In Proceedings of CRYPTO 2000, Vol. 1880 of Lecture Notes in Computer Science, 2000,
pp. 131146.
[118] Y. Zheng and T. Matsumoto, Breaking real-world implementations of cryptosystems by manipu-
lating their random number generation. In Proceedings of the 1997 Symposium on Cryptography
and Information Security, 1997.
[119] Joan Daemen and Vincent Rijmen, The block cipher Rijndael. In Proceedings of Smart Card
Research and Applications 2000, Lecture Notes in Computer Science, 2000, pp. 288296.
[120] NIST, NIST, Advanced Encryption Standard (AES), Federal Information Processing Standards
Publication 1997, November 26, 2001.
2006 by Taylor & Francis Group, LLC
17-26 Embedded Systems Handbook
[121] Franois Koeune and Jean-Jacques Quisquater, A timing attack against Rijndael. Technical report
CG-1999/1, Universite Catholique de Louvain, 1999.
[122] P. Dusart, L. Letourneux, and O. Vivolo, Differential fault analysis on AES. In Proceedings of
the International Conference on Applied Cryptography and Network Security, Lecture Notes in
Computer Science, 2003, pp. 293306.
[123] Johaness Blmer and Jean-Pierre Seifert, Fault-based cryptanalysis of the advanced encryption
standard (AES). In Financial Cryptography 2003, Vol. 2742 of Lecture Notes in Computer Science,
2003, pp. 162181.
[124] Arjen Lenstra, Memo on RSA signature generation in the presence of faults. September 28, 1996.
(Manuscript, available from the author.)
[125] B. Kaliski and M.J.B. Robshaw, Comments on some new attacks on cryptographic devices.
RSA Laboratories Bulletin, 5 July, 1997.
[126] Adi Shamir, Method and apparatus for protecting public key schemes from timing and fault
attacks. US Patent No. 5,991,415, United States Patent and Trademark Ofce, November 23, 1999.
[127] S. Yen, S. Kim, S. Lim, and S. Moon, RSA speedup with residue number system immune against
hardware fault cryptanalysis. In Proceedings of the Fourth International Conference on Information
Security and Cryptology, Seoul, 2002, pp. 397413.
[128] J. Blmer, M. Otto, and J. Seifert, A new CRT-RSA algorithm secure against bellcore attacks.
In Proceedings of the 10th ACM Conference on Computer and Communication Security, 2003,
pp. 311320.
[129] Sung-Ming Yen, Sangjae Moon, and Jae-Cheol Ha, Hardware fault attack on RSA with
CRT revisited. In Proceedings of ICISC 2002, Lecture Notes in Computer Science, 2003,
pp. 374388.
[130] S. Yen and M. Joye, Checking before output may not be enough against fault-based cryptanalysis.
IEEE Transactions on Computers, 49, 967970, 2000.
[131] S. Yen, S. Kim, S. Lim, and S. Moon, A countermeasure against one physical cryptanalysis
may benet another attack. In Proceedings of the Fourth International Conference on Information
Security and Cryptology, Seoul, 2002, pp. 414427.
[132] J. Dhem, F. Koeune, P. Leroux, P. Mestr, J. Quisquater, and J. Willems, Apractical implementation
of the timing attack. In Proceedings of the International Conference on Smart Card Research and
Applications, 1998, pp. 167182.
[133] Werner Schindler, Franois Koeune, and Jean-Jacques Quisquater, Unleashing the full power of
timing attack. UCL Crypto Group Technical report CG-2001/3, Universite Catholique de Louvain
2001.
[134] A. Hevia and M. Kiwi, Strength of two data encryption standard implementations under timing
attacks. ACMTransactions on Information and SystemSecurity, 2, 416437, 1999.
[135] Helena Handschuh and Heys Howard, A timing attack on RC5. In Proceedings of the Fifth Annual
International Workshop on Selected Areas in Cryptography, SAC98, 1998.
[136] K. Gandol, C. Mourtel, and F. Olivier, Electromagnetic analysis: concrete results. In Proceedings
of the Third International Workshop on Cryptographic Hardware and Embedded Systems, 2001,
pp. 251261.
[137] K. Ko, T. Acar, and B.S. Kaliski Jr., Analyzing and comparing montgomery multiplication
algorithms. IEEE Micro, 16, 2633, 1996.
[138] Simon Moore, Ross Anderson, Paul Cunningham, Robert Mullins, and George Taylor, Improving
smart card security using self-timed circuits. In Proceedings of the Eighth International Symposium
on Advanced Research in Asynchronous Circuits and Systems, 2002.
[139] Kouichi Sakurai and Tsuyoshi Takagi, A reject timing attack on an IND-CCA2 public-key
cryptosystem. In Proceedings of ICISC 2002, Lecture Notes in Computer Science, 2003.
[140] S.P. Skorobogatov and R.J. Anderson, Optical fault induction attacks. In Revised Papers from
the Fourth International Workshop on Cryptographic Hardware and Embedded Systems, 2003,
pp. 212.
2006 by Taylor & Francis Group, LLC
II
System-on-Chip Design
18 System-on-Chip and Network-on-Chip Design
Grant Martin
19 A Novel Methodology for the Design of Application-Specific
Instruction-Set Processors
Andreas Hoffmann, Achim Nohl, and Gunnar Braun
20 State-of-the-Art SoC Communication Architectures
José L. Ayala, Marisa López-Vallejo, Davide Bertozzi, and Luca Benini
21 Network-on-Chip Design for Gigascale Systems-on-Chip
Davide Bertozzi, Luca Benini, and Giovanni De Micheli
22 Platform-Based Design for Embedded Systems
Luca P. Carloni, Fernando De Bernardinis, Claudio Pinello,
Alberto L. Sangiovanni-Vincentelli, and Marco Sgroi
23 Interface Specification and Converter Synthesis
Roberto Passerone
24 Hardware/Software Interface Design for SoC
Wander O. Cesário, Flávio R. Wagner, and A.A. Jerraya
25 Design and Programming of Embedded Multiprocessors:
An Interface-Centric Approach
Pieter van der Wolf, Erwin de Kock, Tomas Henriksson, Wido Kruijtzer, and Gerben Essink
26 A Multiprocessor SoC Platform and Tools for Communications Applications
Pierre G. Paulin, Chuck Pilkington, Michel Langevin, Essaid Bensoudane, Damien Lyonnard,
and Gabriela Nicolescu
© 2006 by Taylor & Francis Group, LLC
18
System-on-Chip and
Network-on-Chip
Design
Grant Martin
Tensilica Inc.
18.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-1
18.2 System-on-a-Chip . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-2
18.3 System-on-a-Programmable-Chip . . . . . . . . . . . . . . . . . . . . . 18-2
18.4 IP Cores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-4
18.5 Virtual Components . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-5
18.6 Platforms and Programmable Platforms . . . . . . . . . . . . . . . 18-5
18.7 Integration Platforms and SoC Design. . . . . . . . . . . . . . . . . 18-6
18.8 Overview of the SoC Design Process . . . . . . . . . . . . . . . . . . . 18-7
18.9 System-Level Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-10
18.10 Interconnection and Communication Architectures
for SoC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-11
18.11 Computation and Memory Architectures for SoC . . . . 18-11
18.12 IP Integration Quality and Certification Methods
and Standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-12
18.13 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-12
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18-13
18.1 Introduction
System-on-Chip (SoC) is a phrase that has been much talked about in recent years [1]. It is more than
a design style, more than an approach to the design of Application-Specific Integrated Circuits (ASICs),
more than a methodology. Rather, SoC represents a major revolution in IC design: a revolution enabled
by advances in process technology that allow the integration of all or most of the major components and
subsystems of an electronic product onto a single chip, or integrated chipset [2]. This revolution in design
has been embraced by many designers of complex chips, as the performance, power consumption, cost,
and size advantages of using the highest available level of integration have proven to be extremely
important for many designs. In fact, the design and use of SoCs is arguably one of the key problems in
designing real-time embedded systems.
The move to SoC began sometime in the mid-1990s. At this point, the leading CMOS-based semiconductor
process technologies of 0.35 and 0.25 μm were sufficiently capable of allowing the integration of
many of the major components of a second-generation wireless handset or a digital set-top box onto a
single chip. The digital baseband functions of a cell phone, namely a Digital Signal Processor (DSP),
hardware (HW) support for voice encoding and decoding, and a RISC processor, could all be placed onto a
single die. Although such a baseband SoC was far from the complete cell phone electronics (major
components such as the RF transceiver, the analog power control, the analog baseband, and passives were
not integrated), the evolutionary path, to integrate more and more onto a single die with each new
process generation, was clear. Today's chipset would become tomorrow's chip. The problems of integrating
the hybrid technologies involved in making up a complete electronic system would be solved. Thus, eventually,
SoC could encompass design components drawn from the standard and more adventurous domains of
digital, analog, RF, reconfigurable logic, sensors, actuators, optical, chemical, microelectromechanical
systems, and even biological and nanotechnology.
With this viewpoint of continued process evolution leading to ever-increasing levels of integration in
ever-more-complex SoC devices, the issue of a SoC being a single chip at any particular point in time
is somewhat moot. Rather, the word "system" in System-on-Chip is more important than "chip". What
is most important about a SoC, whether packaged as a single chip, an integrated chipset, a System-in-
Package (SiP), or a System-on-Package (SoP), is that it is designed as an integrated system, making design
trade-offs across the processing domains and across the individual chip and package boundaries.
18.2 System-on-a-Chip
Let us define a SoC as a complex integrated circuit, or integrated chipset, which combines the major
functional elements or subsystems of a complete end product into a single entity. These days, all interesting
SoC designs include at least one programmable processor, and very often a combination of at least
one RISC control processor and one DSP. They also include on-chip communications structures: processor
bus(es), peripheral bus(es), and perhaps a high-speed system bus. A hierarchy of on-chip memory
units, and links to off-chip memory, are important, especially for SoC processors (caches and main memories;
very often separate instruction and data caches are included). For most signal-processing applications,
some degree of HW-based acceleration is provided by functional units offering higher performance and
lower energy consumption. For interfacing to the external, real world, SoCs include a number of peripheral
processing blocks, and owing to the analog nature of the real world, these may include analog
components as well as digital interfaces (e.g., to system buses at a higher packaging level). Although
there is much interesting research on incorporating MEMS-based sensors and actuators, and on SoC
applications incorporating chemical processing (lab-on-a-chip), these are, with rare exceptions, research
topics only. However, future SoCs of a commercial nature may include such subsystems as well as optical
communications interfaces.
Figure 18.1 illustrates what a typical SoC might contain for consumer applications.
One key point about SoCs that is often forgotten by those approaching them from a HW-oriented
perspective is that all interesting SoC designs encompass both hardware (HW) and software (SW) components:
that is, programmable processors, Real-Time Operating Systems (RTOSs), and other aspects
of HW-dependent SW, such as peripheral device drivers, as well as middleware stacks for particular
application domains, and possibly optimized assembly code for DSPs. Thus, the design and use of SoCs
cannot remain a HW-only concern; it involves aspects of system-level design and engineering, HW/SW
trade-off and partitioning decisions, and SW architecture, design, and implementation.
18.3 System-on-a-Programmable-Chip
Recently, attention in the SoC world has begun to expand from SoC implementations using custom,
ASIC, or Application-Specific Standard Part (ASSP) design approaches to include the design and use of
complex reconfigurable logic parts with embedded processors and other application-oriented blocks of
intellectual property. These complex FPGAs (Field-Programmable Gate Arrays) are offered by several
vendors, including Xilinx (Virtex-II PRO Platform FPGA) and Altera (SOPC), but are referred to by
FIGURE 18.1 A typical SoC device for consumer applications: a microprocessor and a DSP, each with instruction and data caches, together with RAM, flash, DMA, and external memory access on a system bus; a bus bridge to a peripheral bus serving MPEG decode, video interface, audio codec, PLL, test, PCI, USB, disk controller, and 100Base-T blocks.
several names: highly programmable SoCs, system-on-a-programmable-chip, and embedded FPGAs. The key
idea behind this approach to SoC is to combine large amounts of reconfigurable logic with embedded
RISC processors (either custom laid-out, hardened blocks, or synthesizable processor cores), in order
to allow very flexible and tailorable combinations of HW and SW processing to be applied to a particular
design problem. Algorithms that consist of significant amounts of control logic, plus significant quantities
of dataflow processing, can be partitioned into the control RISC processor (e.g., in the Xilinx Virtex-II PRO, a
PowerPC processor) and reconfigurable logic offering HW acceleration. Although the resulting combination
does not offer the highest performance, lowest energy consumption, or lowest cost in comparison
with custom IC or ASIC/ASSP implementations of the same functionality, it does offer tremendous flexibility
in modifying the design in the field, and it avoids expensive Non-Recurring Engineering (NRE)
charges in the design. Thus, new applications, interfaces, and improved algorithms can be downloaded to
products working in the field using this approach.
Products in this area also include other processing and interface cores, such as Multiply-Accumulate
(MAC) blocks, which are specifically aimed at DSP-type dataflow signal and image processing applications,
and high-speed serial interfaces for wired communications, such as SERDES (serializer/deserializer)
blocks. In this sense, system-on-a-programmable-chip SoCs are not exactly application-specific, but not
completely generic either.
It remains to be seen whether system-on-a-programmable-chip SoCs will become a successful way of
delivering high-volume consumer applications, or will end up restricted to the two main applications for
high-end FPGAs: rapid prototyping of designs which will be re-targeted to ASIC or ASSP implementations,
and use in high-end, relatively expensive parts of the communications infrastructure that require in-field
flexibility and can tolerate the trade-offs in cost, energy consumption, and performance. Certainly, the
use of synthesizable processors on more moderate FPGAs to realize SoC-style designs is one alternative to
the cost issue. Intermediate forms, such as the use of metal-programmable gate-array style logic fabrics
together with hard-core processor subsystems and other cores, as offered in the "Structured
ASIC" products of LSI Logic (RapidChip) and NEC (Instant Silicon Solutions Platform), represent an
intermediate form of SoC between the full-mask ASIC and ASSP approach and the field-programmable
gate array approach. Here the trade-offs are much slower design creation (a few weeks rather than a day or
so), higher NRE than an FPGA (but much lower than a full set of masks), and better cost, performance, and
energy consumption than an FPGA (though perhaps 15 to 30% worse than an ASIC approach). Further interesting
compromise or hybrid approaches, such as an ASIC/ASSP with on-chip FPGA regions, are also emerging
to give design teams more choices.
18.4 IP Cores
The design of SoCs would not be possible if every design started from scratch. In fact, the design of SoCs
depends heavily on the reuse of Intellectual Property blocks, what are called "IP cores". IP reuse has
emerged as a strong trend over the last 8 to 9 years [3] and has been one key element in closing what the International
Technology Roadmap for Semiconductors [4] calls the "design productivity gap": the difference
between the rate of increase in complexity offered by advancing semiconductor process technology and
the rate of increase in designer productivity offered by advances in design tools and methodologies.
But reuse is not just important as a way of enhancing designer productivity, although it has
dramatic impacts on that. It also provides a mechanism for design teams to create SoC products that
span multiple design disciplines and domains. The availability of both hard (laid-out and characterized)
and soft (synthesizable) processor cores from a number of processor IP vendors allows design teams who
would not be able to design their own processor from scratch to drop them into their designs, and thus add
RISC control and DSP functionality to an integrated SoC without having to master the art of processor
design within the team. In this sense, the advantages of IP reuse go beyond productivity: it offers both a
large reduction in design risk and a way for SoC designs to be done that would otherwise be infeasible
owing to the length of time it would take to acquire expertise and design IP from scratch.
This ability, when acquiring and reusing IP cores, to obtain in prepackaged form design-domain
expertise outside one's own design team's set of core competencies is a key requirement for the evolution
of SoC design going forward. SoC design up to this point has concentrated to a large part on integrating digital
components together, perhaps with some analog interface blocks treated as black boxes. The
hybrid SoCs of the future, incorporating domains unfamiliar to the integration team, such as RF or
MEMS, require the concept of drop-in IP to be extended to these new domains. We are not yet at that
state: considerable evolution in the IP business and in the methodologies of IP creation, qualification,
evaluation, integration, and verification is required before we will be able to easily specify and integrate
truly heterogeneous sets of disparate IP blocks into a complete hybrid SoC.
However, the same issues existed at the beginning of the SoC revolution in the digital domain. They
have been solved to a large extent through the creation of standards for IP creation, evaluation, exchange,
and integration, primarily for digital IP blocks but extending also to Analog/Mixed-Signal (AMS) cores.
Among the leading organizations in the identification and creation of such standards has been the Virtual
Socket Interface Alliance (VSIA) [5], formed in 1996 and having, at its peak, more than 200 IP,
systems, semiconductor, and Electronic Design Automation (EDA) corporate members. Although often
criticized over the years for a lack of formal and acknowledged adoption of its IP standards, VSIA has
had a more subtle influence on the electronics industry: many companies instituting reuse programs
internally, many IP, systems, and semiconductor companies engaging in IP creation and exchange, and
many design groups have used VSIA IP standards as a key starting point for developing their own standards
and methods for IP-based design. In this sense, use of VSIA outputs has enabled a kind of IP reuse in the
IP business.
VSIA, for example, in its early architectural documents of 1996 to 1997, helped define the strong
industry-adopted understanding of what it means for an IP block to be considered in "hard" or
"soft" form. Other important contributions to design included the widely read system-level design model
taxonomy created by one of its working groups. Its standards, specifications, and documents thus represent
a very useful resource for the industry [6].
Other important issues for the rise of IP-based design and the emergence of a third-party industry
in this area (which has taken much longer to emerge than originally hoped in the mid-1990s) are the
business issues surrounding IP evaluation, purchase, delivery, and use. Organizations such as the Virtual
Component Exchange (VCX) [7] emerged to look at these issues and provide solutions. Although the VCX is still in
existence, it is clear that the vast majority of IP business relationships between firms occur within a more
ad hoc supplier-to-customer business framework.
18.5 Virtual Components
The VSIA has had a strong influence on the nomenclature of the SoC- and IP-based design industry. The
concept of the "virtual socket", a description of all the design interfaces which an IP core must satisfy,
and of the design models and integration information which must be provided with the IP core to
allow it to be more easily integrated or dropped into an SoC design, comes from the world of
Printed Circuit Board (PCB) design, where components are sourced and purchased in prepackaged form
and can be dropped into a board design in a standardized way.
The dual of the virtual socket then becomes the "virtual component". Not only in the VSIA context,
but also more generally in the industry, an IP core represents a design block which might be reusable.
A virtual component represents a design block that is intended for reuse, and which has been developed
and qualified to be highly reusable. The things that separate IP cores from virtual components are, in
general:
• Virtual components conform in their development and verification processes to well-established
design processes and quality standards.
• Virtual components come with design data, models, associated design files, scripts, characterization
information, and other deliverables which conform to one or another well-accepted standard for IP
reuse, for example, the VSIA deliverables, or another internal or external set of standards.
• Virtual components in general should have been fabricated at least once, and characterized
postfabrication, to ensure that their claims are validated.
• Virtual components should have been reused at least once by an external design team, and usage
reports and feedback should be available.
• Virtual components should have been rated for quality using an industry-standard quality metric
such as OpenMORE (originated by Synopsys and Mentor Graphics) or the VSI Quality standard
(which has OpenMORE as one of its inputs).
To a large extent, the developments of the last decade in IP reuse have been focused on defining the
standards and processes needed to turn the ad hoc reuse of IP cores into a well-understood and reliable process
for acquiring and reusing virtual components, thus enhancing the analogy with PCB design.
18.6 Platforms and Programmable Platforms
The emphasis in the preceding sections has been on IP (or virtual component) reuse on a somewhat
ad hoc, block-by-block basis in SoC design. Over the past several years, however, there has arisen a more
integrated approach to the design of complex SoCs and the reuse of virtual components: what has been
called platform-based design. This is dealt with at much greater length in another chapter in this
book, and much more information is available in References 8 to 11. Suffice it here to define platform-based
design in the SoC context from one perspective.
We can define platform-based design as a planned design methodology which reduces the time, effort,
and risk involved in designing and verifying a complex SoC. This is accomplished by
extensive reuse of combinations of HW and SW IP. As an alternative to IP reuse in a block-by-block
manner, platform-based design assembles groups of components into a reusable platform architecture. This
reusable architecture, together with libraries of preverified and precharacterized, application-oriented HW
and SW virtual components, is a SoC integration platform.
There are several reasons for the growing popularity of the platform approach in industrial design. These
include the increase in design productivity, the reduction in risk, the ability to utilize preintegrated virtual
components from other design domains more easily, and the ability to reuse SoC architectures created by
experts. Industrial platforms include full application platforms, reconfigurable platforms, and processor-centric
platforms [12]. Full application platforms, such as Philips Nexperia and TI OMAP, provide a
complete implementation vehicle for specific product domains [13]. Processor-centric platforms, such as
ARM PrimeXsys, concentrate on the processor, its required bus architecture, and basic sets of peripherals,
along with an RTOS and basic SW drivers. Reconfigurable or highly programmable platforms, such as the
Xilinx Platform FPGA and Altera's SOPC, deliver hard-core processors plus reconfigurable logic, along with
associated IP libraries and design tool flows.
18.7 Integration Platforms and SoC Design
The use of SoC integration platforms changes the SoC design process in two fundamental ways:
1. The basic platform must be designed, using whatever ad hoc or formalized design process for SoC
the platform creators decide on. Section 18.8 outlines some of the basic steps required to build a SoC,
whether building a platform or using a more ad hoc, block-based integration process. However, when
constructing a SoC platform for reuse in derivative design, it is important to remember that it may not be
necessary to take the whole platform and its associated HW and SW component libraries through complete
implementation. Enough implementation must be done to allow the platform and its constituent libraries
to be fully characterized and modeled for reuse. It is also essential that the platform creation phase produce,
in an archivable and retrievable form, all the design files required for the platform and its libraries to be
reused in a derivative design process. This must also include the setup of the appropriate configuration
programs or scripts to allow automatic creation of a configured platform during derivative design.
2. A design process must be created and qualified for all the derivative designs that will be created
based on the SoC integration platform. This must include processes for retrieving the platform from its
archive, for entering the derivative design configuration into a platform configurator, the generation of
the design files for the derivative, the generation of the appropriate verification environment(s) for the
derivative, the ability for derivative design teams to select components from libraries, to modify these
components and validate them within the overall platform context, and, to the extent supported by the
platform, to create new components for their particular application.
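The derivative-design flow in step 2 can be illustrated with a small configurator sketch. Everything here is hypothetical for illustration: the platform library, block names, area figures, and the idea of an area budget are invented, not taken from any real platform offering.

```python
# Minimal sketch of a derivative-design platform configurator.
# All platform/component names and area numbers are hypothetical.

PLATFORM_LIBRARY = {
    "risc_core": {"area_kgates": 150, "optional": False},
    "sys_bus":   {"area_kgates": 30,  "optional": False},
    "uart":      {"area_kgates": 5,   "optional": True},
    "dma":       {"area_kgates": 20,  "optional": True},
}

def configure_derivative(selected_optional_blocks, area_budget_kgates):
    """Build a derivative configuration from the platform library.

    Mandatory blocks (processor, bus) are always included; optional
    blocks are pulled from the library on request. Unknown blocks and
    over-budget configurations are rejected early, mirroring the
    qualified derivative-design process described in the text.
    """
    config = {n: c for n, c in PLATFORM_LIBRARY.items() if not c["optional"]}
    for name in selected_optional_blocks:
        if name not in PLATFORM_LIBRARY:
            raise ValueError(f"block {name!r} not in platform library")
        config[name] = PLATFORM_LIBRARY[name]
    total = sum(c["area_kgates"] for c in config.values())
    if total > area_budget_kgates:
        raise ValueError(f"needs {total} kgates, budget is {area_budget_kgates}")
    return sorted(config), total

blocks, area = configure_derivative(["uart", "dma"], area_budget_kgates=250)
```

In a real flow the configurator would emit design files and verification environments rather than a block list, but the shape of the process, selection within a constrained scope plus early validity checks, is the same.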
Reconfigurable or highly programmable platforms introduce an interesting addition to the platform-
based SoC design process [14]. Platform FPGAs and SOPC devices can be thought of as a "meta-platform":
a platform for creating platforms. Design teams can obtain these devices from companies such as Xilinx
and Altera, containing a basic set of more generic capabilities and IP: embedded processors, on-chip buses,
special IP blocks such as MACs and SERDES, and a variety of other prequalified IP blocks. They can then
customize the meta-platform to their own application space by adding application domain-specific IP
libraries. Finally, the combined platform can be provided to derivative design teams, who can select the
basic meta-platform and configure it within the scope intended by the intermediate platform creation
team, selecting the IP blocks needed for their exact derivative application. More on platform-based design
will be found in another chapter in this book.
2006 by Taylor & Francis Group, LLC
System-on-Chip and Network-on-Chip Design 18-7
18.8 Overview of the SoC Design Process
The most important thing to remember about SoC design is that it is a multi-disciplinary design process,
which needs to exercise design processes from across the spectrum of electronics. Design teams must gain
some fluency with all these multiple disciplines, but the integrative and reuse nature of SoC design means
that they may not need to become deep experts in all of them. Indeed, avoiding the need for designers to
understand all methodologies, flows, and domain-specific design techniques is one of the key reasons for
reuse and enablers of productivity. Nevertheless, from Design-for-Test (DFT) through digital and analog
HW design, from verification through system-level design, from embedded SW through IP procurement
and integration, from SoC architecture through IC analysis, a wide variety of knowledge is required by
the team, if not every designer.
Figure 18.2 illustrates some of the basic constituents of the SoC design process.
We will now define each of these steps as illustrated:
FIGURE 18.2 Steps in the SoC design process: SoC requirements analysis; SoC architecture (communications
architecture, choice of processor(s)); system-level design (HW-SW partitioning, system modeling, performance
analysis); building the transaction-level golden testbench; acquisition of HW and SW IP; definition of the SW
architecture; configuration and floorplanning of the SoC HW microarchitecture; DFT architecture and
implementation; AMS HW implementation; HW IP assembly and implementation; SW assembly and
implementation; HW and HW-SW verification; final SoC HW assembly and verification; and fabrication,
testing, packaging, and lab verification with SW.

SoC requirements analysis. This is the basic step for defining and specifying a complex SoC, based on the
needs of the end product into which it will be integrated. The primary input into this step is the marketing
definition of the end product and the resulting characteristics of what the SoC should be: both functional
and nonfunctional (e.g., cost, size, energy consumption, performance: latency and throughput, package
selection). This process of requirements analysis must ultimately answer the question: is the product
feasible? Is the desired SoC feasible to design, and with what effort and in what timeframe? How much
reuse will be possible? Is the SoC design based on legacy designs of previous-generation products (or, in
the case of platform-based design, to be built based on an existing platform offering)?
SoC architecture. In this phase, the basic structure of the desired SoC is defined. Vitally important
is to decide on the communications architecture that will be used as the backbone of the SoC on-chip
communications network. An inadequate communications architecture will cripple the SoC and have as
big an impact as the use of an inappropriate processor subsystem. Of course, the choice of communications
architecture is impossible to divorce from making the basic processor(s) choice: for example, do I use a
RISC control processor? Do I have an on-board DSP? How many of each? What are the processing demands
of my SoC application? Do I integrate the bare processor core, or use a whole processor subsystem
provided by an IP company (most processor IP companies have moved from offering just processor
cores to whole processor subsystems, including hierarchical bus fabrics tuned to their particular processor
needs)? Do I have some ideas, based on legacy SoC design in this space, as to how SW and HW should be
partitioned? What memory hierarchy is appropriate? What are the sizes, levels, performance requirements,
and configurations of the embedded memories most appropriate to the application domain for the SoC?
System-level design. This is an important phase of the SoC process, but one that is often done in
a relatively ad hoc way. The whiteboard and the spreadsheet are as much used by the SoC architects
as more capable toolsets. However, there has long been use of ad hoc C/C++ based models for the
system design phase to validate basic architectural choices. And designers of complex signal processing
algorithms for voice and image processing have long adopted dataflow models and associated tools to
define their algorithms, define optimal bit-widths, and validate performance, whether destined for HW or
SW implementation. A flurry of activity in the last few years on different C/C++ modeling standards for
system architects has consolidated on SystemC [15]. The system nature of SoC demands a growing use
of system-level design modeling and analysis, as these devices grow more complex. The basic processes
carried out in this phase include HW-SW partitioning (the allocation of functions to be implemented in
dedicated HW blocks, in SW on processors [and the decision of RISC versus DSP], or a combination of
both, together with decisions on the communications mechanisms to be used to interface HW and SW, or
HW-HW and SW-SW). In addition, the construction of system-level models, and the analysis of correct
functioning, performance, and other nonfunctional attributes of the intended SoC through simulation
and other analytical tools, is necessary. Finally, all additional IP blocks required, which can be sourced
outside or reused from the design group's legacy, must be identified, both HW and SW. The remaining
new functions will need to be implemented as part of the overall SoC design process.
IP acquisition. After system-level design and the identification of the processors and communications
architecture, and other HW or SW IP required for the design, the group must undertake an IP acquisition
stage. This can, to a large extent, be done at least in part in parallel with other work such as system-level
design (assuming early identification of major external IP is made) or building golden transaction-level
testbench models. Fortunate design groups will be working in companies with a large legacy of existing
well-crafted IP (rather, virtual components) organized in databases that can be easily searched; or
those with access via supplier agreements to large external IP libraries; or at least those with experience
at IP search, evaluation, purchase, and integration. For these lucky groups, the problems at this stage are
greatly ameliorated. Others with less experience or infrastructure will need to explore these processes for
the first time, hopefully making use of IP suppliers' experience with the legal and other processes required.
Here the external standards bodies such as VSIA and VCX have done much useful work that will smooth
the path, at least a little. One key issue in IP acquisition is to conduct rigorous and thorough incoming
inspection of IP to ensure its completeness and correctness to the greatest extent possible prior to use, and
to resolve any quality problems early with suppliers, long before SoC integration. Every hour spent
at this stage will pay back in avoiding much longer schedule slips later. The IP quality guidelines
discussed earlier are a foundation level for a quality process at this point.
Build a transaction-level golden testbench. The system model built up during the system-level design
stage can form the basis for a more elaborated design model, using transaction-level abstractions [16],
which represents the underlying HW-SW architecture and components in more detail: sufficient detail
to act as a functional virtual prototype for the SoC design. This golden model can be used at this stage to
verify the microarchitecture of the design and to verify detailed design models for HW IP at the Hardware
Description Language (HDL) level within the overall system context. It thus can be reused all the way
down the SoC design and implementation cycle.
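The golden-testbench idea can be sketched compactly: replay the same stream of transactions on an untimed reference model and on a more detailed model, and flag any behavioral mismatch. In practice this is done in SystemC/TLM against HDL models; the plain-Python memory models below are invented stand-ins that show only the comparison mechanism.

```python
# Sketch of golden-model checking at the transaction level: a reference
# model and a detailed model are driven with identical read/write
# transactions and their responses are compared.

class GoldenMemory:
    """Untimed reference ('golden') model: behavior only, no timing."""
    def __init__(self):
        self.mem = {}
    def transact(self, op, addr, data=None):
        if op == "write":
            self.mem[addr] = data
            return None
        return self.mem.get(addr, 0)

class PipelinedMemory(GoldenMemory):
    """More detailed model: same behavior plus a cycle count, standing
    in for a cycle-accurate implementation model."""
    def __init__(self):
        super().__init__()
        self.cycles = 0
    def transact(self, op, addr, data=None):
        self.cycles += 2 if op == "write" else 1
        return super().transact(op, addr, data)

def check_against_golden(golden, dut, transactions):
    """Replay each transaction on both models; any mismatch is a bug."""
    for op, addr, data in transactions:
        if golden.transact(op, addr, data) != dut.transact(op, addr, data):
            return False
    return True

txns = [("write", 0x10, 42), ("read", 0x10, None), ("read", 0x20, None)]
ok = check_against_golden(GoldenMemory(), PipelinedMemory(), txns)
```

Because the comparison is purely transaction-by-transaction, the same golden model can be reused unchanged as the detailed side is refined from abstract bus models down to HDL, which is exactly why the text calls it reusable "all the way down" the flow.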
Define the SoC SW architecture. SoC is of course not just about HW [17]. As well as often defining the
right on-chip communications architecture, the choice of processor(s) and the nature of the application
domain have a very heavy influence on the SW architecture. For example, RTOS choice is limited by the
processor ports that have been done and by the application domain (OSEK is an RTOS for automotive
systems; Symbian OS for portable wireless devices; PalmOS for Personal Digital Assistants, etc.). As well
as the basic RTOS, every SoC peripheral device will need a device driver, hopefully based on reuse
and configuration of templates; various middleware application stacks (e.g., telephony, multimedia image
processing) are important parts of the SW architecture; voice and image encoding and decoding on
portable devices is often based on assembly code IP for DSPs. There is thus a strong need, in defining the
SoC, to fully elaborate the SW architecture to allow reuse, easy customization, and effective verification of
the overall HW-SW device.
Configure and floorplan SoC microarchitecture. At this point we are beginning to deal with the SoC
on a more physical and detailed logical basis. Of course, during high-level architecture and system-level
design, the team has been looking at physical implementation issues (although our design process diagram
shows everything as a waterfall kind of flow, in reality SoC design, like all electronics design, is more of
an iterative, incremental process; that is, more akin to the famous spiral model for SW). But before
beginning the detailed HW design and integration, it is important that there is agreement among the
team on the basic physical floorplan; that all the IP blocks are properly and fully configured; that the
basic microarchitectures (test, power, clocking, bus, timing) have been fully defined and configured; and
that HW implementation can proceed. In addition, this process should also generate the downstream
verification environments that will be used throughout the implementation processes, whether SW-
simulation based, emulation based, using rapid prototypes, or other hybrid verification approaches.
DFT architecture and implementation. The test architecture is only one of the key microarchitectures
that must be implemented; it is complicated by IP legacy and the fact that it is often impossible to
impose one DFT style (such as BIST or SCAN) on all IP blocks. Rather, wrappers or adaptations of
standard test interfaces (such as JTAG ports) may be necessary to fit all IP blocks together into a coherent
test architecture and plan.
AMS HW implementation. Most SoCs incorporating AMS blocks use them to interface to the external
world. VSIA, among other groups, has done considerable work in defining how AMS IP blocks should be
created to allow them to be more easily integrated into mainly digital SoCs (the "Big D/little a" SoC),
along with guidelines and rules for such integration. Experiences with these rules, guidelines, and extra
deliverables have been, on the whole, promising, but they have more impact between internal design
groups today than on the industry as a whole. The "Big A/Big D" mixed-signal SoC is still relatively rare.
HW IP assembly and integration. This design step is in many ways the most traditional. Many design
groups have experience in assembling design blocks done by various designers or subgroups, in an
incremental fashion, into the agreed-on architectures for communications, bussing, clocking, power, etc.
The main difference with SoC is that many of the design blocks may be externally sourced IP. To avoid
difficulties at this stage, the importance of rigorous qualification of incoming IP and the early definition of
the SoC microarchitecture, to which all blocks must conform, cannot be overstated.
SW assembly and implementation. Just as with HW, the SW IP, together with new or modied SW tasks
created for the particular SoC under design, must be assembled together and validated as to conformance
to interfaces and expected operational quality. It is important to verify as much of the SW in its normal
system operating context as possible.
HW and HW-SW verification. Although represented as a single box on the diagram, this is perhaps one
of the largest consumers of design time and effort and the major determinant of final SoC quality. Vital to
effective verification is the setup of a targeted SoC verification environment, reusing the golden testbench
models created at higher levels of the design process. In addition, highly capable, multi-language, mixed
simulation environments are important (e.g., SystemC models and HDL implementation models need to
be mixed in the verification process, and effective links between them are crucial). There are a large number
of different verification tools and techniques [18], ranging from SW-based simulation environments to
HW emulators, HW accelerators, and FPGA and bonded-core-based rapid prototyping approaches. In
addition, formal techniques such as equivalence checking and model/property checking have enjoyed
some successful usage in verifying parts of SoC designs, or the design at multiple stages in the process.
Mixed approaches to HW-SW verification range from incorporating Instruction Set Simulators (ISSs) of
processors in SW-based simulation, to linking HW emulation of the HW blocks (compiled from the HDL
code) to SW running natively on a host workstation, linked in an ad hoc fashion by design teams or using
a commercial mixed verification environment. Alternatively, HDL models of new HW blocks running in
a SW simulator can be linked to emulation of the rest of the system running in HW: a mix of emulation
and use of bonded-out processor cores for executing SW. It is important that as much of the system SW
as possible be exercised in the context of the whole system, using the most appropriate verification tech-
nology that can get the design team close to real-time execution speed (no more than 100× slower is the
minimum needed to run significant amounts of SW). The trend to transaction-based modeling of systems,
where transactions range in abstraction from untimed functional communications via message calls,
through abstract bus communications models, through cycle-accurate bus functional models, and finally
to cycle- and pin-accurate transformations of transactions to the fully detailed interfaces, allows verification
to occur at several levels or with mixed levels of design description. Finally, a new trend in verification is
assertion-based verification, using a variety of input languages (PSL/Sugar, e, Vera, or regular Verilog and
VHDL) to model design properties, which can then be monitored during simulation, to ensure that either
certain properties will be satisfied or certain error conditions never occur. Combinations of formal property
checking and simulation-based assertion checking have been created, viz. semiformal verification. The
most important thing to remember about verification is that, armed with a host of techniques and tools,
it is essential for design teams to craft a well-ordered verification process that allows them to definitively
answer the question "how do we know that verification is done?" and thus allows the SoC to be fabricated.
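The assertion-based monitoring described above can be sketched in miniature. Real flows express properties in PSL/SVA over HDL signals; here a Python checker replays a recorded signal trace and evaluates an invented example property, a one-hot bus-grant invariant, on every cycle.

```python
# Sketch of assertion-based verification: properties are monitored on
# every simulated cycle, and any cycle that violates one is flagged.
# The one-hot grant property and the traces are invented examples.

def at_most_one_grant(grants):
    """Safety property: no more than one bus grant active per cycle."""
    return sum(grants) <= 1

def run_with_assertions(trace, properties):
    """Replay a recorded per-cycle signal trace, checking every property
    each cycle; returns the (cycle, property_name) violations found."""
    violations = []
    for cycle, grants in enumerate(trace):
        for prop in properties:
            if not prop(grants):
                violations.append((cycle, prop.__name__))
    return violations

good_trace = [(0, 0), (1, 0), (0, 1), (0, 0)]   # grants per cycle
bad_trace = [(0, 0), (1, 1), (0, 1)]            # two grants at cycle 1
violations = run_with_assertions(bad_trace, [at_most_one_grant])
```

Because the monitors are separate from the stimulus, the same properties can be reused across simulation, emulation, and, for the formally checkable subset, property checking, which is the point of the semiformal combinations mentioned in the text.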
Final SoC HW assembly and verification. Often done in parallel with, or overlapping, those final few
simulation runs in the verification stage, the final SoC HW assembly and verification phase includes final
place and route of the chip, any hand-modifications required, and final physical verification (using design
rule checking and layout-versus-schematic [netlist] tools), as well as important analysis steps for issues that
occur in advanced semiconductor processes, such as IR drop, signal integrity, and power network integrity,
as well as satisfaction of, and design transformation for, manufacturability (OPC, etc.).
Fabrication, testing, packaging, and lab verification. When a SoC has been shipped to fabrication, it
would seem time for the design team to relax. Instead, this is an opportunity for additional verification to
be carried out, especially more verification of system SW running in the context of the HW design, and
for fixes, either of SW or of the SoC HW, on hopefully no more than one expensive iteration of the
design, to be determined and planned. When the tested packaged parts arrive back for verification in
the lab, the ideal scenario is to load the SW into the system and have the SoC and its system booted up and
running SW within a few hours. Interestingly, the most advanced SoC design teams, with well-ordered
design methodologies and processes, are able to achieve this quite regularly.
18.9 System-Level Design
As discussed earlier, when describing the overall SoC design flow, system-level design and SoC are
essentially made for each other. A key aim of IP reuse and of SoC techniques such as platform-based design
is to make the back-end (RTL to GDS II) design implementation processes easier, fast, and low-risk, and
to shift the major design phase for SoC up in time and in abstraction level to the system level. This
also means that the back-end tools and flows for SoC designs do not necessarily differ from those used for
complex ASIC, ASSP, and custom IC design; it is the methodology of how they are used, and how blocks
are sourced and integrated, overlaying the underlying design tools and flows, that may differ for SoC.
However, the fundamental nature of IP-based design of SoC has a stronger influence on the system level.
It is at the system level that the vital tasks of deciding on and validating the basic system architecture and
choice of IP blocks are carried out. In general, this is known as design space exploration (DSE). As part
of this exploration, SoC platform customization for a particular derivative is carried out, should the SoC
platform approach be used. Essentially one can think of platform DSE as being a similar task to general
DSE, except that the scope and boundaries of the exploration are much more tightly constrained: the
basic communications architecture and platform processor choices may be fixed, and the design team may
be restricted to choosing certain customization parameters and choosing optional IP from a library. Other
tasks include HW-SW partitioning, usually restricted to decisions about key processing tasks that might
be mapped onto either HW or SW form and that have a big impact on system performance, energy
consumption, on-chip communications bandwidth consumption, or other key attributes. Of course,
in multiprocessor systems, there are SW-SW partitioning or codesign issues as well, deciding on the
assignment of SW tasks to various processor options. Again, perhaps 80 to 95% of these decisions can be,
or are, made a priori, especially if a SoC is based on either a platform or an evolution of an existing system;
such codesign decisions are usually made on a small number of functions that have critical impact.
Because partitioning, codesign, and DSE tasks at the system level involve much more than HW-SW
issues, a more appropriate term for this is "function-architecture codesign" [19,20]. In this codesign model,
systems are described on two equivalent levels:

The functional intent of the system: for example, a network of applications, decomposed into
individual sets of functional tasks, which may be modeled using a variety of models of computation
such as discrete event, finite state machine, or dataflow.

The architectural structure of the system: the communications architecture, major IP blocks
such as processor(s), memory(ies), and HW blocks, captured or modeled, for example, using some
kind of IP or platform configurator.
The methodology implied in this approach is then to build explicit mappings between the functional
view of the system and the architectural view, which carry within them the implicit partitioning that is
made for both computation and communications. This hybrid model can then be simulated, the results
analyzed, and a variety of ancillary models (e.g., cost, power, performance, communications bandwidth
consumption, etc.) can be utilized in order to examine the suitability of the system architecture as a vehicle
for realizing or implementing the end product functionality.
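The mapping step can be made concrete with a toy model. All task names, resource names, and throughput numbers below are invented; the point is only the structure: a functional view (tasks with abstract workloads), an architectural view (resources with abstract throughputs), and an explicit mapping between them that can be evaluated.

```python
# Sketch of function-architecture codesign: functional tasks are
# explicitly mapped onto architectural resources, and each candidate
# mapping is evaluated (here, for per-resource execution time).
# All names and numbers are hypothetical.

TASKS = {              # functional view: task -> workload (abstract ops)
    "fft": 800,
    "control": 50,
    "packetizer": 200,
}
RESOURCES = {          # architectural view: resource -> throughput (ops per time unit)
    "dsp": 4.0,
    "risc": 1.0,
    "hw_accel": 10.0,
}

def evaluate_mapping(mapping):
    """Given task -> resource, return the estimated execution time
    accumulated on each resource (workload / throughput)."""
    load = {r: 0.0 for r in RESOURCES}
    for task, resource in mapping.items():
        load[resource] += TASKS[task] / RESOURCES[resource]
    return load

# One candidate: signal processing on the DSP, control in SW on the
# RISC, the packetizer in dedicated HW.
load = evaluate_mapping({"fft": "dsp", "control": "risc", "packetizer": "hw_accel"})
```

Swapping a task's target (e.g., moving the packetizer from `hw_accel` to `risc`) and re-evaluating is exactly the DSE loop the text describes, with the ancillary models (cost, power, bandwidth) being further functions of the same mapping.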
The function-architecture codesign approach has been implemented and used in both research and
commercial tools [21] and forms the foundation of many system-level codesign approaches going
forward. In addition, it has been found extremely suitable as the best system-level design approach for
platform-based design of SoC [22].
18.10 Interconnection and Communication Architectures
for SoC
This topic is dealt with in more detail in other chapters in this book. Suffice it to say here that current
SoC architectures deal in fairly traditional hierarchies of standard on-chip buses: for example, processor-
specific buses, high-speed system buses, and lower-speed peripheral buses, using standards such as ARM's
AMBA and IBM's CoreConnect [13], and traditional master-slave bus approaches. Recently, there has
been a lot of interest in Network-on-Chip (NoC) communications architectures, based on packet switching,
and a number of approaches have been reported in the literature; but this remains primarily a research
topic, both in universities and industrial research labs [23].
18.11 Computation and Memory Architectures for SoC
The primary processors used in SoC are embedded RISCs such as ARM processors, PowerPCs, MIPS
architecture processors, and some of the configurable processors designed specifically for SoC, such as
Tensilica and ARC. In addition, embedded DSPs from traditional suppliers such as TI, Motorola,
ParthusCeva, and others are also quite common in many consumer applications, for embedded signal
processing for voice and image data. Research groups have looked at compiling or synthesizing
application-specific processors or coprocessors [24,25], and these have interesting potential in future SoCs,
which may incorporate networks of heterogeneous configurable processors collaborating to offer large
amounts of computational parallelism. This is an especially interesting prospect given wider use of
reconfigurable logic, which opens up the prospect of dynamic adaptation of SoC to application needs.
However, most multiprocessor SoCs today involve at most 2 to 4 processors of conventional design; the
larger networks are more often found today in the industrial or university lab.
Although several years ago most embedded processors in early SoCs did not use cache memory-based
hierarchies, this has changed significantly over the years, and most RISC and DSP processors now involve
significant amounts of Level 1 cache memory, as well as higher-level memory units both on- and off-chip
(off-chip flash memory is often used for embedded SW tasks that may be only infrequently required).
System design tasks and tools must consider the structure, size, and configuration of the memory hierarchy
as one of the key SoC configuration decisions that must be made.
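One standard way to quantify such memory-hierarchy decisions is the average memory access time (AMAT) metric, AMAT = hit time + miss rate × miss penalty, applied recursively down the hierarchy. The sketch below uses this well-known formula; the specific cycle counts and miss rates are invented, illustrative numbers, not data from any particular SoC.

```python
# Sketch of memory-hierarchy exploration using average memory access
# time (AMAT): AMAT = hit_time + miss_rate * miss_penalty, folded
# recursively over the cache levels. All numbers are illustrative.

def amat(levels, memory_latency):
    """levels: list of (hit_time_cycles, miss_rate) from L1 downward;
    memory_latency: cycles to reach off-chip memory on a full miss."""
    penalty = memory_latency
    # Fold the hierarchy from the last cache level back up to L1.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty

# Compare a single-level against a two-level on-chip hierarchy.
one_level = amat([(1, 0.05)], memory_latency=100)
two_level = amat([(1, 0.05), (6, 0.30)], memory_latency=100)
```

With these numbers the added L2 cuts the average access time by more than half, which is the kind of trade-off (area and power for the extra level versus latency) that the system-level configuration decision weighs.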
18.12 IP Integration Quality and Certification Methods
and Standards
We have emphasized the design reuse aspects of SoC and the need for reuse of both internally and
externally sourced IP blocks by design teams creating SoCs. In the discussion of the design process above,
we mentioned issues such as IP quality standards and the need for incoming inspection and qualification
of IP. The issue of IP quality remains one of the biggest impediments to the use of IP-based design for
SoC [26]. The quality standards and metrics available from VSIA and OpenMORE, and their further
enhancement, help, but only to a limited extent. The industry could clearly use a formal certification body
or lab for IP quality that would ensure conformance to IP transfer requirements and the integration quality
of the blocks. Such a certification process would of necessity be quite complex, owing to the large number
of configurations possible for many IP blocks and the almost infinite variety of SoC contexts into which
they might be integrated. Certified IP would begin to deliver the "virtual components" of the VSIA vision.
In the absence of formal external certification (and such third-party labs seem a long way off, if they ever
emerge), design groups must provide their own certification processes and real reuse quality metrics, based
on their internal design experiences. Platform-based design methods help, owing to the advantages of
prequalifying and characterizing groups of IP blocks and libraries of compatible domain-specific
components. Short of independent evaluation and qualification, this is the best that design groups can do
currently.
One key issue to remember is that IP not created for reuse, with all the deliverables created and
validated according to a well-defined set of standards, is inherently not reusable. The effort required to
make a reusable IP block has been estimated to be 50 to 200% more than that required to make it for
one-time use; however, even assuming the most conservative (highest) extra cost, this implies positive
payback with three uses of the IP block. Planned and systematic IP reuse, and investment in those blocks
with greatest SoC use potential, gives a high chance of achieving significant productivity soon after starting
a reuse programme. But ad hoc attempts to reuse existing design blocks not designed to reuse standards
have failed in the past and are unlikely to provide the quality and productivity desired.
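The payback arithmetic quoted above works out as follows: in the most conservative case (200% extra effort), the reusable block costs 3× a single-use block, so its cost is amortized by the third use. The per-use integration cost below is an invented assumption (the text does not give one); any nonzero value pushes break-even slightly later.

```python
# Worked example of the IP-reuse payback estimate: making a block
# reusable costs 50-200% extra effort relative to a one-time design.
# per_use_cost (integration effort per deployment, in units of one
# single-use design) is an assumed, illustrative parameter.

def total_effort(uses, extra_reuse_cost, per_use_cost):
    """Effort to build the block once for reuse, then deploy it
    'uses' times (units: one single-use design = 1.0)."""
    return (1.0 + extra_reuse_cost) + uses * per_use_cost

def breakeven_uses(extra_reuse_cost, per_use_cost=0.0, max_uses=1000):
    """Smallest number of uses at which reuse is no more expensive
    than rebuilding the block from scratch each time."""
    n = 1
    while n < max_uses and total_effort(n, extra_reuse_cost, per_use_cost) > n * 1.0:
        n += 1
    return n

# Most conservative case from the text: 200% extra effort (3x cost).
n = breakeven_uses(extra_reuse_cost=2.0)
```

With the text's lower bound of 50% extra effort the break-even drops to two uses, which is why planned reuse of the highest-potential blocks pays off so quickly.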
18.13 Summary
In this chapter, we have defined SoC and surveyed a large number of the issues involved in its design. An
outline of the important methods and processes involved in SoC design defines a methodology which can
be adopted by design groups and adapted to their specific requirements. Productivity in SoC design
demands high levels of design reuse, and the existence of third-party and internal IP groups and the chance
to create a library of reusable IP blocks (true virtual components) are all possible for most design groups
today.
The wide variety of design disciplines involved in SoC means that unprecedented collaboration between
designers of all backgrounds, from systems experts through embedded SW designers and architects to
HW designers, is required. But the rewards of SoC justify the effort required to succeed.
References
[1] Merrill Hunt and Jim Rowson, Blocking in a system on a chip. IEEE Spectrum, 33(11), 35–41,
November 1996.
[2] Rochit Rajsuman. System-on-a-Chip Design and Test. Artech House, Norwood, Massachusetts,
2000.
[3] Michael Keating and Pierre Bricaud, Reuse Methodology Manual for System-on-a-Chip Designs.
Kluwer Academic Publishers, Dordrecht, 1998 (1st ed.), 1999 (2nd ed.), 2002 (3rd ed.).
[4] International Technology Roadmap for Semiconductors (ITRS), 2001 edn. http://public.itrs.net/.
[5] Virtual Socket Interface Alliance, on the web at URL: http://www.vsia.org. This includes access to
its various public documents, including the original Reuse Architecture document of 1997, as well
as more recent documents supporting IP reuse released to the public domain.
[6] B. Bailey, G. Martin, and T. Anderson (eds.). Taxonomies for the Development and Verification of
Digital Systems. Springer, New York, 2005.
[7] The Virtual Component Exchange (VCX). Available at http://www.thevcx.com/.
[8] Henry Chang, Larry Cooke, Merrill Hunt, Grant Martin, Andrew McNelly, and Lee Todd,
Surviving the SOC Revolution: A Guide to Platform-Based Design. Kluwer Academic Publishers,
Dordrecht, 1999.
[9] K. Keutzer, S. Malik, A.R. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli, System-Level Design:
Orthogonalization of Concerns and Platform-Based Design. IEEE Transactions on CAD of ICs and
Systems, 19, 1523–1543, 2000.
[10] Alberto Sangiovanni-Vincentelli and Grant Martin, Platform-Based Design and Software Design
Methodology for Embedded Systems. IEEE Design and Test of Computers, 18, 23–33, 2001.
[11] IEEE Design and Test of Computers Special Issue on Platform-Based Design of SoCs, 19, 463, 2002.
[12] G. Martin and F. Schirrmeister, A Design Chain for Embedded Systems. IEEE Computer, Embedded
Systems Column, 35(3), 100–103, March 2002.
[13] Grant Martin and Henry Chang, Eds., Winning the SOC Revolution: Experiences in Real Design.
Kluwer Academic Publishers, Dordrecht, May 2003.
[14] Patrick Lysaght, FPGAs as Meta-Platforms for Embedded Systems. In Proceedings of the IEEE
Conference on Field Programmable Technology. Hong Kong, December 2002.
[15] Thorsten Groetker, Stan Liao, Grant Martin, and Stuart Swan, System Design with SystemC. Kluwer
Academic Publishers, Dordrecht, May 2002.
[16] Janick Bergeron, Writing Testbenches, 3rd ed. Kluwer Academic Publishers, Dordrecht, 2003.
[17] G. Martin and C. Lennard, Improving Embedded SW Design and Integration for SOCs. Invited
Custom Integrated Circuits Conference Paper, May 2000, pp. 101108.
[18] Prakash Rashinkar, Peter Paterson, and Leena Singh, System-on-a-Chip Verication: Methodology
and Techniques. Kluwer Academic Publishers, Dordrecht, 2001.
[19] F. Balarin, M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, C. Passerone, A. Sangiovanni-
Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara, Hardware-Software Co-Design of Embedded
Systems: The POLIS Approach. Kluwer Academic Publishers, Dordrecht, 1997.
[20] S. Krolikoski, F. Schirrmeister, B. Salefski, J. Rowson, and G. Martin, Methodology and Technology
for Virtual Component Driven Hardware/Software Co-Design on the System Level. Paper 94.1,
ISCAS 99, Orlando, FL, May 30–June 2, 1999.
[21] G. Martin and B. Salefski, System Level Design for SOCs: A Progress Report, Two Years On.
In System-on-Chip Methodologies and Design Languages, Jean Mermet, Ed. Kluwer Academic
Publishers, Dordrecht, 2001, pp. 297–306.
[22] G. Martin, Productivity in VC Reuse: Linking SOC Platforms to Abstract Systems Design
Methodology. In Virtual Component Design and Reuse, Ralf Seepold and Natividad Martinez
Madrid, Eds. Kluwer Academic Publishers, Dordrecht, 2001, pp. 33–46.
[23] Axel Jantsch and Hannu Tenhunen, Eds., Networks on Chip. Kluwer Academic Publishers,
Dordrecht, 2003.
[24] Vinod Kathail, Shail Aditya, Robert Schreiber, B. Ramakrishna Rau, Darren C. Cronquist, and
Mukund Sivaraman, PICO: Automatically Designing Custom Computers. IEEE Computer, 35,
39–47, 2002.
[25] T.J. Callahan, J.R. Hauser, and J. Wawrzynek, The Garp Architecture and C Compiler. IEEE
Computer, 33, 62–69, 2000.
[26] DATE 2002 Proceedings, Session 1A: How to Choose Semiconductor IP?: Embedded Processors,
Memory, Software, Hardware. In Proceedings of DATE 2002. Paris, March 2002, pp. 14–17.
19
A Novel Methodology
for the Design of
Application-Specific
Instruction-Set
Processors
Andreas Hoffmann,
Achim Nohl, and
Gunnar Braun
CoWare Inc.
19.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-1
19.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-3
19.3 ASIP Design Flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-4
Architecture Exploration • LISA Language
19.4 LISA Processor Design Platform. . . . . . . . . . . . . . . . . . . . . . . . 19-10
Hardware Designer Platform For Exploration and Processor
Generation • Software Designer Platform For Software
Application Design • System Integrator Platform For System
Integration and Verification
19.5 SW Development Tools . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-12
Assembler and Linker • Simulator
19.6 Architecture Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-17
LISA Language Elements for HDL Synthesis • Implementation
Results
19.7 Tools for Application Development . . . . . . . . . . . . . . . . . . . . 19-22
Examined Architectures • Efficiency of the Generated Tools
19.8 Requirements and Limitations . . . . . . . . . . . . . . . . . . . . . . . . . . 19-25
LISA Language • HLL C-compiler • HDL Generator
19.9 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-26
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19-26
19.1 Introduction
In consumer electronics and telecommunications, high product volumes increasingly go along with
short life-times. Driven by the advances in semiconductor technology combined with the need for new
From Andreas Hoffmann, Tim Kogel, Achim Nohl, Gunnar Braun, Oliver Schliebusch, Oliver Wahlen, Andreas
Wieferink, and Heinrich Meyr. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20,
2001. With permission.
19-1
applications like digital TV and wireless broadband communications, the amount of system functionality
realized on a single chip is growing enormously. Higher integration and thus increasing miniaturization
have led to a shift from using distributed hardware components towards heterogeneous system-on-chip
(SOC) designs [1]. Due to the complexity introduced by such SOC designs and time-to-market constraints,
the designers' productivity has become the vital factor for successful products. For this reason a growing
amount of system functions and signal processing algorithms is implemented in software rather than in
hardware by employing embedded processor cores.
In the current technical environment, embedded processors and the necessary development tools are
designed manually, with very little automation. This is because the design and implementation of an
embedded processor, such as a DSP device embedded in a cellular phone, is a highly complex process
composed of the following phases: architecture exploration, architecture implementation, application
software design, and system integration and verification.
During the architecture exploration phase, software development tools (i.e., HLL compiler, assembler,
linker, and cycle-accurate simulator) are required to profile and benchmark the target application on
different architectural alternatives. This process is usually an iterative one that is repeated until a best
fit between selected architecture and target application is obtained. Every change to the architecture
specification requires a complete new set of software development tools. As these changes to the tools
are carried out mostly by hand, the result is a long, tedious, and extremely error-prone process.
Furthermore, the lack of automation makes it very difficult to match the profiling tools to an abstract
specification of the target architecture. In the architecture implementation phase, the specified processor
has to be converted into a synthesizable HDL model. With this additional manual transformation it is
quite obvious that considerable consistency problems arise between the architecture specification, the
software development tools, and the hardware implementation. During the software application design
phase, software designers need a set of production-quality software development tools. Since the software
application designer and the hardware processor designer place different requirements on software
development tools, new tools are required. For example, the processor designer needs a
cycle/phase-accurate simulator for hardware/software partitioning and profiling, which is very accurate
but inevitably slow, whereas the application designer demands more simulation speed than accuracy. At
this point, the complete software development tool-suite is usually re-implemented by hand; consistency
problems are self-evident. In the system integration and verification phase, co-simulation interfaces must
be developed to integrate the software simulator for the chosen architecture into a system simulation
environment. These interfaces vary with the architecture that is currently under test. Again, manual
modification of the interfaces is required with each change of the architecture.
The effort of designing a new architecture can be reduced significantly by using a retargetable approach
based on a machine description. The Language for Instruction Set Architectures (LISA) [2,3] was
developed for the automatic generation of consistent software development tools and synthesizable HDL
code. A LISA processor description covers the instruction-set, the behavioral, and the timing model of
the underlying hardware, thus providing all essential information for the generation of a complete set
of development tools including compiler, assembler, linker, and simulator. Moreover, it contains enough
micro-architectural detail to generate synthesizable HDL code of the modelled architecture. Changes to
the architecture are easily transferred to the LISA model and are applied automatically to the generated
tools and hardware implementation. In addition, the speed and functionality of the generated tools allow
their use even after product development has finished; consequently, there is no need to rewrite the
tools to bring them up to production quality. As an unambiguous abstraction of the real hardware, a LISA
model description bridges the gap between hardware and software design. It provides the software
developer with all required information and enables the hardware designer to synthesize the architecture
from the same specification the software tools are based on.
The chapter is organized as follows: Section 19.2 reviews existing approaches to machine description
languages and discusses their applicability to the design of application-specific instruction-set processors.
Section 19.3 presents an overview of a typical ASIP design flow using LISA: from specification to imple-
mentation. Moreover, different processor models are worked out which contain the required information
the tools need for their retargeting. In addition, sample LISA code segments are presented showing how the
different models are expressed in the LISA language. Section 19.4 introduces the LISA processor design
platform. Following that, the different areas of application are illuminated in more detail. In Section 19.5
the generated software development tools are presented with a focus on the different simulation
techniques that are applicable. Section 19.6 shows the path to implementation and gives results for a case
study that was carried out using the presented methodology. To demonstrate the quality of the generated
software development tools, Section 19.7 presents simulation benchmark results for modelled state-of-the-
art processors. In Section 19.8, requirements and limitations of the presented approach are explained.
Section 19.9 summarizes the chapter and gives an outlook on future research topics.
19.2 Related Work
Hardware description languages (HDLs) like VHDL or Verilog are widely used to model and simulate
processors, but mainly with the goal of developing hardware. Using these models for architecture exploration
and for generating production-quality software development tools has a number of disadvantages, especially
for cycle-based or instruction-level processor simulation. They cover a huge amount of hardware imple-
mentation detail which is not needed for performance evaluation, cycle-based simulation, and software
verification. Moreover, the description of detailed hardware structures has a significant impact on simula-
tion speed [4,5]. Another problem is that the extraction of the instruction set is a highly complex, manual
task, and some instruction-set information, for example, assembly syntax, cannot be obtained from HDL
descriptions at all.
There are many publications on machine description languages providing instruction-set models.
Most approaches using such models address retargetable code generation [6–9]. Other approaches
address retargetable code generation and simulation. The approaches of Maril [10], as part of the Marion
environment, and a system for VLIW compilation [11] both use latency annotation and reservation
tables for code generation. But models based on operation latencies are too coarse for cycle-accurate
simulation or even generation of synthesizable HDL code. The language nML was developed at TU
Berlin [12,13] and adopted in several projects [14–17]. However, the underlying instruction sequencer does
not allow describing the mechanisms of pipelining as required for cycle-based models. Processors with
more complex execution schemes and instruction-level parallelism like the Texas Instruments TMS320C6x
cannot be described, even at the instruction-set level, because of the numerous combinations of instruc-
tions. The same restriction applies to ISDL [18], which is very similar to nML. The language ISDL is an
enhanced version of the nML formalism and allows the generation of a complete tool-suite consisting
of HLL compiler, assembler, linker, and simulator. Even the possibility of generating synthesizable HDL
code is reported, but no results on the efficiency of the generated tools or on the generated HDL code are
given. The EXPRESSION language [19] allows cycle-accurate processor description based on a mixed
behavioral/structural approach. However, no results on simulation speed have been published, nor is it
clear whether it is feasible to generate synthesizable HDL code automatically. The FlexWare2 environment [20]
is capable of generating assembler, linker, simulator, and debugger from the Insulin formalism. A link to
implementation is non-existent, but test vectors can be extracted from the Insulin description to verify the
HDL model. The HLL compiler is derived from a separate description targeting the CoSy [21] framework.
Recently, various ASIP development systems have been introduced [22–24] for systematic co-design
of instruction-set and micro-architecture implementation using a given set of application benchmarks.
The PEAS-III system [25] is an ASIP development environment based on a micro-operation description
of instructions that allows the generation of a complete tool-suite consisting of HLL compiler, assembler,
linker, and simulator, including HDL code. However, no further information is given about the formalism
that parameterizes the tool generators, nor have any results been published on the efficiency of the generated
tools. The MetaCore system [26] is a benchmark-driven ASIP development system based on a formal
representation language. The system accepts a set of benchmark programs and estimates the hardware
cost and performance for the configuration under test. Following that, software development tools and
synthesizable HDL code are generated automatically. As the formal specification of the ISA is similar to
the ISPS formalism [27], complex pipeline operations such as flushes and stalls can hardly be modelled. In
addition, flexibility in designing the instruction-set is limited to a predefined set of instructions. Tensilica
Inc. customizes a RISC processor within the Xtensa system [28]. As the system is based on an architecture
template comprising quite a number of base instructions, it is far too powerful and thus not suitable for
highly application-specific processors, which in many cases employ only very few instructions.
Our interest in a complete retargetable tool-suite for architecture exploration, production-quality
software development, architecture implementation, and system integration for a wide range of embedded
processor architectures motivated the introduction of the LISA language used in our approach. In many
aspects, LISA incorporates ideas similar to nML. However, our experience with different DSP architectures
showed that significant limitations of existing machine description languages must be overcome to allow
the description of modern commercial embedded processors. For this reason, LISA includes improvements
in the following areas:
• Capability to provide cycle-accurate processor models, including constructs to specify pipelines
and their mechanisms such as stalls, flushes, operation injection, etc.
• Extension of the target class of processors to include SIMD, VLIW, and superscalar real-world
processor architectures.
• Explicit language statements addressing compiled simulation techniques.
• Distinction between the detailed bit-true description of operation behavior, including side-effects,
for simulation and implementation on the one hand, and the assignment to arithmetical func-
tions for the instruction-selection task of the compiler on the other hand, which allows the
abstraction level of the behavioral part of the processor model to be chosen freely.
• Strong orientation towards the programming languages C/C++; LISA is a framework which encloses
pure C/C++ behavioral operation descriptions.
• Support for instruction aliasing and complex instruction coding schemes.
19.3 ASIP Design Flow
Powerful application-specific programmable architectures are increasingly required in the DSP, multimedia,
and networking application domains in order to meet demanding cost and performance requirements.
The complexity of algorithms and architectures in these application domains prohibits an ad hoc imple-
mentation and calls for an elaborate design methodology with efficient tool support. In this
section, a seamless ASIP design methodology based on LISA is introduced. Moreover, it is
demonstrated how the outlined concepts are captured by the LISA language elements. The expressiveness
of the LISA formalism, providing high flexibility with respect to abstraction level and architecture category,
is especially valuable for the design of high-performance processors.
19.3.1 Architecture Exploration
The LISA-based methodology sets in after the algorithms intended for execution on the
programmable platform have been selected. The algorithm design is beyond the scope of LISA and is typically
performed in an application-specific system-level design environment, such as, for example, COSSAP
[29] for wireless communications or OPNET [30] for networking. The outcome of the algorithmic
exploration is a pure functional specification, usually represented by means of an executable prototype
written in a high-level language (HLL) like C, together with a requirements document specifying cost and
performance parameters. In the following, the steps of our proposed design flow depicted in Figure 19.1
are described, in which the ASIP designer successively refines the application jointly with the LISA model of
the programmable target architecture.
First the performance-critical algorithmic kernels of the functional specification have to be identified.
This task can easily be performed with a standard profiling tool that instruments the application
[Figure 19.1 sketches the four-phase flow: (1) the algorithmic kernel of the application is translated into an assembly program and profiled ISA-accurately (data) against a data-path model; (2) the full assembly program is profiled ISA-accurately (data + control) against an instruction model; (3) the revised assembly program is profiled cycle-accurately (data + control) against a cycle-true model; (4) an RTL model delivers hardware cost and timing as the exploration result.]
FIGURE 19.1 LISA based ASIP development flow.
code in order to generate HLL execution statistics during the simulation of the functional prototype. Thus
the designer becomes aware of the performance-critical parts of the application and is therefore prepared
to define the data path of the programmable architecture at the assembly instruction level. Starting
from a LISA processor model which implements an arbitrary basic instruction set, the LISA model can
be enhanced with parallel resources, special-purpose instructions, and registers in order to improve the
performance of the considered application. At the same time, the algorithmic kernel of the application
code is translated into assembly, making use of the specified special-purpose instructions. By employing
the assembler, linker, and processor simulator derived from the LISA model (cf. Section 19.5), the designer
can iteratively profile and modify the programmable architecture in cadence with the application until
both fulfill the performance requirements.
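The kernel-identification step described above can be sketched in plain C++. This is an illustrative assumption of how a functional prototype might be instrumented for HLL execution statistics; the PROFILE macro, the counter map, and the toy FIR kernel are inventions for this sketch, not part of any LISA tooling:

```cpp
#include <map>
#include <string>

// Hypothetical instrumentation: every entry into a function bumps a named
// counter, so after a simulation run the hot kernels dominate the statistics.
static std::map<std::string, long> g_exec_counts;
#define PROFILE(name) (++g_exec_counts[name])

// Toy FIR filter standing in for a performance-critical DSP algorithm.
int fir_tap(int x, int c) { PROFILE("fir_tap"); return x * c; }

int fir(const int* x, const int* c, int n) {
    PROFILE("fir");
    int acc = 0;
    for (int i = 0; i < n; ++i) acc += fir_tap(x[i], c[i]);
    return acc;  // g_exec_counts now exposes how often each kernel ran
}
```

Inspecting g_exec_counts after a run on representative input data plays the role of the HLL execution statistics that guide the definition of the data path.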
After the processing-intensive algorithmic kernels have been considered and optimized, the instruction set
needs to be completed. This is accomplished by adding instructions to the LISA model which are dedicated
to the low-speed control and configuration parts of the application. While these parts usually
represent major portions of the application in terms of code size, they have only negligible influence
on the overall performance. Therefore it is very often feasible to employ the HLL C-compiler derived
from the LISA model and accept suboptimal assembly code quality in return for a significant cut in
design time.
So far, the optimization has only been performed with respect to the software-related aspects, while
neglecting the influence of the micro-architecture. For this purpose the LISA language provides capabilities
to model the cycle-accurate behavior of pipelined architectures. The LISA model is supplemented by the
instruction pipeline, and the execution of all instructions is assigned to the respective pipeline stages. If the
architecture does not provide automatic interlocking mechanisms, the application code has to be revised
to take pipeline effects into account. Now the designer is able to verify that the cycle-true processor model
still satisfies the performance requirements.
At the last stage of the design flow, the HDL generator (see Section 19.6) can be employed to generate
synthesizable HDL code for the base structure and the control path of the architecture. After implementing
the dedicated execution units of the data path, reliable numbers on hardware cost and performance
parameters (e.g., design size, power consumption, clock frequency) can be derived by running the HDL
processor model through the standard synthesis flow. At this level of detail the designer can tweak
the computational efficiency of the architecture by applying different implementations of the data path
execution units.
19.3.2 LISA Language
The LISA language [2,3] aims at the formalized description of programmable architectures, their
peripherals, and their interfaces. LISA closes the gap between purely structure-oriented languages (VHDL,
Verilog) and instruction-set languages.
[Figure 19.2 maps the six components of a LISA model (memory model, resource model, behavioral model, instruction set model, timing model, and micro-architecture model) to the requirements of the generated tools: the HLL compiler (instruction selection, instruction scheduling, register and memory allocation), assembler and linker (instruction translation), simulator and debugger (operation simulation, simulation of storage, decoder/disassembler, profiling, display configuration), and HDL generator (basic structure, operation scheduling, write-conflict resolution, instruction decoder).]
FIGURE 19.2 Model requirements for ASIP design.
LISA descriptions are composed of resources and operations. The declared resources represent the storage
objects of the hardware architecture (e.g., registers, memories, pipelines) which capture the state of the
system. Operations are the basic objects in LISA. They represent the designer's view of the behavior,
the structure, and the instruction set of the programmable architecture. A detailed reference of the LISA
language can be found in Reference 31.
The process of generating software development tools and synthesizing the architecture requires inform-
ation on architectural properties and the instruction set definition, as depicted in Figure 19.2. These
requirements can be grouped into different architectural models; the entirety of these models consti-
tutes the abstract model of the target architecture. The LISA machine description provides information
consisting of the following model components:
The memory model. This lists the registers and memories of the system with their respective bit widths,
ranges, and aliasing. The compiler gets information on available registers and memory spaces. The
memory configuration is provided to perform object code linking. During simulation, the entirety of
storage elements represents the state of the processor, which can be displayed in the debugger. The HDL
code generator derives the basic architecture structure.
In LISA, the resource section lists the definitions of all objects which are required to build the memory
model. A sample resource section of the ICORE architecture described in Reference 32 is shown in
Figure 19.3. The resource section begins with the keyword RESOURCE followed by (curly) braces enclosing
all object definitions. The definitions are made in C-style and can be attributed with keywords like, for
example, REGISTER, PROGRAM_COUNTER, etc. These keywords are not mandatory, but they are used
to classify the definitions in order to configure the debugger display. The resource section in Figure 19.3
shows the declaration of the program counter, register file, memories, the four-stage instruction pipeline,
and the pipeline registers.
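For illustration, the state declared in such a resource section could map to a plain C++ structure inside a generated simulator. The sketch below mirrors the ICORE declarations of Figure 19.3; the C++ layout itself is an assumption made for this example, not generated code:

```cpp
#include <array>
#include <cstdint>

// One pipeline register instance sits between each pair of adjacent stages.
struct PipelineRegister {
    uint8_t opcode;    // bit[6] Opcode
    int16_t operandA;  // short operandA
    int16_t operandB;  // short operandB
};

// The entirety of these storage elements is the processor state that the
// debugger can display, as described in the memory model above.
struct ProcessorState {
    int32_t PC = 0;                          // PROGRAM_COUNTER int PC
    std::array<int32_t, 8> R{};              // REGISTER signed int R[0..7]
    std::array<int32_t, 256> RAM{};          // DATA_MEMORY signed int RAM[0..255]
    std::array<uint32_t, 256> ROM{};         // PROGRAM_MEMORY unsigned int ROM[0..255]
    std::array<PipelineRegister, 3> pipe{};  // FI/ID, ID/EX, EX/WB boundaries
};
```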
The resource model. This describes the available hardware resources and the resource requirements of
operations. Resources reflect properties of hardware structures which can be accessed exclusively by one
operation at a time. The instruction scheduling of the compiler depends on this information. The HDL
code generator uses this information for resource conflict resolution.
Besides the definition of all objects, the resource section in a LISA processor description provides
information about the availability of hardware resources. By this, the property of several ports, for
example, to a register bank or a memory is reflected. Moreover, the behavior section within LISA operations
announces the use of processor resources. This takes place in the section header using the keyword USES in
conjunction with the resource name and the information whether the used resource is read, written, or both (IN,
OUT, or INOUT, respectively).
RESOURCE
{
PROGRAM_COUNTER int PC;
REGISTER signed int R[0..7];
DATA_MEMORY signed int RAM[0..255];
PROGRAM_MEMORY unsigned int ROM[0..255];
PIPELINE ppu_pipe = { FI; ID; EX; WB };
PIPELINE_REGISTER IN ppu_pipe
{
bit[6] Opcode;
short operandA;
short operandB;
};
}
FIGURE 19.3 Specification of the memory model.
RESOURCE
{
REGISTER unsigned int R([0..7])6;
DATA_MEMORY signed int RAM([0..15]);
}
OPERATION NEG_RM {
BEHAVIOR
USES (IN R[];
OUT RAM[];)
{
/* C-code */
RAM[address] = (-1) * R[index];
}
}
FIGURE 19.4 Specification of the resource model.
For illustration purposes, a sample LISA code excerpt taken from the ICORE architecture is shown in Figure 19.4.
The definition of the availability of resources is carried out by enclosing the C-style resource definition
with round braces followed by the number of simultaneously allowed accesses. If the number is omitted,
one allowed access is assumed. The figure shows the declaration of a register bank and a memory with six
and one ports, respectively. Furthermore, the behavior section of the operation announces the use of these
hardware resources for read and write.
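The port counts of Figure 19.4 translate into a simple per-cycle booking scheme of the kind a scheduler could apply; the class below and its interface are our own illustration of that idea, not part of the LISA tool flow:

```cpp
#include <map>
#include <string>

// Register bank R allows six simultaneous accesses per cycle, the data
// memory RAM only one (as declared in Figure 19.4). A second RAM access in
// the same cycle is a port conflict and must be moved to the next cycle.
struct ResourceModel {
    std::map<std::string, int> ports{{"R", 6}, {"RAM", 1}};
    std::map<std::string, int> used;  // accesses booked in the current cycle

    bool try_book(const std::string& res) {
        if (used[res] >= ports[res]) return false;  // port conflict
        ++used[res];
        return true;
    }
    void next_cycle() { used.clear(); }  // all ports become free again
};
```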
The instruction set model. This identifies valid combinations of hardware operations and admissible
operands. It is expressed by the assembly syntax, the instruction word coding, and the specification of legal
operands and addressing modes for each instruction. Compilers and assemblers can identify instructions
based on this model. The same information is used in the reverse process of decoding and disassembling.
In LISA, the instruction set model is captured within operations. Operation definitions collect the
description of different properties of the instruction set model, which are defined in several sections:
• The CODING section describes the binary image of the instruction word.
• The SYNTAX section describes the assembly syntax of instructions, operands, and execution
modes.
• The SEMANTICS section specifies the transition function of the instruction.
OPERATION COMPARE_IMM {
DECLARE {
LABEL index;
GROUP src1, dest = { register };
}
CODING { 0b10011 index=0bx[5] src1 dest }
SYNTAX { "CMP" src1 "," index "," dest }
SEMANTICS { CMP (dest,src1,index) }
}
FIGURE 19.5 Specification of the instruction set model.
OPERATION register
{
DECLARE { LABEL index; }
CODING { index=0bx[4] }
EXPRESSION { R[index] }
}
OPERATION ADD {
DECLARE { GROUP src1,src2,dest = { register }; }
CODING { 0b010010 src1 src2 dest }
BEHAVIOR
{
/* C-code */
dest = src1 + src2;
saturate(&dest);
}
}
FIGURE 19.6 Specification of the behavioral model.
Figure 19.5 shows an excerpt of the ICORE LISA model contributing information on the compare-
immediate instruction to the instruction set model. The DECLARE section contains local declarations
of identifiers and admissible operands. Operation register is not shown in the figure but comprises the
definition of the valid coding and syntax for src1 and dest, respectively.
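Conceptually, a decoder derived from the CODING section of Figure 19.5 matches the fixed opcode bits and extracts the operand fields. The following C++ sketch assumes, purely for illustration, an 18-bit word with a 5-bit opcode, a 5-bit index, and 4-bit register fields; the actual ICORE field widths are not stated in the chapter:

```cpp
#include <cstdint>

// Assumed layout of COMPARE_IMM: bits [17:13] = opcode 0b10011,
// [12:8] = index, [7:4] = src1, [3:0] = dest. Illustrative only.
struct DecodedCmpImm {
    unsigned index, src1, dest;
};

// Returns true and fills `out` if `word` matches the COMPARE_IMM coding.
bool decode_compare_imm(uint32_t word, DecodedCmpImm* out) {
    if (((word >> 13) & 0x1F) != 0b10011) return false;  // fixed opcode bits
    out->index = (word >> 8) & 0x1F;  // 5-bit immediate label
    out->src1  = (word >> 4) & 0xF;   // register operand
    out->dest  = word & 0xF;          // register operand
    return true;
}
```

A generated disassembler would use the same field information in reverse, printing the SYNTAX template "CMP src1, index, dest" from the extracted fields.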
The behavioral model. This abstracts the activities of hardware structures to operations changing the
state of the processor for simulation purposes. The abstraction level of this model can range widely
between the hardware implementation level and the level of HLL statements.
The BEHAVIOR and EXPRESSION sections within LISA operations describe components of the beha-
vioral model. Here, the behavior section contains pure C-code that is executed during simulation, whereas
the expression section defines the operands and execution modes used in the context of operations. An
excerpt of the ICORE LISA model is shown in Figure 19.6. Depending on the coding of the src1, src2, and
dest fields, the behavior code of operation ADD works with the respective registers of register bank R. As
arbitrary C-code is allowed, function calls can be made to libraries which are later linked to the executable
software simulator.
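In plain C++, the BEHAVIOR section of the ADD operation in Figure 19.6 amounts to an add followed by saturation. The chapter does not define the saturate() helper, so the version below, which clamps a widened intermediate to the 32-bit range, is an assumption made for this sketch:

```cpp
#include <cstdint>
#include <limits>

// Assumed saturation helper: clamp a 64-bit intermediate to int32_t range.
// The real ICORE helper is not shown in the chapter.
void saturate(int64_t* v) {
    if (*v > std::numeric_limits<int32_t>::max()) *v = std::numeric_limits<int32_t>::max();
    if (*v < std::numeric_limits<int32_t>::min()) *v = std::numeric_limits<int32_t>::min();
}

// Mirrors "dest = src1 + src2; saturate(&dest);" from the BEHAVIOR section.
int32_t add_saturating(int32_t src1, int32_t src2) {
    int64_t dest = static_cast<int64_t>(src1) + src2;  // widen to catch overflow
    saturate(&dest);
    return static_cast<int32_t>(dest);
}
```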
The timing model. This specifies the activation sequence of hardware operations and units. The instruc-
tion latency information lets the compiler find an appropriate schedule and provides timing relations
between operations for simulation and implementation.
Several parts within a LISA model contribute to the timing model. First, there is the declaration of pipelines in
the resource section. The declaration starts with the keyword PIPELINE, followed by an identifying name
and the list of stages. Second, operations are assigned to pipeline stages by using the keyword IN and
providing the name of the pipeline and the identifier of the respective stage, such as:
OPERATION name_of_operation IN ppu_pipe.EX (19.1)
RESOURCE
{
PIPELINE ppu_pipe = { FI; ID; EX; WB };
}
OPERATION CORDIC IN ppu_pipe.EX
{
ACTIVATION { WriteBack }
BEHAVIOR {
PIPELINE_REGISTER(ppu_pipe, EX/WB).ResultE = cordic();
}
}
OPERATION WriteBack IN ppu_pipe.WB {
BEHAVIOR {
R[value] = PIPELINE_REGISTER(ppu_pipe, EX/WB).ResultE;
}
}
FIGURE 19.7 Specification of the timing model.
Third, the ACTIVATION section in the operation description is used to activate other operations in the
context of the current instruction. The activated operations are launched as soon as the instruction enters
the pipeline stage the activated operation is assigned to. Non-assigned operations are launched in the
pipeline stage of their activation.
To exemplify this, Figure 19.7 shows sample LISA code taken from the ICORE architecture. Oper-
ations CORDIC and WriteBack are assigned to stages EX and WB of pipeline ppu_pipe, respectively.
Here, operation CORDIC activates operation WriteBack, which will be launched in the following cycle
(in correspondence with the spatial ordering of the pipeline stages) in case of an undisturbed flow of the pipeline.
Moreover, in the ACTIVATION section, pipelines are controlled by means of the predefined functions stall,
shift, flush, insert, and execute, which are automatically provided by the LISA environment for each pipeline
declared in the resource section. All these pipeline control functions can be applied to single stages as well
as to whole pipelines, for example:
PIPELINE(ppu_pipe,EX/WB).stall(); (19.2)
Using this very flexible mechanism, arbitrary pipelines, hazards, and mechanisms like forwarding can be
modelled in LISA.
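The activation timing of Figure 19.7 can be sketched as a minimal two-stage simulation: an operation in EX deposits its result in the EX/WB pipeline register, and the activated WriteBack operation consumes it one cycle later. The cordic() stand-in, the valid flag, and the choice of R[0] as destination are assumptions made for this sketch:

```cpp
#include <array>

struct ExWbReg {            // models PIPELINE_REGISTER(ppu_pipe, EX/WB)
    int resultE = 0;
    bool valid = false;     // set when EX has produced a result this cycle
};

int cordic() { return 42; }  // placeholder for the real CORDIC datapath

// One simulated cycle: WB runs first on last cycle's pipeline register
// contents, then EX (if active) refills the register and "activates" WB.
void cycle(ExWbReg& ex_wb, std::array<int, 8>& R, bool run_ex) {
    if (ex_wb.valid) {            // WriteBack in WB stage
        R[0] = ex_wb.resultE;
        ex_wb.valid = false;
    }
    if (run_ex) {                 // CORDIC in EX stage
        ex_wb.resultE = cordic();
        ex_wb.valid = true;       // WriteBack launches next cycle
    }
}
```

Running EX in one cycle and observing the register file in the next reproduces the one-cycle offset between activation and launch described above.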
The micro-architecture model. This allows grouping of hardware operations into functional units and
contains the exact micro-architectural implementation of structural components such as adders, multi-
pliers, etc. This enables the HDL generator to generate the appropriate HDL code from a more abstract
specification.
In analogy to the syntax of the VHDL language, the grouping of operations into functional units is formalized
using the keyword ENTITY in the resource section of the LISA model, for example:
ENTITY Alu
{
Add, Sub
}
(19.3)
Here, LISA operations Add and Sub are assigned to the functional unit Alu. Information on the exact
micro-architectural implementation of structural components can be included in the LISA model,
for example, by calling DesignWare components [33] from within the behavior section or by inlining
HDL code.
19.4 LISA Processor Design Platform
The LISA processor design platform ( LPDP) is an environment that allows the automatic generation of
software development tools for architecture exploration, hardware implementation, software development
tools for application design, and hardwaresoftware co-simulation interfaces from one sole specication of
the target architecture in the LISA language. Figure 19.8 shows the components of the LPDP environment.
19.4.1 Hardware Designer Platform For Exploration and Processor
Generation
As indicated in Section 19.3, architecture design requires the designer to work in two fields (see Figure 19.9):
on the one hand, the development of the software part including compiler, assembler, linker, and simulator,
and on the other hand, the development of the target architecture itself.
The software simulator produces profiling data and thus may answer questions concerning the instruc-
tion set, the performance of an algorithm, and the required size of memory and registers. The required
silicon area or power consumption can only be determined in conjunction with a synthesizable HDL
model. To accommodate these requirements, the LISA hardware designer platform can generate the
following tools:
LISA language debugger for debugging the instruction set with a graphical debugger frontend.
Exploration C-compiler for the non-critical parts of the application.
Exploration assembler, which translates text-based instructions into object code for the respective
programmable architecture.
Exploration linker, which is controlled by a dedicated linker command file.
Instruction-set architecture (ISA) simulator providing extensive profiling capabilities, such as
instruction execution statistics and resource utilization.
Besides the ability to generate a set of software development tools, synthesizable HDL code (both VHDL
and Verilog) for the processor's control path and instruction decoder can be generated automatically from
the LISA processor description. This also comprises the pipeline and the pipeline controller, including complex
[Figure: the hardware designer (architecture exploration, architecture implementation), software designer (software application design), and system integrator (integration and verification) all work from one central LISA architecture specification, from which the C-compiler, assembler/linker, and simulator/debugger are generated for the application and the system on chip.]
FIGURE 19.8 LISA processor design environment.
A Novel Methodology for the Design of ASIPs 19-11
[Figure: exploration loop, in which the LISA description of the target architecture is processed by the language compiler into the LISA C-compiler, assembler, linker, and simulator, yielding evaluation results such as profiling data and execution speed; implementation loop, in which the generated HDL description is run through synthesis tools to a gate-level model, yielding evaluation results such as chip size, clock speed, and power consumption.]
FIGURE 19.9 Exploration and implementation.
interlocking mechanisms, forwarding, etc. For the data path, hand-optimized HDL code has to be inserted
manually into the generated model. This approach has been chosen because the data path typically represents
the critical part of the architecture in terms of power consumption and speed (critical path).
It is obvious that deriving both the software tools and the hardware implementation model from one single
specification of the architecture in the LISA language has significant advantages: only one model needs
to be maintained, changes to the architecture are applied automatically to the software tools and the
implementation model, and the consistency problem among the software tools and between software tools
and implementation model is reduced significantly.
19.4.2 Software Designer Platform For Software Application Design
To cope with the requirements of functionality and speed in the software design phase, the tools generated
for this purpose are an enhanced version of the tools generated during the architecture exploration phase. The
generated simulation tools are enhanced in speed by applying the compiled simulation principle [34]
where applicable and are faster by one to two orders of magnitude than the tools currently provided by
architecture vendors. The compiled simulation principle requires that the content of the program memory
not be changed during the simulation run, which holds true for most DSPs. However, for architectures
running the program from external memory or working with operating systems that load/unload
applications to/from internal program memory, this simulation technique is not suitable. For this purpose,
an interpretive simulator is also provided.
19.4.3 System Integrator Platform For System Integration and
Verification
Once the processor software simulator is available, it must be integrated and verified in the context of the
whole system (SOC), which can include a mixture of different processors, memories, and interconnect
components. In order to support system integration and verification, the LPDP system integrator
platform provides a well-defined application programmer interface (API) to interconnect the instruction-set
simulator generated from the LISA specification with other simulators. The API allows the
simulator to be controlled by stepping, running, and setting breakpoints in the application code, and it
provides access to the processor resources.
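Such a control API can be pictured roughly as follows. All names and the toy step semantics here are hypothetical, chosen only to illustrate stepping, running to a breakpoint, and resource access; they are not the actual LISA API:

```c
#include <stdbool.h>

#define MAX_BREAKPOINTS 16
#define NUM_REGS 8

typedef struct {
    unsigned pc;
    int      regs[NUM_REGS];
    unsigned breakpoints[MAX_BREAKPOINTS];
    int      num_breakpoints;
} Simulator;

static void sim_set_breakpoint(Simulator *s, unsigned addr) {
    if (s->num_breakpoints < MAX_BREAKPOINTS)
        s->breakpoints[s->num_breakpoints++] = addr;
}

static bool sim_hit_breakpoint(const Simulator *s) {
    for (int i = 0; i < s->num_breakpoints; i++)
        if (s->breakpoints[i] == s->pc) return true;
    return false;
}

/* Execute a single instruction (here just advancing the pc). */
static void sim_step(Simulator *s) { s->pc++; }

/* Run until a breakpoint is reached (or a safety limit expires). */
static unsigned sim_run(Simulator *s) {
    for (int guard = 0; guard < 100000; guard++) {
        sim_step(s);
        if (sim_hit_breakpoint(s)) break;
    }
    return s->pc;
}

/* Resource access, e.g., for a system-level debugger. */
static int sim_read_reg(const Simulator *s, int r) { return s->regs[r]; }
```

A system simulator would call such functions cycle by cycle or instruction by instruction, interleaving the processor model with the other simulators in the SOC.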
The following sections will present the different areas addressed by the LISA processor design platform
in more detail: software development tools and HDL code generation. Additionally, Section 19.7 will
demonstrate the high quality of the generated software development tools by comparing them with those
shipped by the processor vendors.
19.5 SW Development Tools
The ability to automatically generate HLL C-compilers, assemblers, linkers, and ISA simulators from
LISA processor models enables the designer to explore the design space rapidly. In this section, the specifics
and requirements of these tools are discussed, with particular focus on the different simulation techniques.
19.5.1 Assembler and Linker
The LISA assembler processes textual assembly source code and transforms it into linkable object code for
the target architecture. The transformation is characterized by the instruction-set information defined
in a LISA processor description. Besides the processor-specific instruction set, the generated assembler
provides a set of pseudo-instructions (directives) to control the assembling process and initialize data.
Section directives enable the grouping of assembled code into sections, which can be positioned separately
in memory by the linker. Symbolic identifiers for numeric values and addresses are standard assembler
features and are supported as well. Moreover, besides mnemonic-based instruction formats, C-like
algebraic assembly syntax can be processed by the LISA assembler.
The linking process is controlled by a linker command file, which keeps a detailed model of the target
memory environment and an assignment table of the module sections to their respective target memories.
Moreover, it is possible to provide the linker with an additional memory model which is separated from
the memory configuration in the LISA description and which allows linking code into external memories
that are outside the architecture model.
19.5.2 Simulator
Due to the large variety of architectures and the ability to develop models at different levels of abstraction
in the domains of time and architecture (see Section 19.3), the LISA software simulator incorporates several
simulation techniques, ranging from the most flexible interpretive simulation to more application- and
architecture-specific compiled simulation techniques.
Compiled simulators offer a significant increase in instruction (cycle) throughput; however, the
compiled simulation technique is not applicable in every case. To cope with this problem, the most appropriate
simulation technique for the desired purpose (debugging, profiling, verification), architecture
(instruction-accurate, cycle-accurate), and application (DSP kernel, operating system) can be chosen
before the simulation is run. An overview of the simulation techniques available in the generated LISA
simulator is given in the following:
The interpretive simulation technique is employed in most commercially available instruction set
simulators. In general, interpretive simulators run significantly slower than compiled simulators;
however, unlike compiled simulation, this simulation technique can be applied to any LISA model
and application.
Dynamically scheduled, compiled simulation reduces simulation time by performing the steps of instruction
decoding and operation sequencing prior to simulation. This technique cannot be applied to
models using external memories or applications consisting of self-modifying program code.
Besides the compilation steps performed in dynamic scheduling, static scheduling and code translation
additionally implement operation instantiation. While the latter technique is used for instruction-accurate
models, the former is suitable for cycle-accurate models including instruction pipelines.
Beyond that, the same restrictions apply as for dynamically scheduled simulation.
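For contrast, a minimal interpretive loop might look as follows in C. The instruction encoding and mnemonics are invented for illustration; the point is that decoding happens again on every execution of an instruction word, even inside loops:

```c
enum { OP_ADD = 0, OP_SUB = 1, OP_HALT = 2 };

typedef struct { int acc; unsigned pc; } Cpu;

/* Each 16-bit word: opcode in the high byte, immediate in the low byte. */
static int interp_run(Cpu *c, const unsigned short *prog) {
    for (;;) {
        unsigned short word = prog[c->pc++];
        int opcode = word >> 8;   /* decoded anew on every visit */
        int imm    = word & 0xff;
        switch (opcode) {
        case OP_ADD:  c->acc += imm; break;
        case OP_SUB:  c->acc -= imm; break;
        case OP_HALT: return c->acc;
        }
    }
}
```

The compiled techniques described below remove exactly this per-visit decode work from the simulation loop.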
A detailed discussion of the different compiled simulation techniques is given in the following sections,
while performance results are given in Section 19.7. The interpretive simulator is not discussed.
19.5.2.1 Compiled Simulation
The objective of compiled simulation is to reduce the simulation time. Considering instruction set simulation,
efficient run-time reduction can be achieved by performing repeatedly executed operations only
once, before the actual simulation is run, thus inserting an additional translation step between application
load and simulation. The preprocessing of the application code can be split into three major steps [35]:
1. Within the step of instruction decoding, instructions, operands, and modes are determined for
each instruction word found in the executable object file. In compiled simulation, the instruction
decoding is only performed once for each instruction, whereas interpretive simulators decode the
same instruction multiple times, for example, if it is part of a loop. This way, the instruction
decoding is completely omitted at run-time, thus reducing simulation time significantly.
2. Operation sequencing is the process of determining all operations to be executed for the accomplishment
of each instruction found in the application program. During this step, the program
is translated into a table-like structure indexed by the instruction addresses. The table lines contain
pointers to functions representing the behavioral code of the respective LISA operations.
Although all involved operations are identified during this step, their temporal execution order is
still unknown.
3. The determination of the operation timing (scheduling) is performed within the step of operation
instantiation and simulation loop unfolding. Here, the behavior code of the operations is instantiated
by generating the respective function calls for each instruction in the application program, thus
unfolding the simulation loop that drives the simulation into the next state.
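The first two steps can be sketched in C as follows. The encoding and helper names are invented, but the structure — a one-time decode pass filling a table of function pointers, followed by a decode-free run loop — mirrors the description above:

```c
typedef struct { int acc; } Cpu;
typedef void (*Behavior)(Cpu *c, int imm);

/* Behavioral code of the (toy) operations. */
static void op_add(Cpu *c, int imm) { c->acc += imm; }
static void op_sub(Cpu *c, int imm) { c->acc -= imm; }

typedef struct { Behavior fn; int imm; } DecodedInsn;

/* Steps 1 and 2: instruction decoding and operation sequencing,
 * performed exactly once per instruction before the simulation run. */
static void precompile(const unsigned short *prog, int n, DecodedInsn *out) {
    for (int i = 0; i < n; i++) {
        out[i].fn  = (prog[i] >> 8) ? op_sub : op_add;
        out[i].imm = prog[i] & 0xff;
    }
}

/* Simulation run: no decoding happens here any more. */
static int run(Cpu *c, const DecodedInsn *tab, int n) {
    for (int i = 0; i < n; i++) tab[i].fn(c, tab[i].imm);
    return c->acc;
}
```

Step 3, operation instantiation, would go one step further and replace the table walk by directly generated function calls, as discussed below.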
Besides fully compiled simulation, which incorporates all of the above steps, partial implementations
of the compiled principle are possible by performing only some of these steps. Each additional step
gives a further run-time reduction, but also requires a non-negligible amount of
compilation time. The trade-off between compilation time and simulation time is (qualitatively) shown
in Figure 19.10.
Two levels of compiled simulation are of particular interest: dynamic scheduling, and static
scheduling or code translation. In the case of dynamic scheduling, the task of selecting operations
from overlapping instructions in the pipeline is performed at run-time of the simulation. Static
scheduling already schedules the operations at compile-time.
[Figure: qualitative trade-off between compilation time and simulation time, from fully interpretive through compile-time decoding, dynamic scheduling (adding operation sequencing), and static scheduling/code translation (adding operation instantiation) to fully compiled.]
FIGURE 19.10 Levels of compiled simulation.
19.5.2.2 Dynamic Scheduling
As shown in Figure 19.10, dynamic scheduling performs instruction decoding and operation sequencing
at compile-time. However, the temporal execution order of LISA operations is determined at simulator
run-time. While the operation scheduling is rather simple for instruction-accurate models, it becomes a
complex task for models with instruction pipelines.
In order to reflect the instruction timing exactly and to consider all possibly occurring pipeline effects
like flushes and stalls, a generic pipeline model is employed, simulating the instruction pipeline at run-time.
The pipeline model is parameterized by the LISA model description and can be controlled via predefined
LISA operations. These operations include:
Insertion of operations into the pipeline (stages)
Execution of all operations residing in the pipeline
Pipeline shift
Removal of operations (flush)
Halt of the entire pipeline or particular stages (stall)
Unlike in statically scheduled simulation, operations are inserted into and removed from the pipeline
dynamically, that is, each operation injects further operations upon its execution. The information
about operation timing is provided in the LISA description, that is, by the activation section as well as the
assignment of operations to pipeline stages (see Section 19.3.2, timing model).
It is obvious that the maintenance of the pipeline model at simulation time is expensive. Execution
profiling on the generated simulators for the Texas Instruments TMS320C62xx [36] and TMS320C54x [37]
revealed that more than fifty percent of the simulator's run-time is consumed by the simulation of the
pipeline.
The situation could be improved by implementing the step of operation instantiation, consequently
superseding the need for pipeline simulation. This, in turn, implies static scheduling, in other words,
the determination of the operation timing due to overlapping instructions in the pipeline taking place at
compile-time.
Although there is no pipeline model in instruction-accurate processor models, it will be shown that
operation instantiation also gives a significant performance increase for these models. Beyond that, operation
instantiation is relatively easy to implement for instruction-accurate models (in contrast to pipelined
models).
19.5.2.3 Static Scheduling
Generally, operation instantiation can be described as the generation of an individual piece of (behavioral)
simulator code for each instruction found in the application program. While this is straightforward
for instruction-accurate processor models, cycle-true, pipelined models require a more sophisticated
approach.
Considering instruction-accurate models, the shortest temporal unit that can be executed is an instruction.
That means the actions to be performed for the execution of an individual instruction are determined
by the instruction alone. In the simulation of pipelined models, the granularity is defined by cycles. However,
since several instructions might be active at the same time due to overlapping execution, the actions
performed during a single cycle are determined by the respective state of the instruction pipeline. As a
consequence, instead of instantiating operations for each single instruction of the application program,
behavioral code for each occurring pipeline state has to be generated. Several such pipeline states might
exist for each instruction, depending on the execution context of the instruction, that is, the instructions
executed in the preceding and following cycles.
As pointed out previously, the principle of compiled simulation relies on an additional translation step
taking place before the simulation is run. This step is performed by a so-called simulation compiler, which
implements the three steps presented in Section 19.5.2.1. Obviously, the simulation compiler is a highly
architecture-specific tool, which is therefore retargeted from the LISA model description.
19.5.2.3.1 Operation Instantiation
The objective of static scheduling is the determination of all possible pipeline states according to the
instructions found in the application program. For purely sequential pipeline flow, that is, in the case that
no control hazards occur, the determination of the pipeline states can be achieved simply by overlapping
consecutive instructions subject to the structure of the pipeline. In order to store the generated pipeline
states, pipeline state tables are used, providing an intuitive representation of the instruction flow in the
pipeline. Inserting instructions into pipeline state tables is referred to as scheduling in the following.
A pipeline state table is a two-dimensional array storing pointers to LISA operations. One dimension
represents the location within the application, the other the location within the pipeline, that is, the stage in
which the operation is executed. When a new instruction has to be inserted into the state table, both intra-instruction
and inter-instruction precedence must be considered to determine the table elements in which
the corresponding operations will be entered. Consequently, the actual time at which an operation is executed
depends on the scheduling of the preceding instruction as well as the scheduling of the operation(s)
assigned to the preceding pipeline stage within the current instruction. Furthermore, control hazards
causing pipeline stalls and/or flushes influence the scheduling of the instructions following the occurrence
of the hazard.
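A simplified sketch of such a pipeline state table and of scheduling an instruction into it might look as follows in C. Stage names, table sizes, and the scheduling helper are illustrative only; stall and flush handling are omitted:

```c
#include <stddef.h>

#define STAGES 4 /* FE, DC, EX, WB */
#define ROWS 16

typedef void (*Op)(void);

static void nop(void) {} /* stand-in for a LISA operation's behavior */

/* Rows are cycles, columns are pipeline stages. */
static Op table[ROWS][STAGES];
static int next_row[STAGES]; /* first free row per stage */

/* Schedule one instruction given one operation per stage (NULL = none).
 * Each operation lands one row below its predecessor stage's operation
 * (intra-instruction precedence) and below the previous instruction's
 * entry in the same stage (inter-instruction precedence). Bounds
 * checking against ROWS is omitted for brevity. */
static void schedule(const Op ops[STAGES]) {
    int row = next_row[0];
    for (int s = 0; s < STAGES; s++) {
        if (!ops[s]) continue;
        if (row < next_row[s]) row = next_row[s]; /* inter-instruction */
        table[row][s] = ops[s];
        next_row[s] = row + 1;
        row++; /* intra-instruction: next stage runs one cycle later */
    }
}
```

Scheduling two consecutive instructions this way produces the familiar diagonal overlap: the second instruction's fetch sits one row below the first's, its decode one row below that, and so on.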
A simplified illustration of the scheduling process is given in Figure 19.11. Figure 19.11(a) shows the
pipeline state table after a branch instruction has been inserted, composed of the operations fetch, decode,
branch, and update_pc as well as a stall operation. The table columns represent the pipeline stages,
the rows represent consecutive cycles (with earlier cycles in upper rows). The arrows indicate activation
chains.
The scheduling of a new instruction always follows the intra-instruction precedence, that is, fetch
is scheduled before decode, decode before branch, and so on. The appropriate array element for fetch is
determined by its assigned pipeline stage (FE) and according to inter-instruction precedences. Since the
branch instruction follows the add instruction (which has already been scheduled), the fetch operation is
inserted below the first operation of add (not shown in Figure 19.11[a]). The other operations are inserted
according to their precedences.
The stall of pipeline stage FE, which is issued from the decode operation of branch, is processed by
tagging the respective table element as stalled. When the next instruction is scheduled, the stall is accounted
for by moving the decode operation to the next table row, that is, the next cycle (see Figure 19.11[b]). Pipeline
flushes are handled in a similar manner: if a selected table element is marked as flushed, the scheduling of
the current instruction is abandoned.
Assuming purely sequential instruction flow, the task of establishing a pipeline state table for the entire
application program is very straightforward. However, every (sensible) application contains a certain
amount of control flow (e.g., loops) interrupting this sequential execution. The occurrence of such control
flow instructions makes the scheduling process extremely difficult or, in a few cases, even impossible.
[Figure: two pipeline state tables over the stages FE, DC, EX, WB, showing the operations fetch, decode, branch, upd_pc, add, sub, incr, and write_r; panel (a) after the branch instruction has been inserted, with the FE entry tagged as stalled, and panel (b) after the stall has moved the following instruction's decode operation to the next cycle.]
FIGURE 19.11 Inserting instructions into pipeline state table.
[Figure: cycle-by-cycle pipeline states (stages PF, FE, DC, AC, RD, EX) for a conditional branch "BC addr" as it moves through the pipeline together with the instructions i1, i4..i10 of the fall-through path and k1..k5 of the taken path; both prescheduled paths are shown up to the cycle in which the condition is evaluated, with the address/instruction mapping a1: i1, a2-a3: BC addr, a4: i4, a5: i5, ..., b1: k1.]
FIGURE 19.12 Pipeline behavior for a conditional branch.
Generally, all instructions modifying the program counter cause interruptions in the control flow. Furthermore,
only instructions providing an immediate target address, that is, branches and calls whose target
address is known at compile-time, can be scheduled statically. If indirect branches or calls occur, it is
inevitable to switch back to dynamic scheduling at run-time.
Fortunately, most control flow instructions can be scheduled statically. Figure 19.12 shows, as an example,
the pipeline states for a conditional branch instruction as found in the TMS320C54x's instruction set.
Since the respective condition cannot be evaluated until the instruction is executed, scheduling has to
be performed for both eventualities (condition true or false, respectively), splitting the program into alternative
execution paths. The selection of the appropriate block of prescheduled pipeline states is performed by
switching among different state tables at simulator run-time. In order to prevent doubling the entire
pipeline state table each time a conditional branch occurs, alternative execution paths are left as soon as
an already generated state has been reached. Unless several conditional instructions reside in the pipeline
at the same time, these paths usually have a length of only a few rows.
19.5.2.3.2 Simulator Instantiation
After all instructions of the application program have been processed, and thus the entire operation
schedule has been established, the simulator code can be instantiated. The simulation compiler backend
thereby generates either C code or an operation table with the respective function pointers, both describing
alternative representations of the application program. Figure 19.13 shows a simplified excerpt of
the generated C code for a branch instruction. Cases represent instructions, while a new line starts a new
cycle.
switch (pc) {
case 0x1584: fetch(); decode(); sub(); write_registers();
case 0x1585: fetch(); decode(); test_condition(); add();
case 0x1586: branch(); write_registers();
fetch(); update_pc();
fetch(); decode();
fetch(); decode(); load(); goto_0x1400_;
}
FIGURE 19.13 Generated simulator code.
19.5.2.4 Instruction-Based Code Translation
The need for a scheduling mechanism arises from the presence of an instruction pipeline in the LISA
model. However, even instruction-accurate processor models without a pipeline benefit from the step of
operation instantiation. The technique applied here is called instruction-based code translation. Due to
the absence of instruction overlap, simulator code can be instantiated for each instruction independently,
thus simplifying simulator generation to the concatenation of the respective behavioral code specified
in the LISA description.
In contrast to direct binary-to-binary translation techniques [38], the translation of target-specific into
host-specific machine code uses C source code as an intermediate format. This keeps the simulator portable,
and thus independent of the simulation host.
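A toy version of this translation step might look as follows. The instruction encoding and the emitted behavioral code are invented, but the principle — emitting a piece of C source text per target instruction and concatenating the results — is the one described above:

```c
#include <stdio.h>
#include <string.h>

/* Translate one invented instruction word into a line of C source.
 * High byte: opcode (0 = add, otherwise sub); low byte: immediate. */
static void translate_insn(unsigned short word, char *out, size_t n) {
    int opcode = word >> 8, imm = word & 0xff;
    if (opcode == 0) snprintf(out, n, "acc += %d;", imm);
    else             snprintf(out, n, "acc -= %d;", imm);
}

/* Concatenate the behavioral code of a whole (tiny) program, yielding
 * the C intermediate that would be compiled for the simulation host. */
static void translate_prog(const unsigned short *prog, int cnt,
                           char *out, size_t n) {
    out[0] = '\0';
    for (int i = 0; i < cnt; i++) {
        char line[64];
        translate_insn(prog[i], line, sizeof line);
        strncat(out, line, n - strlen(out) - 1);
        strncat(out, "\n", n - strlen(out) - 1);
    }
}
```

The generated C text is then compiled by the host compiler, which is precisely what keeps the approach portable across simulation hosts.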
Since instruction-based code translation generates program code that increases linearly in size with
the number of instructions in the application, the use of this simulation technique is restricted to small
and medium-sized applications (less than 10k instructions, depending on model complexity). For large
applications, the resulting worse cache utilization on the simulation host reduces the performance of the
simulator significantly.
19.6 Architecture Implementation
As we are targeting the development of application-specific instruction set processors (ASIPs), which are
highly optimized for one specific application domain, the HDL code generated from a LISA processor
description has to fulfill tight constraints to be an acceptable replacement for HDL code handwritten by
experienced designers. In particular, power consumption, chip area, and execution speed are critical points
for this class of architectures. For this reason, the LPDP platform does not claim to be able to efficiently
synthesize the complete HDL code of the target architecture. The data path of an architecture in particular is
highly critical and must in most cases be optimized manually. Frequently, full-custom design techniques
must be used to meet power consumption and clock speed constraints. For this reason, the generated
HDL code is limited to the following parts of the architecture:
Coarse processor structure, such as the register set, pipeline, pipeline registers, and test interface.
Instruction decoder, setting the data and control signals which are carried through the pipeline and
activate the respective functional units executed in the context of the decoded instruction.
Pipeline controller, handling the different pipeline interlocks and pipeline register flushes and supporting
mechanisms such as data forwarding.
Additionally, hardware operations as described in the LISA model can be grouped into functional
units (see Section 19.3.2, micro-architecture model). These functional units are generated as wrappers,
that is, the ports of the functional units as well as the interconnects to the pipeline registers and other
functional units are generated automatically, while the content needs to be filled in manually with code.
Driver conflicts emerging in the context of the interconnects are resolved automatically by the insertion of
multiplexers.
The disadvantage of writing the data path in the HDL description by hand is that the behavior of
hardware operations within those functional units has to be described and maintained twice: on the one
hand in the LISA model and on the other hand in the HDL model of the target architecture. Consequently,
verification is a problem here, which will be addressed in future research.
19.6.1 LISA Language Elements for HDL Synthesis
The following sections will show in detail how the different parts of the LISA model contribute to the generated
HDL model of the target architecture.
19.6.1.1 The Resource Section
The resource section provides general information about the structure of the architecture (e.g., registers,
memories, and pipelines; see Section 19.3.2, resource/memory model). Based on this information, the
coarse structure of the architecture can be generated automatically. Figure 19.14 shows an excerpt of the resource
declaration of the LISA model of the ICORE architecture [32], which was used in our case study.
The ICORE architecture has two different register sets: one for general purpose use, named R,
consisting of eight separate 32-bit registers, and one for the address registers, named AR,
consisting of four elements of eleven bits each. The round brackets indicate the maximum number of
simultaneous accesses allowed for the respective register bank: six for the general purpose register set R
and one for the address register set. From that, the respective number of access ports to the register banks
can be generated automatically. With this information (bit-true widths, ranges, and access ports), the
register banks can easily be synthesized. Moreover, a data and a program memory resource are declared,
both 32 bits wide and with just one allowed access per cycle. Since the various memory types are
generally very technology dependent, but cannot be further specified in the LISA model, wrappers
are generated with the appropriate number of access ports. Before synthesis, the wrappers need to be filled
manually with code for the respective technology. The resources labelled as PORT are accessible from
outside the model and can be attached to a testbench; in the ICORE these are the RESET pin and the STATE_BUS.
Besides the processor resources such as memories, ports, and registers, pipelines and pipeline
registers are also declared. The ICORE architecture contains a four-stage instruction pipeline consisting of the
stages FI (instruction fetch), ID (instruction decode), EX (instruction execution), and WB (write-back
to registers). Between those pipeline stages, pipeline registers are located which forward information
about the instruction, such as the instruction opcode, operand registers, etc. The declared pipeline registers
are instantiated multiple times between the stages and are completely generated from the LISA model. For the
pipeline and the stages, entities are created which, in a subsequent phase of the HDL generator run, are
filled with code for the functional units, instruction decoder, pipeline controller, etc.
RESOURCE
{
REGISTER S32 R([0..7])6; /* GP Registers */
REGISTER bit[11] AR([0..3]); /* Address Registers */
DATA_MEMORY S32 RAM([0..255]); /* Memory Space */
PROGRAM_MEMORY U32 ROM([0..255]);/* Instruction ROM */
PORT bit[1] RESET; /* Reset pin */
PORT bit[32] STATE_BUS; /* Processor state bus */
PIPELINE ppu_pipe = { FI; ID; EX; WB };
PIPELINE_REGISTER IN ppu_pipe {
bit[6] Opcode;
...
};
}
FIGURE 19.14 Resource declaration in the LISA model of the ICORE architecture.
[Figure: entity hierarchy in three layers — the base structure, with the architecture entity containing register, memory, and pipeline entities; the pipeline structure, with stage entities (e.g., FE, DC, EX) separated by pipeline registers (FE/DC, DC/EX); and the stage structure, with LISA entities such as Branch, ALU, and Shifter residing in their assigned stages.]
FIGURE 19.15 Entity hierarchy in generated HDL model.
19.6.1.2 Grouping Operations to Functional Units
The LISA language describes the target architecture's behavior and timing at the granularity of hardware
operations, whereas synthesis requires the grouping of hardware operations into functional units that
can then be filled with hand-optimized HDL code for the data path. A well-known construct from the
VHDL language was adopted for this purpose: the ENTITY (see Section 19.3.2, micro-architecture
model). Using the ENTITY to group hardware operations into a functional unit provides essential
information not only for the HDL code generator but also for retargeting the HLL C-compiler, which requires
information about the availability of hardware resources to schedule instructions.
As indicated in Section 19.6.1.1, the HDL code derived from the LISA resource section already comprises
a pipeline entity including further entities for each pipeline stage and the respective pipeline registers. The
entities defined in the LISA model now become part of the respective pipeline stages, as shown in Figure 19.15.
Here, a Branch entity is placed into the entity of the decode stage. Moreover, the EX stage contains an
ALU and a Shifter entity. As it is possible in LISA to assign hardware operations to pipeline stages, this
information is sufficient to locate the functional units within the pipeline stages they are assigned to.
As already pointed out, the entities of the functional units are wrappers which need to be filled with
HDL code by hand. Nevertheless, Section 19.6.2.1 will show that by far the largest part of the target
architecture can be generated automatically from a LISA model.
19.6.1.3 Generation of the Instruction Decoder
The generated HDL decoder is derived from information in the LISA model on the coding of instructions
(see Section 19.3.2, instruction-set model). Depending on the structuring of the LISA architecture
description, decoder processes are generated in several pipeline stages. The specified signal paths within
the target architecture can be divided into data signals and control signals. The control signals are a
straightforward derivation of the operation activation tree, which is part of the LISA timing model (see
Section 19.3.2, timing model). The data signals are modelled explicitly by the designer, by writing values
into pipeline registers, and fixed implicitly by the declaration of the used resources in the behavior sections of
LISA operations.
19.6.2 Implementation Results
The ICORE, which was used in our case study, is a low-power application-specific instruction set processor
(ASIP) for DVB-T acquisition and tracking algorithms. It has been developed in cooperation with Infineon
Technologies. The primary tasks of this architecture are FFT window positioning, sampling-clock synchronization
for interpolation/decimation, and carrier frequency offset estimation. In a previous project
this architecture was completely designed by hand using semi-custom design. Thereby, a large amount of
effort was spent in optimizing the architecture towards extremely low power consumption while keeping
the clock frequency up at 120 MHz. At that time, a LISA model had already been realized for architecture
exploration purposes and for verifying the model against the handwritten HDL implementation.
[Figure: the ICORE architecture. An instruction-fetch unit feeds a four-stage pipeline (FI, ID, EX, WB) with pipeline control, decoders, and write-back logic; the functional units include DAG, ZOLP, Branch, Addsub, Bitmanip, IIC, MOVE, ALU, Minmax, Mult, and Shifter, together with registers, memory, and I/O control. The diagram distinguishes automatically generated processes/entities from manual entities, and the data path from the control path.]
FIGURE 19.16 The complete generated HDL model.
Except for the data path within the functional units, the HDL code of the architecture has been generated
completely. Figure 19.16 shows the composition of the model.
The dark boxes have been filled manually with HDL code, whereas the light boxes and interconnects
are the result of the generation process.
19.6.2.1 Comparison of Development Time
The LISA model of the ICORE as well as the original handwritten HDL model of the ICORE architecture
were developed by one designer. The initial manual realization of the HDL model (without the time
needed for architecture exploration) took approx. three months. As already indicated, a LISA model was
built in this first realization of the ICORE for architecture exploration and verification purposes. It took
the designer approx. one month to learn the LISA language and to create a cycle-accurate LISA model.
After completion of the HDL generator, it took another two days to refine the LISA model to
RTL accuracy. The handwritten functional units (data path), which were added manually to the generated
HDL model, could be completed in less than a week.
This comparison clearly indicates that the time-expensive work in realizing the HDL model was to create
the structure, controller, and decoder of the architecture. In addition, a major decrease in total architecture
design time can be seen, as the LISA model results from the design exploration phase.
19.6.2.2 Gate Level Synthesis
To verify the feasibility of automatically generating HDL code from LISA architecture descriptions in terms
of power consumption, clock speed, and chip area, a gate-level synthesis was carried out. The model has
not been changed (i.e., manually optimized) to enhance the results.
A Novel Methodology for the Design of ASIPs 19-21
19.6.2.2.1 Timing and Size Comparison
The results of the gate-level synthesis regarding timing and area optimization were compared to the
handwritten ICORE model, which comprised the same architectural features. Moreover, the same synthesis
scripts were used for both models. It shall be emphasized that the performance values are nearly the
same for both models. Furthermore, it is interesting that the same critical paths were found in both
the handwritten and the generated model. The critical paths occur exclusively in the data path, which
confirms the presumption that the data path is the most critical part of the architecture and should thus
not be generated automatically from an abstract processor model.
19.6.2.2.2 Critical Path
The synthesis has been performed with a clock period of 8 nsec, which equals a frequency of 125 MHz. The critical
path, running from a pipeline register through the shifter unit and a multiplexer to the next pipeline register,
violates this timing constraint by 0.36 nsec. This matches the handwritten ICORE model, which was
manually improved at this point at the gate level.
The longest combinatorial path of the ID stage runs through the decoder and the DAG entity and amounts to
3.7 nsec. Therefore, the generated decoder does not affect the critical path in any way.
19.6.2.2.3 Area
The synthesized area was a minor criterion, due to the fact that the constraints for the handwritten
ICORE model are not area sensitive. The total area of the generated ICORE model is 59,009 gates. The
combinational area takes 57% of the total area. The handwritten ICORE model takes a total area of 58,473
gates.
The most complex part of the generated ICORE is the decoder. The area of the automatically generated
decoder in the ID stage is 4693 gates, whereas the area of the handwritten equivalent is 5500 gates. This
result must be considered carefully, as the control logic varies in some implemented features; for example,
the handwritten decoder and program flow controller support an idle and a suspended state of the core.
19.6.2.2.4 Power Consumption Comparison
Figure 19.17 shows the comparison of power consumption of the handwritten versus the generated ICORE
realization.
The handwritten model consumes 12.64 mW, whereas the implementation generated from a LISA
model consumes 14.51 mW. The slightly worse power-consumption numbers of the generated model
versus the handwritten one are due to the early version of the LISA HDL generator, which in its
current state allows access to all registers and memories within the model via the test interface. Without
this unnecessary overhead, the same results as for the hand-optimized model are achievable.
[Bar chart: power consumption in mW of the handwritten ICORE (12.64 mW) versus the generated ICORE (14.51 mW).]
FIGURE 19.17 Power consumption of different ICORE realizations.
FIGURE 19.18 Graphical debugger frontend.
To summarize, this chapter has shown that it is feasible to generate efficient HDL code from
architecture descriptions in the LISA language.
19.7 Tools for Application Development
The LPDP application software development tool-suite includes an HLL C-compiler, assembler, linker,
and simulator, as well as a graphical debugger frontend. With these tools, a complete software devel-
opment environment is available, ranging from the C/assembly source file up to simulation within a
comfortable graphical debugger frontend.
The tools are an enhanced version of those used for architecture exploration. For the software simulator,
the enhancements concern the ability to graphically visualize the debugging process of the applic-
ation under test. The LISA debugger frontend ldb is a generic GUI for the generated LISA simulator (see
Figure 19.18). It visualizes the internal state of the simulation process. Both the C source code and the
disassembly of the application, as well as all configured memories and (pipeline) registers, are displayed.
All contents can be changed in the frontend at run-time of the application. The progress of the simulator
can be controlled by stepping and running through the application and setting breakpoints.
The code generation tools (assembler and linker) are enhanced in functionality as well. The assembler
supports more than 30 common assembler directives, labels, and symbols, named user sections, and the generation
of a source listing and symbol table, and provides detailed error reporting and debugging facilities. The
linker is driven by a powerful linker command file with the ability to link sections into different address
spaces, paging support, and the possibility to define user-specific memory models.
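The idea of a linker command file that maps named user sections into different address spaces can be sketched as follows. The region names, base addresses, and section names are hypothetical, not taken from the actual LPDP linker syntax.

```python
# Sketch: placing named sections into separate address spaces, as a linker
# command file might direct. Regions and addresses are invented.

MEMORY = {
    "P_MEM": {"base": 0x0000, "size": 0x2000},  # program address space
    "D_MEM": {"base": 0x8000, "size": 0x1000},  # data address space
}

# user sections -> target memory region
PLACEMENT = {".text": "P_MEM", ".data": "D_MEM", ".bss": "D_MEM"}

def link(sections):
    """sections: {name: size_in_words}; returns {name: start_address}."""
    cursor = {region: cfg["base"] for region, cfg in MEMORY.items()}
    layout = {}
    for name, size in sections.items():
        region = PLACEMENT[name]
        start = cursor[region]
        if start + size > MEMORY[region]["base"] + MEMORY[region]["size"]:
            raise MemoryError(f"{name} overflows {region}")
        layout[name] = start            # assign section to the region
        cursor[region] = start + size   # advance fill pointer in that region
    return layout

layout = link({".text": 0x400, ".data": 0x80, ".bss": 0x40})
```

Sections bound to the same region are packed consecutively, while regions themselves may sit in disjoint address spaces, which is the essence of the memory-model facility described above.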
19.7.1 Examined Architectures
To examine the quality of the generated software development tools, four different architectures have been
considered. The architectures were carefully chosen to cover a broad range of architectural characteristics
and are widely used in the field of digital signal processing (DSP) and microcontrollers (µC). Moreover,
the abstraction level of the models ranges from phase accuracy (TMS320C62x) to instruction-set accuracy
(ARM7):
ARM7. The ARM7 core is a 32 bit microcontroller of Advanced RISC Machines Ltd [39]. The realization
of a LISA model of the ARM7 µC at instruction-set accuracy took approx. two weeks.
ADSP2101. The Analog Devices ADSP2101 is a 16 bit fixed-point DSP with a 20 bit instruction-word
width [40]. The realization of the LISA model of the ADSP2101 at cycle accuracy took approx. 3 weeks.
TMS320C54x. The Texas Instruments TMS320C54x is a high-performance 16 bit fixed-point DSP with
a six-stage instruction pipeline [37]. The realization of the model at cycle accuracy (including pipeline
behavior) took approx. 8 weeks.
TMS320C62x. The Texas Instruments TMS320C62x is a general-purpose fixed-point DSP based on a
very long instruction word (VLIW) architecture containing an eleven-stage pipeline [36]. The realization
of the model at phase accuracy (including pipeline behavior) took approx. 6 weeks.
These architectures were modelled at the respective abstraction level with LISA, and software devel-
opment tools were generated successfully. The speed of the generated tools was then compared with the
tools shipped by the respective architecture vendor. Of course, the LISA tools work
on the same level of accuracy as the vendor tools. The vendor tools exclusively use the interpretive
simulation technique.
19.7.2 Efficiency of the Generated Tools
Measurements took place on an AMD Athlon system with a clock frequency of 800 MHz. The system is
equipped with 256 MB of RAM and is part of the networking system. It runs under the Linux operating
system, kernel version 2.2.14. Tool compilation was performed with GNU GCC, version 2.92.
The generation of the complete tool-suite (HLL C-compiler, simulator, assembler, linker, and debugger
frontend) takes, depending on the complexity of the considered model, between 12 sec (ARM7 µC,
instruction-set accurate) and 67 sec (C6x DSP, phase accurate). Due to the early stage of research on the
retargetable compiler (see Section 19.8), no results on code quality are presented.
19.7.2.1 Performance of the Simulator
Figures 19.19 to 19.22 show the speed of the generated simulators in instructions per second and cycles
per second, respectively. Simulation speed was quantified by running an application on the respective
simulator and counting the number of processed instructions/cycles.
The set of applications simulated on the architectures comprises a simple 20-tap FIR filter, an ADPCM
G.721 (Adaptive Differential Pulse Code Modulation) coder/decoder, and a GSM speech codec. For the
ARM7, an ATM-QFC protocol application was additionally run, which is responsible for flow control and
configuration in an ATM port processor chip.
As expected, the compiled simulation technique applied by the generated LISA simulators outperforms
the vendor simulators by one to two orders of magnitude.
[Bar chart: simulation speed in mega-instructions per second for the FIR, ADPCM, and ATM-QFC benchmarks, comparing LISA compiled simulation (code translation), LISA compiled simulation (dynamic scheduling), the interpretive ARMulator, and the real ARM7 hardware running at 25 MHz.]
FIGURE 19.19 Speed of the ARM7 µC at instruction-accuracy.
[Bar chart: simulation speed in megacycles per second for the FIR, ADPCM, and GSM benchmarks, comparing LISA compiled simulation (code translation), LISA compiled simulation (dynamic scheduling), and the interpretive Analog Devices xsim 2101 simulator (0.01 megacycles/sec on all three benchmarks).]
FIGURE 19.20 Speed of the ADSP2101 DSP at cycle-accuracy.
[Bar chart: simulation speed in megacycles per second for the FIR, ADPCM, and GSM benchmarks, comparing LISA compiled simulation (static scheduling), LISA compiled simulation (dynamic scheduling), and the interpretive Texas Instruments sim54x simulator (0.075 megacycles/sec on all three benchmarks).]
FIGURE 19.21 Speed of C54x DSP at cycle-accuracy.
[Bar chart: simulation speed in kilocycles per second for the FIR, ADPCM, and GSM benchmarks, comparing the interpretive Texas Instruments simulator (15 kilocycles/sec on all three benchmarks) with LISA compiled simulation (dynamic scheduling).]
FIGURE 19.22 Speed of the C6x DSP at cycle-accuracy.
As both the ARM7 and the ADSP2101 LISA models contain no instruction pipeline, two different flavors of
compiled simulation are applied in the benchmarks: instruction-based code translation and dynamic
scheduling (see Section 19.5.2.4). It shows that the highest possible degree of simulation compilation
offers an additional speed-up of a factor of 2 to 7 compared to dynamically scheduled compiled simulation.
As explained in Section 19.5.2.4, the speed-up decreases with bigger applications due to cache misses on
the simulating host. It is interesting to see that, considering an ARM7 µC running at a frequency of
25 MHz, the software simulator running at 31 MIPS even outperforms the real hardware. This enables
application development before the actual silicon is at hand.
The LISA model of the C54x DSP is cycle-accurate and contains an instruction pipeline. Therefore, com-
piled simulation with static scheduling is applied (see Section 19.5.2.3). This pays off with an additional
speed-up of a factor of 5 compared to a dynamically scheduled compiled simulator.
Due to the superscalar instruction dispatching mechanism used in the C62x architecture, which is highly
run-time dependent, the LISA simulator for the C62x DSP uses only compiled simulation with dynamic
scheduling. However, the dynamically scheduled compiled simulator still offers a significant speed-up of a
factor of 65 compared to the native TI simulator.
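The difference between interpretive and compiled simulation that underlies these speed-ups can be sketched on a toy instruction set (the opcodes and behaviors below are invented): an interpretive simulator repeats the decode step on every executed instruction, whereas a compiled simulator hoists decoding to load time and runs a pre-decoded behavior list.

```python
# Sketch contrasting interpretive simulation (decode on every execution)
# with compiled simulation (decode once, ahead of time). Toy ISA, invented.

PROGRAM = [0x10, 0x11, 0x10, 0x12]  # opcode stream

def decode(opcode):
    # stand-in for the expensive decode step of a real simulator
    return {0x10: lambda s: s + 1,
            0x11: lambda s: s * 2,
            0x12: lambda s: s - 3}[opcode]

def run_interpretive(program, state=0):
    for op in program:
        state = decode(op)(state)   # decode repeated for every instruction
    return state

def run_compiled(program, state=0):
    behaviors = [decode(op) for op in program]  # decode hoisted to load time
    for behavior in behaviors:
        state = behavior(state)     # inner loop runs pre-decoded behaviors
    return state
```

Both routines compute the same result; the compiled variant merely pays the decode cost once per instruction in the program image instead of once per executed instruction, which is where the order-of-magnitude gains of the generated LISA simulators come from.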
19.7.2.2 Performance of Assembler and Linker
The generated assembler and linker are not as time critical as the simulator. It shall be mentioned, though,
that the performance (i.e., the number of assembled/linked instructions per second) of the automatically
generated tools is comparable to that of the vendor tools.
19.8 Requirements and Limitations
In this section the requirements and current limitations of different aspects of processor design using
the LISA language are discussed. These affect the modelling capabilities of the language itself as well as the
generated tools.
19.8.1 LISA Language
Common to all models described in LISA is the underlying zero-delay model. This means that all transitions
are provided correctly at each control step. Control steps may be clock phases, clock cycles, instruction
cycles, or even higher levels. Events between these control steps are not regarded. However, this property
meets the requirements that current co-simulation environments [41–43] place on processor simulators to be used
for HW/SW co-design [44,45]. Besides, the LISA language currently contains no formalism to describe
memory hierarchies such as multi-level caches. However, existing C/C++ models of memory hierarchies
can easily be integrated into the LISA architecture model.
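As a sketch of such an integration, a simple cache model can be wrapped around the simulator's memory access routines. The class and method names below are hypothetical, not part of the LISA API.

```python
# Sketch: plugging an external cache model behind a simulator's memory
# interface, as suggested above for existing C/C++ hierarchy models.
# All names are invented for illustration.

class Memory:
    """Flat backing memory the architecture model would normally access."""
    def __init__(self):
        self.cells = {}
    def read(self, addr):
        return self.cells.get(addr, 0)
    def write(self, addr, value):
        self.cells[addr] = value

class DirectMappedCache:
    """Minimal direct-mapped cache exposing the same read/write interface,
    so it can be dropped in front of the backing memory transparently."""
    def __init__(self, backing, lines=4):
        self.backing, self.lines = backing, lines
        self.tags = [None] * lines
        self.data = [0] * lines
        self.hits = self.misses = 0
    def read(self, addr):
        index, tag = addr % self.lines, addr // self.lines
        if self.tags[index] == tag:
            self.hits += 1
        else:
            self.misses += 1
            self.tags[index] = tag
            self.data[index] = self.backing.read(addr)  # fill on miss
        return self.data[index]
    def write(self, addr, value):  # write-through for simplicity
        index, tag = addr % self.lines, addr // self.lines
        self.tags[index], self.data[index] = tag, value
        self.backing.write(addr, value)

mem = DirectMappedCache(Memory())
mem.write(0x20, 7)
```

Because the cache exposes the same read/write interface as the flat memory, the rest of the architecture model needs no change, which is the point made in the text.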
19.8.2 HLL C-compiler
Due to the early stage of research, no further details on the retargetable compiler are presented within
the scope of this chapter. At its current status, the quality of the generated code is only fair. However,
it is evident that the proposed new ASIP design methodology can only be carried out efficiently in the
presence of an efficient retargetable compiler. In our case study presented in Section 19.6, major parts of
the application were realized in assembly code.
19.8.3 HDL Generator
As LISA allows modelling the architecture using a combination of both LISA language elements and pure
C/C++ code, certain coding guidelines need to be obeyed in order to generate synthesizable HDL code for
the target architecture. First, only the LISA language elements are considered; thus the usage of C code
in the model needs to be limited to the description of the data path, which is not taken into account for
HDL code generation anyway. Second, architectural properties which can be modelled in LISA but are
not synthesizable include pipelined functional units and multiple instruction-word decoders.
19.9 Conclusion and Future Work
In this chapter we presented the LISA processor design platform LPDP, a novel framework for the
design of application-specific integrated processors. The LPDP platform helps the architecture designer
in different domains: architecture exploration, implementation, application software design, and system
integration/verification.
In a case study it was shown that an ASIP, the ICORE architecture, was completely realized using this
novel design methodology, from exploration to implementation. The implementation results concern-
ing maximum frequency, area, and power consumption were comparable to those of the hand-optimized
version of the same architecture realized in a previous project.
Moreover, the quality of the generated software development tools was compared to those of the
semiconductor vendors. LISA models were realized and tools successfully generated for the ARM7 µC, the
Analog Devices ADSP2101, the Texas Instruments C62x, and the Texas Instruments C54x at instruction-
set, cycle, and phase accuracy, respectively. Due to the use of the compiled simulation principle, the generated
simulators run one to two orders of magnitude faster than the vendor simulators. In addition, the
generated assembler and linker can compete well in speed with the vendor tools.
Our future work will focus on modelling further real-world processor architectures and improving
the quality of our retargetable C-compiler. In addition, formal ways to model memory hierarchies will
be addressed. For the HDL generator, data path synthesis will be examined in the context of the SystemC
modelling language.
References
[1] M. Birnbaum and H. Sachs, How VSIA answers the SOC dilemma. IEEE Computer, 32,
42–50, 1999.
[2] S. Pees, A. Hoffmann, V. Zivojnovic, and H. Meyr, LISA – machine description language for cycle-
accurate models of programmable DSP architectures. In Proceedings of the Design Automation
Conference (DAC). New Orleans, June 1999.
[3] V. Živojnović, S. Pees, and H. Meyr, LISA – machine description language and generic machine
model for HW/SW co-design. In Proceedings of the IEEE Workshop on VLSI Signal Processing.
San Francisco, October 1996.
[4] K. Olukotun, M. Heinrich, and D. Ofelt, Digital system simulation: methodologies and examples.
In Proceedings of the Design Automation Conference (DAC), June 1998.
[5] J. Rowson, Hardware/software co-simulation. In Proceedings of the Design Automation Conference
(DAC), 1994.
[6] R. Stallman, Using and Porting the GNU Compiler Collection, gcc-2.95 ed. Free Software
Foundation, Boston, MA, 1999.
[7] G. Araujo, A. Sudarsanam, and S. Malik, Instruction set design and optimization for address com-
putation in DSP architectures. In Proceedings of the International Symposium on System Synthesis
(ISSS), 1996.
[8] C. Liem et al., Industrial experience using rule-driven retargetable code generation for multi-
media applications. In Proceedings of the International Symposium on System Synthesis (ISSS),
September 1995.
[9] D. Engler, VCODE: a retargetable, extensible, very fast dynamic code generation system.
In Proceedings of the International Conference on Programming Language Design and Implementation
(PLDI), May 1996.
[10] D. Bradlee, R. Henry, and S. Eggers, The Marion system for retargetable instruction schedul-
ing. In Proceedings of the ACM SIGPLAN '91 Conference on Programming Language Design and
Implementation. Toronto, Canada, 1991, pp. 229–240.
[11] B. Rau, VLIW compilation driven by a machine description database. In Proceedings of the 2nd
Code Generation Workshop. Leuven, Belgium, 1996.
[12] M. Freericks, The nML machine description formalism. Technical Report 1991/15, Technische
Universität Berlin, Fachbereich Informatik, Berlin, 1991.
[13] A. Fauth, J. Van Praet, and M. Freericks, Describing instruction set processors using nML.
In Proceedings of the European Design and Test Conference. Paris, March 1995.
[14] M. Hartoog et al., Generation of software tools from processor descriptions for hardware/software
codesign. In Proceedings of the Design Automation Conference (DAC), June 1997.
[15] W. Geurts et al., Design of DSP systems with chess/checkers. In Proceedings of the 2nd International
Workshop on Code Generation for Embedded Processors. Leuven, March 1996.
[16] J. Van Praet et al., A graph based processor model for retargetable code generation. In Proceedings
of the European Design and Test Conference (ED&TC), March 1996.
[17] V. Rajesh and R. Moona, Processor modeling for hardware software codesign. In Proceedings of the
International Conference on VLSI Design. Goa, India, January 1999.
[18] G. Hadjiyiannis, S. Hanono, and S. Devadas, ISDL: an instruction set description language for
retargetability. In Proceedings of the Design Automation Conference (DAC), June 1997.
[19] A. Halambi et al., EXPRESSION: a language for architecture exploration through com-
piler/simulator retargetability. In Proceedings of the Conference on Design, Automation & Test
in Europe (DATE), March 1999.
[20] P. Paulin, Design automation challenges for application-specic architecture platforms. In
Proceedings of the SCOPES 2001 Workshop on Software and Compilers for Embedded Systems,
March 2001.
[21] ACE Associated Compiler Experts, The COSY Compilation System, 2001. http://www.ace.nl/
products/cosy.html
[22] T. Morimoto, K. Saito, H. Nakamura, T. Boku, and K. Nakazawa, Advanced processor design using
hardware description language AIDL. In Proceedings of the Asia South Pacic Design Automation
Conference (ASPDAC), March 1997.
[23] I. Huang, B. Holmer, and A. Despain, ASIA: automatic synthesis of instruction-set architectures.
In Proceedings of the SASIMI Workshop, October 1993.
[24] M. Gschwind, Instruction set selection for ASIP design. In Proceedings of the International Workshop
on Hardware/Software Codesign, May 1999.
[25] S. Kobayashi et al., Compiler generation in PEAS-III: an ASIP development system. In Pro-
ceedings of the SCOPES 2001 Workshop on Software and Compilers for Embedded Systems,
March 2001.
[26] C.-M. Kyung, Metacore: an application specic DSP development system. In Proceedings of the
Design Automation Conference (DAC), June 1998.
[27] M. Barbacci, Instruction set processor specifications (ISPS): the notation and its application. IEEE
Transactions on Computers, C-30, 24–40, 1981.
[28] R. Gonzales, Xtensa: a congurable and extensible processor. IEEE Micro, 20, 2000.
[29] Synopsys, COSSAP. http://www.synopsys.com
[30] OPNET, http://www.opnet.com
[31] LISA Homepage, ISS, RWTH Aachen, 2001, http://www.iss.rwth-aachen.de/lisa
[32] T. Gloekler, S. Bitterlich, and H. Meyr, Increasing the power efciency of application-specic
instruction set processors using datapath optimization. In Proceedings of the IEEE Workshop on
Signal Processing Systems (SIPS). Lafayette, October 2001.
[33] Synopsys, DesignWare Components, 1999. http://www.synopsys.com/products/designware/
designware.html
[34] A. Hoffmann, A. Nohl, G. Braun, and H. Meyr, Generating production quality software devel-
opment tools using a machine description language. In Proceedings of the Conference on Design,
Automation & Test in Europe (DATE), March 2001.
[35] S. Pees, A. Hoffmann, and H. Meyr, Retargeting of compiled simulators for digital signal processors
using a machine description language. In Proceedings of the Conference on Design, Automation &
Test in Europe (DATE). Paris, March 2000.
[36] Texas Instruments, TMS320C62x/C67x CPU and Instruction Set Reference Guide, March 1998.
[37] Texas Instruments, TMS320C54x CPU and Instruction Set Reference Guide, October 1996.
[38] R. Sites et al., Binary translation. Communications of the ACM, 36, 69–81, 1993.
[39] Advanced RISC Machines Ltd., ARM7 Data Sheet, December 1994.
[40] Analog Devices, ADSP2101 User's Manual, September 1993.
[41] Synopsys, Eaglei, 1999. http://www.synopsys.com/products/hwsw
[42] Cadence, Cierto, 1999. http://www.cadence.com/technology/hwsw
[43] Mentor Graphics, Seamless, 1999. http://www.mentor.com/seamless
[44] L. Guerra et al., Cycle and phase accurate DSP modeling and integration for HW/SW
co-verication. In Proceedings of the Design Automation Conference (DAC), June 1999.
[45] R. Earnshaw, L. Smith, and K. Welton, Challenges in cross-development. IEEE Micro, 17,
28–36, 1997.
20
State-of-the-Art SoC
Communication
Architectures
José L. Ayala and
Marisa López-Vallejo
Universidad Politécnica de Madrid
Davide Bertozzi and
Luca Benini
University of Bologna
20.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-1
20.2 AMBA Bus. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-2
AMBA System Bus • AMBA AHB Basic Operation •
Advanced Peripheral Bus • Advanced AMBA Evolutions
20.3 CoreConnect Bus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-7
Processor Local Bus • On-Chip Peripheral Bus • Device
Control Register Bus
20.4 STBus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-10
Bus Topologies
20.5 Wishbone . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-11
The Wishbone Bus Transactions
20.6 SiliconBackplane MicroNetwork . . . . . . . . . . . . . . . . . . . . . . . 20-12
System Interconnect Bandwidth Configuration Resources
20.7 Other On-Chip Interconnects . . . . . . . . . . . . . . . . . . . . . . . . . . 20-14
Peripheral Interconnect Bus • Avalon • CoreFrame
20.8 Analysis of Communication Architectures. . . . . . . . . . . . . 20-15
Scalability Analysis
20.9 Packet-Switched Interconnection Networks . . . . . . . . . . . 20-20
20.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-21
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20-21
20.1 Introduction
The current high levels of on-chip integration allow for the implementation of increasingly complex
Systems-on-Chip (SoC), consisting of heterogeneous components such as general-purpose processors,
Digital Signal Processors (DSPs), coprocessors, memories, I/O units, and dedicated hardware accelerators.
In this context, MultiProcessor Systems-on-Chip (MPSoC) are emerging as an effective solution to
meet the demand for computational power posed by application domains such as network processors
and parallel media processors. MPSoCs combine the advantages of parallel processing with the high
integration levels of SoCs.
It is expected that future MPSoCs will integrate hundreds of processing units and storage elements,
and their performance will be increasingly interconnect dominated [1]. Interconnect technology and
architecture will become the limiting factor for achieving operational goals, and the efficient design
of low-power, high-performance on-chip communication architectures will pose novel challenges. The
main issue regards the scalability of system interconnects, since the trend toward system integration is expected to
continue. State-of-the-art on-chip buses rely on shared communication resources and on an arbitration
mechanism that is in charge of serializing bus access requests. This widely adopted solution unfortunately
suffers from power and performance scalability limitations; therefore, a lot of effort is being devoted to
the development of advanced bus topologies (e.g., partial or full crossbars, bridged buses) and protocols,
some of which are already implemented in commercially available products. In the long run, a more
aggressive approach will be needed, and a design paradigm shift will most probably lead to packetized
on-chip communication based on micronetworks of interconnects, or Networks-on-Chip (NoC) [2,3].
This chapter focuses on state-of-the-art SoC communication architectures, providing an overview of
the most relevant ones from an industrial and research viewpoint. Beyond describing the distinctive
features of each of them, the chapter sketches the main evolution guidelines for these architectures by
means of a protocol and topology analysis framework. Finally, some basic concepts on packet-switched
interconnection networks will be put forward. Open bus specifications such as the Advanced Microcontrol-
ler Bus Architecture (AMBA) and CoreConnect will obviously be described in more detail, providing the
background needed to understand the more general description of proprietary industrial bus
architectures, while at the same time being able to assess their contribution to the advance of the field.
20.2 AMBA Bus
AMBA is a bus standard originally conceived by ARM to support communication among ARM
processor cores. However, nowadays AMBA is one of the leading on-chip bus systems because it is
licensed and deployed for use with third-party Intellectual Property (IP) cores [4]. Designed for custom
silicon, the AMBA specification provides standard bus protocols for connecting on-chip components,
custom logic, and specialized functions. These bus protocols are independent of the ARM processor and
generalized for different SoC structures.
AMBA defines a segmented bus architecture, wherein two bus segments are connected with each other
via a bridge that buffers data and operations between them. A system bus is defined, which provides a high-
speed, high-bandwidth communication channel between embedded processors and high-performance
peripherals. Two system buses are actually specified: the AMBA High-Speed Bus (AHB) and the Advanced
System Bus (ASB).
Moreover, a low-performance and low-power peripheral bus (called the Advanced Peripheral Bus, APB) is
specified, which accommodates communication with general-purpose peripherals and is connected to the
system bus via a bridge, acting as the only APB master. The overall AMBA architecture is illustrated in
Figure 20.1.
20.2.1 AMBA System Bus
ASB is the first generation of AMBA system bus, and sits above APB in that it implements the features
required for high-performance systems, including burst transfers, pipelined transfer operation, and mul-
tiple bus masters. AHB is a later generation of AMBA bus which is intended to address the requirements of
high-performance, high-clock-frequency synthesizable designs. ASB is used for simpler, more cost-effective designs,
whereas more sophisticated designs call for the employment of AHB. For this reason, a detailed
description of AHB follows.
The main features of AMBA AHB can be summarized as follows:
Multiple bus masters. Optimized system performance is obtained by sharing resources among different
bus masters. A simple request-grant mechanism is implemented between the arbiter and each bus master.
In this way, the arbiter ensures that only one bus master is active on the bus, and also that when no masters
are requesting the bus a default master is granted.
[Figure: block diagram of the AMBA architecture. A bridge joins the high-speed AMBA AHB system bus to the low-power AMBA APB peripheral bus; the blocks include an ARM CPU, SDRAM controller, external memory, SRAM, color LCD controller, smart card I/F, UART, synchronous serial port, audio codec I/F, and test I/F controller.]
FIGURE 20.1 Schematic architecture of AMBA bus.
Pipelined and burst transfers. Address and data phases of a transfer occur during different clock periods.
In fact, the address phase of any transfer occurs during the data phase of the previous transfer. This
overlapping of address and data is fundamental to the pipelined nature of the bus and allows for high-
performance operation, while still providing adequate time for a slave to provide the response to a transfer.
This also implies that ownership of the data bus is delayed with respect to ownership of the address bus.
Moreover, support for burst transfers allows for efficient use of memory interfaces by providing transfer
information in advance.
Split transactions. They maximize the use of bus bandwidth by enabling high-latency slaves to release
the system bus during the dead time in which they complete processing of their access requests.
Wide data bus configurations. Support for high-bandwidth, data-intensive applications is provided using
wide on-chip memories. System buses support 32-, 64-, and 128-bit data bus implementations with a
32-bit address bus, as well as smaller byte and half-word designs.
Nontristate implementation. AMBA AHB implements separate read and write data buses in order to
avoid the use of tristate drivers. In particular, master and slave signals are multiplexed onto the shared
communication resources (read and write data buses, address bus, control signals).
A typical AMBA AHB system contains the following components:
AHB master. Only one bus master at a time is allowed to initiate and complete read and write trans-
actions. Bus masters drive out the address and control signals and the arbiter determines which master
has its signals routed to all of the slaves. A central decoder controls the read data and response signal
multiplexor, which selects the appropriate signals from the slave that has been addressed.
AHB slave. It signals back to the active master the status of the pending transaction. It can indicate
that the transfer completed successfully, that there was an error, that the master should retry the transfer,
or that a split transaction is beginning.
AHB arbiter. The bus arbiter serializes bus access requests. The arbitration algorithm is not specified
by the standard and its selection is left as a design parameter (fixed priority, round-robin, latency-driven,
etc.), although the request-grant based arbitration protocol has to be kept fixed.
AHB decoder. This is used for address decoding and provides the select signal to the intended slave.
20-4 Embedded Systems Handbook
20.2.2 AMBA AHB Basic Operation
In a normal bus transaction, the arbiter grants the bus to the master until the transfer completes and the
bus can then be handed over to another master. However, in order to avoid excessive arbitration latencies,
the arbiter can break up a burst. In that case, the master must rearbitrate for the bus in order to complete
the remaining data transfers.
A basic AHB transfer consists of four clock cycles. During the first one, the request signal is asserted,
and in the best case at the end of the second cycle a grant signal from the arbiter can be sampled by the
master. Then, address and control signals are asserted for slave sampling on the next rising edge, and
during the last cycle the data phase is carried out (read data bus-driven or information on the write data
bus sampled). A slave may insert wait states into any transfer, thus extending the data phase, and a ready
signal is available for this purpose.
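The best-case timing described above can be sketched as a simple cycle timeline; the signal names (HBUSREQ, HGRANT, HREADY) follow AHB convention, but the model itself is purely illustrative:

```python
def ahb_transfer(wait_states=0):
    """Return the sequence of cycles of a basic AHB transfer: request,
    grant, address phase, then a data phase that a slave may stretch
    by deasserting the ready signal (wait states)."""
    timeline = ["HBUSREQ asserted",        # cycle 1: master requests the bus
                "HGRANT sampled",          # cycle 2: grant sampled (best case)
                "address/control driven"]  # cycle 3: address phase
    timeline += ["wait state (HREADY low)"] * wait_states
    timeline.append("data phase (HREADY high)")  # final cycle: data moves
    return timeline

print(len(ahb_transfer()))              # 4 cycles in the best case
print(len(ahb_transfer(wait_states=2))) # wait states extend the data phase
```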
Four-, eight-, and sixteen-beat bursts are defined in the AMBA AHB protocol, as well as undefined-
length bursts. During a burst transfer, the arbiter rearbitrates the bus when the penultimate address has
been sampled, so that the asserted grant signal can be sampled by the corresponding master at the same point
where the last address of the burst is sampled. This makes bus master handover at the end of a burst
transfer very efficient.
For long transactions, the slave can decide to split the operation, warning the arbiter that the master
should not be granted access to the bus until the slave indicates it is ready to complete the transfer. This
transfer-splitting mechanism is supported by all advanced on-chip interconnects, since it prevents high-
latency slaves from keeping the bus busy without performing any actual transfer of data.
Indeed, split transfers can significantly improve bus efficiency, that is, reduce the number of
bus busy cycles used just for control (e.g., protocol handshake) rather than for actual data transfers. Advanced
arbitration features are required in order to support split transfers, as well as more complex master and
slave interfaces.
20.2.3 Advanced Peripheral Bus
The AMBA APB is intended for general-purpose low-speed low-power peripheral devices. It enables the
connection to the main system bus via a bridge. All bus devices are slaves, the bridge being the only
peripheral bus master.
This is a static bus that provides simple addressing, with latched addresses and control signals for easy
interfacing. ARM recommends a dual read and write bus implementation, but APB can be implemented
with a single tristated data bus.
The main features of this bus are the following:
Unpipelined architecture
Low-gate count
Low-power operation
(a) Reduced loading of the main system bus is obtained by isolating the peripherals behind the
bridge.
(b) Peripheral bus signals are only active during low-bandwidth peripheral transfers.
AMBA APB operation can be abstracted as a state machine with three states. The default state for the
peripheral bus is IDLE, which switches to the SETUP state when a transfer is required. The SETUP state lasts just
one cycle, during which the peripheral select signal is asserted. The bus then moves to the ENABLE state,
which also lasts only one cycle and which requires the address, control, and data signals to remain stable.
Then, if other transfers are to take place, the bus goes back to the SETUP state, otherwise to IDLE. As can be
observed, AMBA APB should be used to interface to any peripherals which are low bandwidth and do not
require the high performance of a pipelined bus interface.
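The three-state behavior described above can be sketched as a tiny next-state function (state names are from the text; the transfer_pending input is an illustrative abstraction of the bridge's request logic):

```python
# Minimal model of the three-state AMBA APB controller described above.
def apb_next_state(state, transfer_pending):
    if state == "IDLE":
        return "SETUP" if transfer_pending else "IDLE"
    if state == "SETUP":        # lasts exactly one cycle, select asserted
        return "ENABLE"
    if state == "ENABLE":       # lasts exactly one cycle, signals stable
        return "SETUP" if transfer_pending else "IDLE"
    raise ValueError(state)

# Back-to-back transfers alternate SETUP/ENABLE without revisiting IDLE:
state = "IDLE"
trace = []
for pending in (True, True, True, False):
    state = apb_next_state(state, pending)
    trace.append(state)
print(trace)
```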
20.2.4 Advanced AMBA Evolutions
Recently, some advanced specications of AMBA bus have appeared, featuring increased performance
and better link utilization. In particular, the Multi-Layer AHB and the AMBA AXI interconnect schemes
will be briefly addressed in the following subsections.
It should be observed that interconnect performance improvement can be achieved by adopting new
topologies and by choosing new protocols, at the expense of silicon area. The former strategy leads
from shared buses to bridged clusters, partial or full crossbars, and eventually to NoCs, in an attempt to
increase available bandwidth and to reduce local contention. The latter strategy instead tries to maximize
link utilization by adopting more sophisticated control schemes and thus permitting a better sharing of
existing resources.
Multi-Layer AHB can be seen as an evolution of bus topology while keeping the AHB protocol
unchanged. On the contrary, AMBA AXI represents an advanced interconnect fabric protocol.
20.2.4.1 Multi-Layer AHB
The Multi-Layer AHB specification emerges with the aim of increasing the overall bus bandwidth and
providing a more flexible interconnect architecture with respect to AMBA AHB. This is achieved by using
a more complex interconnection matrix which enables parallel access paths between multiple masters and
slaves in a system [5].
Therefore, the multi-layer bus architecture allows the interconnection of unmodified standard AHB
master and slave modules with an increased available bus bandwidth. The resulting architecture is
very simple and flexible: each AHB layer has only one master, so no arbitration or master-to-slave
multiplexing is needed. Moreover, the interconnect protocol implemented in these layers can be very simple: it
does not have to support request and grant, nor retry or split transactions.
The additional hardware needed for this architecture with respect to AHB is a multiplexer to connect
the multiple masters to the peripherals; some point arbitration is also required when more than one
master wants to access the same slave simultaneously.
Figure 20.2 shows a schematic view of the multi-layer concept. The interconnect matrix contains
a decode stage for every layer in order to determine which slave is required during the transfer. The
multiplexer is used to route the request from the specic layer to the desired slave.
The arbitration protocol decides the sequence of accesses of layers to slaves based on a priority assign-
ment. The layer with lowest priority has to wait for the slave to be freed. Different arbitration schemes can
be used, and every slave port has its own arbitration. Input layers can be served in a round-robin fashion,
changing every transfer or every burst transaction, or based on a fixed priority scheme.
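A minimal sketch of one such per-slave-port arbitration scheme, assuming round-robin service among the layers requesting a given slave (the class and its interface are illustrative, not part of the specification):

```python
class SlavePortArbiter:
    """Per-slave-port arbiter for a multi-layer AHB sketch: each slave
    port independently picks one requesting layer, round-robin."""

    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.last = num_layers - 1  # so layer 0 gets the first turn

    def arbitrate(self, requesting_layers):
        """requesting_layers: set of layer IDs requesting this slave.
        Returns the winning layer, or None if the slave is idle."""
        if not requesting_layers:
            return None
        # scan layers starting just after the last winner (round-robin)
        for offset in range(1, self.num_layers + 1):
            layer = (self.last + offset) % self.num_layers
            if layer in requesting_layers:
                self.last = layer
                return layer

arb = SlavePortArbiter(num_layers=3)
print(arb.arbitrate({0, 2}))  # layer 0 served first
print(arb.arbitrate({0, 2}))  # then layer 2; layer 0 waits its turn
```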
The number of input/output ports on the interconnect matrix is completely flexible and can be adapted
to suit system requirements. As the number of masters and slaves implemented in the system increases,
the complexity of the interconnection matrix can become significant and some optimization techniques
have to be used: defining multiple masters on a single layer, making multiple slaves appear as a single slave to
the interconnect matrix, and defining local slaves to a particular layer.
Finally, it is interesting to outline the capability of this topology to support multi-port slaves. Some
devices, such as SDRAM controllers, work much more efficiently when processing transfers from different
layers in parallel.
20.2.4.2 AMBA AXI Protocol
AXI is the latest generation AMBA interface. It is designed to be used as a high-speed submicron inter-
connect, and also includes optional extensions for low-power operation [6]. This high-performance
protocol provides flexibility in the implementation of interconnect architectures while still keeping
backward compatibility with existing AHB and APB interfaces.
AMBA AXI builds upon the concept of point-to-point connection. AMBA AXI does not provide
masters and slaves with visibility of the underlying interconnect, instead featuring the concept of master
interfaces and symmetric slave interfaces. This approach, besides allowing seamless topology scaling, has
[Figure: two master layers, each with its own decode stage, connected through multiplexers to multiple shared slaves.]
FIGURE 20.2 Schematic view of the multi-layer AHB interconnect.
the advantage of simplifying the handshake logic of attached devices, which only need to manage a
point-to-point link.
To provide high scalability and parallelism, four different logical unidirectional channels are provided
in AXI interfaces: an address channel, a read channel, a write channel, and a write response channel.
Activity on different channels is mostly asynchronous (e.g., data for a write can be pushed to the write
channel before or after the write address is issued to the address channel), and can be parallelized, allowing
multiple outstanding read and write requests.
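The decoupled-channel idea can be sketched by modeling the four logical channels as independent queues (a behavioral illustration only; the addresses and the queue-based slave loop are assumptions, not AXI signaling):

```python
from collections import deque

# Each of the four logical channels named above becomes an independent
# queue; activity is decoupled, so write data may be queued before the
# matching address, and several requests may be outstanding at once.
channels = {name: deque() for name in
            ("addr", "read_data", "write_data", "write_resp")}

channels["write_data"].append(0xCAFE)                     # data pushed first...
channels["addr"].append(("W", 0x3000))                    # ...address follows
channels["addr"].extend([("R", 0x1000), ("R", 0x2000)])   # two outstanding reads

# Slave side: serve reads on the read channel; acknowledge the write
# on the separate write response channel.
while channels["addr"]:
    kind, addr = channels["addr"].popleft()
    if kind == "R":
        channels["read_data"].append(("data @", addr))
    else:
        channels["write_data"].popleft()
        channels["write_resp"].append("OKAY")

print(len(channels["read_data"]), channels["write_resp"][0])
```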
Figure 20.3(a) shows how a read transaction uses the read address and read data channels. The write
operation over the write address and write data channels is presented in Figure 20.3(b).
As can be observed, the data is transferred from the master to the slave using a write data channel, and
it is transferred from the slave to the master using a read data channel. In write transactions, in which all
the data flows from the master to the slave, the AXI protocol has an additional write response channel to
allow the slave to signal to the master the completion of the write transaction.
However, the AXI protocol is a master/slave-to-interconnect interface definition, and this enables a
variety of different interconnect implementations. Therefore, the mapping of channels, as visible by the
interfaces, to actual internal communication lanes is decided by the interconnect designer; single resources
might be shared by all channels of a certain type in the system, or a variable amount of dedicated signals
might be available, up to a full crossbar scheme. The rationale of this split-channel implementation is
based upon the observation that usually the required bandwidth for addresses is much lower than that
for data (e.g., a burst requires a single address but maybe four or eight data transfers). Availability of
independently scalable resources might, for example, lead to medium complexity designs sharing a single
internal address channel while providing multiple data read and write channels.
Finally, some of the key incremental features of the AXI protocol can be listed as follows:
Support for out-of-order completion of transactions.
Easy addition of register stages to provide timing closure.
Support for multiple address issuing.
Separate read and write data channels to enable low-cost Direct Memory Access (DMA).
Support for unaligned data transfers.
[Figure: (a) the master interface issues address and control on the read address channel and receives read data on the read data channel; (b) it issues address and control on the write address channel, sends write data on the write data channel, and receives completion on the write response channel.]
FIGURE 20.3 Architecture of transfers: (a) read operation, (b) write operation.
20.3 CoreConnect Bus
CoreConnect is an IBM-developed on-chip bus that eases the integration and reuse of processor, subsystem
and peripheral cores within standard product platform designs. It is a complete and versatile architecture
clearly targeting high-performance systems, and many of its features might be overkill in simple embedded
applications [7].
The CoreConnect bus architecture serves as the foundation of IBM Blue Logic or other non-IBM
devices. The Blue Logic ASIC/SoC design methodology is the approach proposed by IBM [8] to extend
conventional ASIC design ows to current design needs: low-power and multiple-voltage products,
recongurable logic, custom design capability, and analog/mixed-signal designs. Each of these offer-
ings requires a well-balanced coupling of technology capabilities and design methodology. The use of this
bus architecture allows the hierarchical design of SoCs.
As can be seen in Figure 20.4, the IBM CoreConnect architecture provides three buses for inter-
connecting cores, library macros, and custom logic:
Processor Local Bus (PLB)
On-Chip Peripheral Bus (OPB)
Device Control Register (DCR) Bus
The PLB bus connects the processor to high-performance peripherals, such as memories, DMA con-
trollers, and fast devices. Bridged to the PLB, the OPB supports slower-speed peripherals. Finally, the DCR
bus is a separate control bus that connects all devices, controllers, and bridges and provides a separate
[Figure: processor core, auxiliary processor, on-chip memory, and system cores on the arbitrated processor local bus (PLB); a bus bridge to the arbitrated on-chip peripheral bus (OPB) hosting peripheral cores; a separate DCR bus chaining the cores.]
FIGURE 20.4 Schematic structure of the CoreConnect bus.
path to set and monitor the individual control registers. It is designed to transfer data between the CPU's
general-purpose registers and the slave logic's device control registers. It removes configuration registers
from the memory address map, which reduces loading and improves bandwidth of the PLB.
This architecture shares many high-performance features with the AMBA bus specication. On
one hand, both architectures allow split, pipelined, and burst transfers, multiple bus masters, and 32-,
64-, or 128-bit architectures. On the other hand, CoreConnect also supports multiple masters in the
peripheral bus.
Please note that design toolkits are available for the CoreConnect bus and include functional models,
monitors, and a bus functional language to drive the models. These toolkits provide an advanced validation
environment for engineers designing macros to attach to the PLB, OPB, and DCR buses.
20.3.1 Processor Local Bus
The PLB is the main system bus targeting high-performance and low-latency on-chip communication.
More specically, PLB is a synchronous, multi-master, arbitrated bus. It supports concurrent read and
write transfers, thus yielding a maximum bus utilization of two data transfers per clock cycle. Moreover,
PLB implements address pipelining, which reduces bus latency by overlapping a new write request with an
ongoing write transfer and up to three read requests with an ongoing read transfer [9].
Access to PLB is granted through a central arbitration mechanism that allows masters to compete
for bus ownership. This arbitration mechanism is flexible enough to provide for the implementation of
various priority schemes. In fact, four levels of request priority for each master allow PLB implementation
with various arbitration priority schemes. Additionally, an arbitration locking mechanism is provided to
support master-driven atomic operations. PLB also exhibits the ability to overlap the bus request/grant
protocol with an ongoing transfer.
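A sketch of such a priority-based arbitration decision, assuming the four request priority levels mentioned above and an illustrative lowest-ID tie-break (the specification leaves the actual priority scheme to the implementer):

```python
def plb_arbitrate(requests):
    """Sketch of PLB arbitration: each master requests with one of four
    priority levels (0 = lowest, 3 = highest). The highest priority
    wins; the tie-break among equal priorities is a design parameter
    (here lowest master ID, purely for illustration).
    requests: dict mapping master_id -> priority level."""
    if not requests:
        return None
    best = max(requests.values())
    return min(m for m, p in requests.items() if p == best)

# Masters 1 and 2 tie at priority 3; the tie-break picks master 1:
print(plb_arbitrate({0: 1, 1: 3, 2: 3}))
```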
The PLB specication describes a system architecture along with a detailed description of the signals
and transactions. PLB-based custom logic systems require the use of a PLB macro to interconnect the
various master and slave macros.
The PLB macro is the key component of PLB architecture, and consists of a bus arbitration control unit
and the control logic required to manage the address and data flow through the PLB. Each PLB master is
attached to the PLB through separate address, read data, and write data buses and a plurality of transfer
qualifier signals, while PLB slaves are attached through shared, but decoupled, address, read data, and write
data buses (each one with its own transfer control and status signals). The separate address and data buses
from the masters allow simultaneous transfer requests. The PLB macro arbitrates among them and sends
the address, data, and control signals from the granted master to the slave bus. The slave response is then
routed back to the appropriate master. Up to 16 masters can be supported by the arbitration unit, while
there are no restrictions in the number of slave devices.
20.3.2 On-Chip Peripheral Bus
Frequently, the OPB architecture connects low-bandwidth devices such as serial and parallel ports, UARTs,
timers, etc. and represents a separate, independent level of bus hierarchy. It is implemented as a multi-
master, arbitrated bus. It is a fully synchronous interconnect with a common clock, but its devices can run
with slower clocks, as long as all of the clocks are synchronized with the rising edge of the main clock.
This bus uses a distributed multiplexer attachment implementation instead of tristate drivers. The
OPB supports multiple masters and slaves by implementing the address and data buses as a distributed
multiplexer. This type of structure is suitable for the less data intensive OPB bus and allows adding
peripherals to a custom core logic design without changing the I/O on either the OPB arbiter or existing
peripherals. All of the masters are capable of providing an address to the slaves, whereas both masters and
slaves are capable of driving and receiving the distributed data bus.
PLB masters gain access to the peripherals on the OPB bus through the OPB bridge macro. The OPB
bridge acts as a slave device on the PLB and a master on the OPB. It supports word (32-bit), half-word
(16-bit), and byte read and write transfers on the 32-bit OPB data bus, as well as bursts, and has the capability
to perform target-word-first line read accesses. The OPB bridge performs dynamic bus sizing, allowing
devices with different data widths to communicate efficiently. When the OPB bridge master performs an
operation wider than the selected OPB slave can support, the bridge splits the operation into two or more
smaller transfers.
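The splitting performed by dynamic bus sizing can be sketched as follows (a behavioral illustration; the function name and beat layout are assumptions, not taken from the OPB specification):

```python
def split_transfer(address, width_bytes, slave_width_bytes):
    """Sketch of OPB-bridge dynamic bus sizing: a transfer wider than
    the selected slave supports is split into two or more narrower
    transfers. Returns the (address, size) beats seen by the slave."""
    if width_bytes <= slave_width_bytes:
        return [(address, width_bytes)]   # no splitting needed
    beats = []
    for offset in range(0, width_bytes, slave_width_bytes):
        beats.append((address + offset, slave_width_bytes))
    return beats

# A 32-bit (4-byte) write to a byte-wide slave becomes four byte beats:
print(split_transfer(0x100, 4, 1))
# A half-word transfer to a 32-bit slave passes through unchanged:
print(split_transfer(0x200, 2, 4))
```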
Some of the main features of the OPB specication are:
Fully synchronous
Dynamic bus sizing: byte, half-word, full-word, and double-word transfers
Separate address and data buses
Support for multiple OPB bus masters
Single cycle transfer of data between OPB bus master and OPB slaves
Sequential address (burst) protocol
16-cycle fixed bus timeout provided by the OPB arbiter
Bus arbitration overlapped with last cycle of bus transfers
Optional OPB DMA transfers
20.3.3 Device Control Register Bus
The DCR bus provides an alternative path to the system for setting the individual device control registers.
The latter are on-chip registers that are implemented outside the processor core, from an architectural
viewpoint. Through the DCR bus, the host CPU can set up the device-control-register sets without
loading down the main PLB. This bus has a single master, the CPU interface, which can read or write
to the individual device control registers. The DCR bus architecture allows data transfers among OPB
peripherals to occur independently from, and concurrently with data transfers between processor and
memory, or among other PLB devices. The DCR bus architecture is based on a ring topology to connect
the CPU interface to all devices. The DCR bus is typically implemented as a distributed multiplexer across
the chip such that each subunit not only has a path to place its own DCRs on the CPU read path, but
also has a path which bypasses its DCRs and places another unit's DCRs on the CPU read path. The DCR bus
consists of a 10-bit address bus and a 32-bit data bus.
This is a synchronous bus, wherein slaves may be clocked either faster or slower than the master,
although a synchronization of clock signals with the DCR bus clock is required.
Finally, bursts are not supported by this bus, and read or write transfers take a minimum of two cycles.
Optionally, they can be extended by the slaves or by the single master.
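The distributed-multiplexer read path described above can be sketched as a walk around the ring, where each subunit either drives its own register value onto the path or bypasses it unchanged (register addresses and values are illustrative):

```python
def dcr_read(units, address):
    """Sketch of the DCR daisy-chain read path. units: list of dicts
    (each subunit's DCR address -> value), chained in ring order.
    The unit that decodes the address drives the read path; every
    other unit takes the bypass path and forwards the value as-is."""
    value = 0  # value entering the ring from the CPU interface
    for regs in units:
        if address in regs:
            value = regs[address]   # this unit's DCR joins the read path
        # else: bypass path, incoming value forwarded unchanged
    return value

ring = [{0x001: 0xAA}, {0x002: 0xBB}, {0x003: 0xCC}]
print(hex(dcr_read(ring, 0x002)))
```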
20.4 STBus
STBus is an STMicroelectronics proprietary on-chip bus protocol. STBus is dedicated to SoCs designed for
high-bandwidth applications such as audio/video processing [10]. The STBus interfaces and protocols are
closely related to the industry-standard VCI (Virtual Component Interface). The components interconnected
by an STBus are either initiators (which initiate transactions on the bus by sending requests), or targets
(which respond to requests). The bus architecture is decomposed into nodes (sub-buses in which initiators
and targets can communicate directly), and the internode communications are performed through First
In First Out (FIFO) buffers. Figure 20.5 shows a schematic view of the STBus interconnect.
STBus implements three different protocols that can be selected by the designer in order to meet the
complexity, cost, and performance constraints. From lower to higher, they can be listed as follows:
Type 1: Peripheral protocol. This type is the low-cost implementation for low/medium-performance.
Its simple design allows a synchronous handshake protocol and provides a limited transaction set. The
peripheral STBus is targeted at modules that require a low complexity medium data rate communication
path with the rest of the system. This typically includes standalone modules such as general-purpose
input/output or modules which require independent control interfaces in addition to their main memory
interface.
Type 2: Basic protocol. In this case, the limited operation set of the peripheral interface is extended to
a full operation set, including compound operations, source labeling and some priority and transaction
labeling. Moreover, this implementation supports split and pipelined accesses, and is aimed at devices
which need high performance but do not require the additional system efficiency associated with shaped
request/response packets or the ability to reorder outstanding operations.
Type 3: Advanced protocol. The most advanced implementation upgrades previous interfaces with
support for out-of-order execution and shaped packets, and is equivalent to the advanced VCI protocol.
Split and pipelined accesses are supported. It allows the improvement of performance either by allowing
more operations to occur concurrently, or by rescheduling operations more efficiently.
A type 2 protocol preserves the order of requests and responses. One constraint is that, when commu-
nicating with a given target, an initiator cannot send a request to a new target until it has received all the
responses from the current target. Requests still awaiting responses are called pending, and a pending request
controller manages them. A given type 2 target is assumed to send the responses in the same order as the
request arrival order. In type 3 protocol, the order of responses may not be guaranteed, and an initiator
can communicate with any target, even if it has not received all responses from a previous one.
[Figure: initiator IPs (masters) and targets (slaves) attached to the STBus through Type 1, Type 2, and Type 3 interfaces; IPs with other bus interfaces connect through STBus interface converters.]
FIGURE 20.5 Schematic view of the STBus interconnect.
Associated with these protocols, hardware components have been designed in order to build complete
reconfigurable interconnections between initiators and targets. A toolkit with a graphical interface has been
developed around STBus to automatically generate the top-level backbone, cycle-accurate high-level models,
a path to implementation, bus analysis (latencies, bandwidth), and bus verification (protocol and behavior).
An STBus system includes three generic architectural components. The node arbitrates and routes the
requests and optionally the responses. The converter is in charge of converting the requests from one
protocol to another (for instance, from basic to advanced). Finally, the size converter is used between two
buses of the same type but of different widths. It includes buffering capability.
The STBus can implement various arbitration strategies and allows them to be changed dynamically. In a
simplified single-node system example, a communication between one initiator and a target is performed
in several steps:
A request/grant step between the initiator and the node takes place, corresponding to an atomic
rendezvous operation of the system.
The request is transferred from the node to the target.
A response-request/grant step is carried out between the target and the node.
The response-request is transferred from the node to the initiator.
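The four steps above can be sketched as a sequence of events through a single node (the component names and event strings are illustrative, not STBus signal names):

```python
def stbus_transaction(initiator, target):
    """Return the ordered events of a single-node STBus communication,
    following the four steps listed in the text."""
    return [
        f"{initiator} -> node : request/grant (rendezvous)",  # step 1
        f"node -> {target} : request transferred",            # step 2
        f"{target} -> node : response-request/grant",         # step 3
        f"node -> {initiator} : response transferred",        # step 4
    ]

for event in stbus_transaction("CPU", "SDRAM"):
    print(event)
```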
20.4.1 Bus Topologies
STBus can instantiate different bus topologies, trading off communication parallelism against architectural
complexity. In particular, system interconnects with different scalability properties can be instantiated
such as:
Single shared bus: suitable for simple low-performance implementations. It features minimum
wiring area but limited scalability.
Full crossbar: targets complex high-performance implementations. Large wiring area overhead.
Partial crossbar: intermediate solution, medium performance, implementation complexity, and
wiring overhead.
It is worth observing that STBus allows for the instantiation of complex bus systems such as hetero-
geneous multi-node buses (thanks to size or type converters) and facilitates bridging with different bus
architectures, provided proper protocol converters are made available (e.g., STBus and AMBA).
20.5 Wishbone
The Wishbone SoC interconnect [11] defines two types of interfaces, called master and slave. Master
interfaces are cores that are capable of generating bus cycles, while slave interfaces are capable of receiving
bus cycles. Some relevant Wishbone features that are worth mentioning are the multi-master capability
which enables multiprocessing, the arbitration methodology defined by end users according to their
needs, and the scalable data bus widths and operand sizes. Moreover, the hardware implementation of
bus interfaces is simple and compact, and the hierarchical view of the Wishbone architecture supports
structured design methodologies [12].
The hardware implementation supports various IP core interconnection schemes, including: point-to-
point connection, shared bus, crossbar switch implementation, data flow interconnection, and off-chip
interconnection. The crossbar switch interconnection is usually used when connecting two or more
masters together so that every one can access two or more slaves. In this scheme, the master initiates an
addressable bus cycle to a target slave. The crossbar switch interconnection allows more than one master
to use the bus provided they do not access the same slave. In this way, the master requests a channel on
the switch and, once this is established, data is transferred in a point-to-point way.
On one hand the overall data transfer rate of the crossbar switch is higher than shared bus mechan-
isms, and can be expanded to support extremely high data transfer rates. On the other hand, the main
disadvantage is a more complex interconnection logic and routing resources.
20.5.1 The Wishbone Bus Transactions
The Wishbone architecture defines different transaction cycles according to the action performed (read
or write) and the blocking/nonblocking access. For instance, single read/write transfers are carried out as
follows. The master requests the operation and places the slave address onto the bus. Then the slave places
data onto the data bus and asserts an acknowledge signal. The master monitors this signal and releases the
request signals when data have been latched. Two or more back-to-back read/write transfers can also be
strung together. In this case, the starting and stopping points of the transfers are identified by the assertion
and negation of a specific signal [13].
A Read-Modify-Write (RMW) transfer is also specified, which can be used in multiprocessor and
multitasking systems in order to allow multiple software processes to share common resources by using
semaphores. This is commonly done on interfaces for disk controllers, serial ports, and memory. The
RMW transfer reads and writes data to a memory location in a single bus cycle. For the correct imple-
mentation of this bus transaction, shared bus interconnects have to be designed in such a way that
once the arbiter grants the bus to a master, it will not rearbitrate the bus until the current master
gives it up. Also, it is important to note that a master device must support the RMW transfer in
order to be effective, and this is generally done by means of special instructions forcing RMW bus
transactions.
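A sketch of why the RMW cycle must be atomic with respect to arbitration, using a test-and-set semaphore (the class and method names are illustrative; Wishbone itself defines signals, not this API):

```python
class WishboneSharedBus:
    """Behavioral sketch of an RMW semaphore access on a shared bus:
    the arbiter must not rearbitrate between the read and the write
    halves of the cycle, or two masters could both 'acquire' it."""

    def __init__(self):
        self.memory = {0x10: 0}   # 0 = semaphore free, 1 = taken
        self.locked_by = None

    def rmw_test_and_set(self, master, addr):
        """Read-modify-write in one locked bus cycle; returns True if
        this master acquired the semaphore."""
        assert self.locked_by in (None, master), "bus locked by another master"
        self.locked_by = master           # arbiter will not rearbitrate
        old = self.memory[addr]           # read phase
        self.memory[addr] = 1             # write phase, same bus cycle
        self.locked_by = None             # cycle ends, bus released
        return old == 0

bus = WishboneSharedBus()
print(bus.rmw_test_and_set("cpu0", 0x10))  # acquires the semaphore
print(bus.rmw_test_and_set("cpu1", 0x10))  # semaphore already taken
```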
20.6 SiliconBackplane MicroNetwork
SiliconBackplane MicroNetwork is a family of innovative communication architectures licensed by Sonics
for use in SoC design. The Sonics architecture provides CPU independence, true mix-and-match of IP
cores, a unified communication medium, and a structure that makes an SoC design simpler to partition,
analyze, design, verify, and test [14].
The SiliconBackplane MicroNetwork allows high-speed pipelined transactions (the data bandwidth of the
interconnect scales from 50 Mbyte/sec to 4.8 Gbyte/sec) where the real-time Quality of Service (QoS) of
multiple simultaneous data flows is guaranteed. A network utilization of up to 90% can be achieved.
The SiliconBackplane relies on the SonicsStudio development environment for architectural explor-
ation, and the availability of pre-characterization results enables reliable performance analysis and
reduction of interconnect timing closure uncertainties. The ultimate goal is to avoid over-designing
interconnects.
The architecture can be described as a distributed communication infrastructure (thus facilitating place-
and-route) which can be extended hierarchically in the form of Tiles (collection of functions requiring
minimal assistance from the rest of the die) in an easy way. Among other features, the SiliconBackplane
MicroNetwork provides advanced error handling in hardware (features for SoC-wide error detection
and support mechanisms for software clean-up and recovery of unresponsive cores), runtime
reconfiguration to meet changing application demands, and data multicast.
The SiliconBackplane system consists of a physical interconnect bus configured with a combination
of agents. Each IP core communicates with an attached agent through ports implementing
the Open Core Protocol (OCP) standard interface. The agents then communicate with each other
using a network of interconnects based on the SiliconBackplane protocol. This latter includes paten-
ted transfer mechanisms aiming at maximizing interconnect bandwidth utilization and optimized for
streaming multimedia applications [15]. Figure 20.6 shows a schematic view of the SiliconBackplane
system.
2006 by Taylor & Francis Group, LLC
SoC Communication Architectures 20-13
FIGURE 20.6 Schematic view of the SiliconBackplane system.
A few specific components can be identified in an agent architecture:
Initiators, which implement the interface between the bus and a master core (CPU, DSP, DMA, etc.).
The initiator receives requests from the OCP, then transmits the requests according to the SiliconBackplane
standard, and finally processes the responses from the target.
Targets, which implement the interface between the physical bus and a slave device (memories, UARTs,
etc.). This module serves as the bridge between the system and the OCP.
Service agents, which are enhanced initiators providing additional capabilities such as debug and test.
20.6.1 System Interconnect Bandwidth
One of the most interesting features of the SiliconBackplane network is the possibility of allocating
bandwidth based on a two-level arbitration policy. The system designer can preallocate bandwidth to
high-priority initiators by means of Time-Division Multiple Access (TDMA). An initiator
agent with a preassigned time slot has first rights over that slot. If the owner does not need it, the slot is
reallocated in a round-robin fashion to one of the system devices; this represents the second level of
the arbitration policy.
The TDMA approach provides fast access to variable-latency subsystems and is a simple mechanism
to guarantee QoS. The TDMA bandwidth allocation tables are stored in a configuration register at every
initiator, and can be dynamically overwritten to fit the system needs. On the other hand, the fair round-robin
allocation scheme can be used to guarantee bandwidth availability to initiators with less predictable
access patterns, since some or many of the TDMA slots may turn out to be left unallocated. A round-robin
arbitration policy is particularly suitable for best-effort traffic.
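The two-level policy described above can be sketched in a few lines of Python. This is an illustrative model only; the table layout and names are assumptions, not Sonics' actual implementation:

```python
# Sketch of SiliconBackplane-style two-level arbitration: a TDMA wheel
# grants preallocated slots; an unclaimed slot falls back to round-robin
# among the requesting initiators. Names are illustrative.

def make_arbiter(tdma_table, initiators):
    """tdma_table[i] = initiator owning slot i, or None if unallocated."""
    rr_next = 0  # round-robin pointer (second arbitration level)

    def grant(slot, requests):
        """Return the initiator granted the bus for this slot."""
        nonlocal rr_next
        owner = tdma_table[slot % len(tdma_table)]
        if owner is not None and owner in requests:
            return owner          # first level: the slot owner has priority
        # second level: reallocate the slot round-robin
        for k in range(len(initiators)):
            cand = initiators[(rr_next + k) % len(initiators)]
            if cand in requests:
                rr_next = (initiators.index(cand) + 1) % len(initiators)
                return cand
        return None               # idle slot

    return grant

grant = make_arbiter(["DSP", None, "CPU", None], ["CPU", "DSP", "DMA"])
print(grant(0, {"DSP", "DMA"}))  # slot 0 is owned by DSP -> DSP
print(grant(1, {"CPU", "DMA"}))  # unallocated slot -> round-robin pick
```

Note how an owned but unclaimed slot (the owner is not requesting) degrades gracefully into a best-effort round-robin grant, which is the behavior the text describes.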
20.6.2 Configuration Resources
All the configurable IP cores implemented in the SiliconBackplane system can be configured either at
compile time or dynamically by means of specific configuration registers. These configuration devices are
accessible by the operating system.
Configuration registers are individually set for each agent, depending upon the services provided to the
attached cores. The types of configuration registers are:
Unbuffered registers hold configuration values for the agent or its subsystem core.
Buffered registers hold configuration values that must be simultaneously updated in all agents.
Broadcast configuration registers hold values that must remain identical in multiple agents.
20-14 Embedded Systems Handbook
20.7 Other On-Chip Interconnects
20.7.1 Peripheral Interconnect Bus
The PI Bus was developed by several European semiconductor companies (Advanced RISC Machines,
Philips Semiconductors, SGS-THOMSON Microelectronics, Siemens, TEMIC/MATRA MHS) within the
framework of a European project (OMI, the Open Microprocessor Initiative).1 Philips has since developed an
extended, backward-compatible PI Bus protocol standard that is frequently used in many hardware
systems [16].
The high bandwidth and low overhead of the PI Bus provide a comfortable environment for connecting
processor cores, memories, coprocessors, I/O controllers, and other functional blocks in high-performance
chips for time-critical applications.
The PI Bus functional modules are arranged in macrocells, and a wide range of functions are provided.
Macrocells with a PI Bus interface can be easily integrated into a chip layout even if they are designed by
different manufacturers.
The potential bus agents require only a PI Bus interface of low complexity. Since there is no concrete
implementation specied, PI Bus can be adapted to the individual requirements of the target chip design.
For instance, the widths of the address and data bus may be varied. The main features of this bus are:
Processor independent implementation and design
Demultiplexed operation
Clock synchronous
Peak transfer rate of 200 MB/sec (50 MHz bus clock)
Address and data bus scalable (up to 32 bits)
8-, 16-, 32-bit data access
Broad range of transfer types from single to multiple data transfers
Multi-master capability
The PI Bus does not provide cache coherency support, broadcasts, dynamic bus sizing, and unaligned
data access. Finally, the University of Sussex has developed a VHDL toolkit to meet the needs of embedded
system designers using the PI bus. Macrocell testing for PI bus compliance is also possible using the
framework available in the toolkit [17].
20.7.2 Avalon
Avalon is Altera's parameterized interface bus used by the Nios embedded processor. The Avalon switch
fabric has a set of predefined signal types with which a user can connect one or more IP blocks. It can
only be implemented on Altera devices using SOPC Builder, a system development tool that automatically
generates the Avalon switch fabric logic [18].
The Avalon switch fabric enables simultaneous multi-master operation for maximum system performance
by using a technique called slave-side arbitration. It determines which master gains access to a certain
slave in the event that multiple masters attempt to access the same slave at the same time. Therefore,
simultaneous transactions for all bus masters are supported, and arbitration for peripherals or memory
interfaces that are shared among masters is automatically included.
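A toy model can make slave-side arbitration concrete. The master names and the fixed-priority scheme below are illustrative assumptions, not Avalon's actual arbitration rules:

```python
# Illustrative model of slave-side arbitration: each slave port has its
# own arbiter, so masters addressing different slaves are all granted in
# the same cycle; arbitration happens only when two masters contend for
# the same slave.

def slave_side_arbitrate(requests, priority):
    """requests: {master: slave}; priority: master list, highest first.
    Returns {slave: granted_master} for one cycle."""
    grants = {}
    for master in priority:            # per-slave fixed-priority arbiter
        slave = requests.get(master)
        if slave is not None and slave not in grants:
            grants[slave] = master     # highest-priority requester wins
    return grants

# Three masters; two of them contend for the same memory:
cycle = slave_side_arbitrate(
    {"cpu": "sdram", "dma": "sdram", "dsp": "uart"},
    priority=["dma", "cpu", "dsp"],
)
print(cycle)  # sdram goes to dma, uart to dsp; cpu waits one cycle
```

The key point matches the text: the dsp/uart transaction proceeds in parallel with the dma/sdram one, instead of being serialized behind a single bus-wide arbiter.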
The Avalon interconnect includes chip-select signals for all peripherals, even user-defined peripherals,
to simplify the design of the embedded system. Separate, dedicated address and data paths provide an easy
interface to on-chip user logic. User-defined peripherals are not required to decode data and address bus
cycles. Dynamic bus sizing allows developers to use low-cost, narrow memory devices that do not match
the native bus size of their CPU. The switch fabric supports each type of transfer supported by the Avalon
interface. Each peripheral port into the switch is generated with a reduced amount of logic to meet the
requirements of the peripheral, including wait-state logic, data width matching, and passing of wait signals.
1 The PI Bus has been incorporated as OMI Standard OMI 324.3D.
Read and write operations with latency can be performed. Latent transfers are useful to masters wanting
to issue multiple sequential read or write requests to a slave, which may require multiple cycles for the first
transfer but fewer cycles for subsequent sequential transfers. This can be beneficial for instruction-fetch
operations and DMA transfers to or from SDRAM. In these cases, the CPU or DMA master may prefetch
(post) multiple requests prior to completion of the first transfer and thereby reduce overall access latency.
Interestingly, the Avalon interface includes signals for streaming data between master/slave pairs. These
signals indicate the peripheral's capacity to provide or accept data. A master does not have to access
status registers in the slave peripheral to determine whether the slave can send or receive data. Streaming
transactions maximize throughput between master-slave pairs, while avoiding data overflow or underflow
on the slave peripherals. This is especially useful for DMA transfers [19].
20.7.3 CoreFrame
The CoreFrame architecture has been developed by Palmchip Corporation and relies on point-to-point
signals and multiplexing instead of shared tristate lines. It aims at delivering high performance while
simultaneously reducing design and verification time. The distinctive features of CoreFrame are [20]:
400 MB/sec bandwidth at 100 MHz (bus speed is scalable to technology and design requirements)
Unidirectional buses only
Central, shared memory controller
Single clock cycle data transfers
Zero wait state register accesses
Separate peripheral I/O and DMA buses
Simple protocol for reduced gate count
Low-capacitive loading for high-frequency operation
Hidden arbitration for DMA bus masters
Application-specific memory map and peripherals
The most distinctive feature of CoreFrame is the separation of I/O and memory transfers onto different
buses. The PalmBus provides the I/O backplane and allows the processor to configure and control
peripheral blocks, while the MBus provides a DMA connection from peripherals to main memory, allowing
direct data transfer without processor intervention.
Other on-chip interconnects are not described here owing to lack of space: IPBus from IDT [21], IP
Interface from Motorola [22], MARBLE asynchronous bus from University of Manchester [23], Atlantic
from Altera [24], ClearConnect from ClearSpeed Techn. [25], and FISPbus from Mentor Graphics [26].
20.8 Analysis of Communication Architectures
Traditional SoC interconnects, as exemplified by AMBA AHB, are based upon low-complexity shared
buses, in an attempt to minimize area overhead. Such architectures, however, are not adequate to support
the trend for SoC integration, motivating the need for more scalable designs. Interconnect performance
improvement can be achieved by adopting new topologies and by choosing new protocols, at the expense
of silicon area. The former strategy leads from shared buses to bridged clusters, partial or full crossbars,
and eventually to NoC, in an attempt to increase available bandwidth and to reduce local contention. The
latter strategy instead tries to maximize link utilization by adopting more sophisticated control schemes,
thus permitting better sharing of existing resources. While both approaches can be followed at the
same time, we analyze them separately for the sake of clarity.
First, the scalability of evolving interconnect fabric protocols is assessed. Three state-of-the-art shared
buses are stressed under an increasing traffic load: a traditional AMBA AHB link, and the more advanced,
but also more expensive, evolutionary solutions offered by STBus (Type 3) and AMBA AXI (based upon
a Synopsys implementation).
These system interconnects were selected for analysis because of their distinctive features, which make it
possible to sketch the evolution of shared-bus-based communication architectures. AMBA AHB makes two data
links (one for read, one for write) available, but only one of them can be active at any time. Only one bus
master can own the data wires at any time, preventing the multiplexing of requests and responses on the
interconnect signals. Transaction pipelining (i.e., split ownership of data and address lines) is provided,
but not as a means of allowing multiple outstanding requests, since address sampling is only allowed at
the end of the previous data transfer. Bursts are supported, but only as a way to cut down on rearbitration
times, and AHB slaves do not have a native burst notion. Overall, AMBA AHB is designed for a low silicon
area footprint.
The STBus interconnect (with a shared bus topology) implements split request and response channels.
This means that, while a system initiator is receiving data from an STBus target, another one can issue
a second request to a different target. As soon as the response channel frees up, the second request can
immediately be serviced, thus hiding target wait states behind those of the first transfer. The number of
saved wait states depends on the depth of the prefetch FIFO buffers on the slave side. Additionally, the
split-channel feature allows for multiple outstanding requests by masters, with support for out-of-order
retirement. An additional relevant feature of STBus is its low-latency arbitration, which is performed in a
single cycle.
Finally, AMBA AXI builds upon the concept of point-to-point connection and exhibits complex features,
such as multiple outstanding transaction support (with out-of-order or in-order delivery selectable by
means of transaction IDs) and time interleaving of traffic toward different masters on internal data lanes.
Four different logical monodirectional channels are provided in AXI interfaces, and activity on them can
be parallelized, allowing multiple outstanding read and write requests. In our protocol exploration, to
provide a fair comparison, a shared bus topology is assumed, which comprises a single internal lane
for each of the four AXI channels.
Figure 20.7 shows an example of the efficiency improvements made possible by advanced interconnects
in the test case of slave devices having two wait states, with three system processors and four-beat burst
FIGURE 20.7 Concept waveforms showing burst interleaving for the three interconnects. (a) AMBA AHB, (b) STBus
(with minimal buffering), (c) STBus (with more buffering), and (d) AMBA AXI.
transfers. AMBA AHB has to pay two cycles of penalty per transferred datum. STBus is able to hide the
latencies of subsequent transfers behind those of the first one, with an effectiveness which is a function of
the available buffering. AMBA AXI is capable of interleaving transfers by sharing data channel ownership
in time. Under conditions of peak load, when transactions always overlap, AMBA AHB is limited to
a 33% efficiency (transferred words over elapsed clock cycles), while both STBus and AMBA AXI can
theoretically reach 100% throughput.
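These figures follow from simple cycle counting; the helper below (an illustrative back-of-the-envelope model, not a simulator) reproduces them for four-beat bursts with two wait states:

```python
# Back-of-the-envelope check of the efficiency figures above. AMBA AHB
# pays the wait states on every datum; a split/interleaved protocol
# (STBus, AXI) can fill one transfer's wait cycles with other masters'
# data beats.

def ahb_efficiency(beats, wait_states):
    """Words transferred over elapsed cycles when every datum costs
    (1 + wait_states) cycles and nothing overlaps."""
    return beats / (beats * (1 + wait_states))

def interleaved_efficiency(wait_states, masters):
    """With enough overlapping masters, the data bus never idles."""
    if masters >= 1 + wait_states:
        return 1.0                      # fully interleaved: 100%
    return masters / (1 + wait_states)  # partially filled wait cycles

print(ahb_efficiency(4, 2))             # 4 words in 12 cycles ~ 0.33
print(interleaved_efficiency(2, 3))     # 1.0: STBus/AXI at peak load
```

With two wait states per datum, three overlapping processors are exactly enough to keep the data lines busy every cycle, which is why the test case above can saturate STBus and AXI at 100%.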
20.8.1 Scalability Analysis
SystemC models of AMBA AHB, AMBA AXI (provided within the Synopsys CoCentric/DesignWare
suites [27]), and STBus are used within the framework of the MPARM simulation platform [28–30]. For the
STBus model, the depth of the FIFOs instantiated by the target side of the interconnect is a configurable
parameter; their impact can be noticed in the concept waveforms of Figure 20.7. One-stage (STBus hereafter)
and four-stage (STBus [B]) FIFOs were benchmarked.
The simulated on-chip multiprocessor consists of a configurable number of ARM cores attached to the
system interconnect. Traffic workload and pattern can easily be tuned by running different benchmark
code on the cores, by scaling the number of system processors, or by changing the amount of processor
cache, which leads to different amounts of cache refills. Slave devices are assumed to introduce one wait
state before responses.
To assess interconnect scalability, a benchmark independently but concurrently runs on every system
processor, performing accesses to its private slave (involving bus transactions). This means that, while
producing real functional traffic patterns, the test setup was not constrained by bottlenecks owing to
shared slave devices.
Scalability properties of the system interconnects can be observed in Figure 20.8, reporting the execution
time variation when attaching an increasing number of system cores to a single shared interconnect under
heavy traffic load. Core caches are kept very small (256 bytes) in order to cause many cache misses
and therefore significant levels of interconnect congestion. Execution times are normalized against those
for a two-processor system, trying to isolate the scalability factor alone. The heavy bus congestion case is
considered here because the same analysis performed under light traffic conditions (e.g., with 1 kB caches)
shows that all of the interconnects perform very well (they all always stay close to 100%), with only AHB
showing a moderate performance decrease of 6% when moving from two to eight running processors.
With 256-byte caches, the resulting execution times, as Figure 20.8 shows, get 77% worse for AMBA
AHB when moving from two to eight cores, while AXI and STBus manage to stay within 12% and 15%. The
impact of FIFOs in STBus is noticeable, since the interconnect with minimal buffering shows execution
times 36% worse than in the two-core setup. The reason behind the behavior pointed out in Figure 20.8 is
that, under heavy traffic load and with many processors, interconnect saturation takes place. This is clearly
indicated in Figure 20.9, which reports the fraction of cycles during which some transaction was pending
on the bus with respect to total execution time.
In such a congested environment, as Figure 20.10 shows, AMBA AXI and STBus (with four-stage FIFOs) are
able to achieve transfer efficiencies (defined as data actually moved over bus contention time) of up to 81%
and 83%, respectively, while AMBA AHB reaches only 47%, near its maximum theoretical efficiency
of 50% (one wait state per data word). These plots stress the impact that comparatively low-area-overhead
optimizations can sometimes have in complex systems.
According to simulation results, some of the advanced features in AMBA AXI provide highly scalable
bandwidth, but at the price of latency in low-contention setups. Figure 20.11 shows the minimum and
average number of cycles required to complete a single write and a burst read transaction in STBus
and AMBA AXI. STBus has a minimal overhead for transaction initiation, as low as a single cycle if
communication resources are free. This is confirmed by figures showing a best-case three-cycle latency for
single accesses (initiation, wait state, data transfer) and a nine-cycle latency for four-beat bursts. AMBA
AXI, owing to its complex channel management and arbitration, requires more time to initiate and close
a transaction: recorded minimum completion times are 6 and 11 cycles for single writes and burst reads,
FIGURE 20.8 Execution times with 256-byte caches.
FIGURE 20.9 Bus busy time with 256-byte caches.
respectively. As bus traffic increases, the completion latencies of AMBA AXI and STBus become more and
more similar, because the bulk of transaction latency is spent in contention.
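The quoted best-case STBus figures fit a simple linear model. The model below is a hypothetical reconstruction from the reported numbers, not a formula from either specification:

```python
# Latency model inferred from the figures above: one initiation phase
# followed by (wait state + data) cycles per beat, with slaves inserting
# one wait state per beat.

def best_case_latency(init_cycles, beats, wait_states=1):
    """Uncontended completion time of a transaction, in cycles."""
    return init_cycles + beats * (wait_states + 1)

print(best_case_latency(1, 1))  # 3: STBus single access
print(best_case_latency(1, 4))  # 9: STBus four-beat burst
print(best_case_latency(4, 1))  # 6: AXI single write, inferred overhead
```

Note that the recorded 11-cycle AXI burst read does not fit the same constant-overhead model (it would predict 12 cycles), suggesting that part of AXI's channel-management cost is amortized over the burst.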
It must be pointed out, however, that protocol improvements alone cannot overcome the intrinsic
performance bound owing to the shared nature of the interconnect resources. While protocol features can
push the saturation boundary further and get near to 100% efficiency, traffic loads taking advantage of
more parallel topologies will always exist. The charts reported here already show some traces of saturation
even for the most advanced interconnects. However, the improved performance achieved by more parallel
topologies strongly depends on the kind of bus traffic. In fact, if the traffic is dominated by accesses to
FIGURE 20.10 Bus usage efficiency with 256-byte caches.
FIGURE 20.11 Transaction completion latency with 256-byte caches.
shared devices (shared memory, semaphores, interrupt module), these accesses have to be serialized anyway, thus
reducing the effectiveness of area-hungry parallel topologies. It is therefore evident that crossbars behave
best when data accesses are local and no destination conflicts arise.
This is reflected in Figure 20.12, showing average completion latencies of read accesses for different bus
topologies: shared buses (AMBA AHB and STBus), partial crossbars (STBus-32 and STBus-54), and full
crossbars (STBus-FC). Four benchmarks are considered, consisting of matrix multiplications performed
independently by each processor or in a pipeline, with or without an underlying OS (Operating System)
FIGURE 20.12 Average read latency.
(OS-IND, OS-PIP, ASM-IND, and ASM-PIP, respectively). IND benchmarks do not give rise to interprocessor
communication, which is instead at the core of the PIP benchmarks. Communication goes through
the shared memory. Moreover, OS-assisted code implicitly uses both semaphores and interrupts, while
standalone ASM applications rely on an explicit semaphore polling mechanism for synchronization purposes.
Crossbars show a substantial advantage in the OS-IND and ASM-IND benchmarks, wherein processors
only access private memories: this operation is obviously suitable for parallelization. Both ST-FC and ST-54
achieve the minimum theoretical latency, since no conflict on private memories ever arises. ST-32 trails
immediately behind ST-FC and ST-54, with rare conflicts which do not occur systematically because execution
times shift among conflicting processors. OS-PIP still shows significant improvement for crossbar
designs. ASM-PIP, in contrast, puts ST-BUS at the same level as crossbars, and sometimes the shared bus
even proves slightly faster. This can be explained by the continuous semaphore polling performed by
this (and only this) benchmark; while crossbars may have an advantage in private memory accesses, the
resulting speedup only gives processors more opportunities to poll the semaphore device, which becomes
a bottleneck. The unpredictability of conflict patterns can then explain why a simple shared bus can sometimes
slightly outperform crossbars; therefore, the selection of bus topology should carefully match the target
communication pattern.
20.9 Packet-Switched Interconnection Networks
Previous sections have illustrated on-chip interconnection schemes based on shared buses and on
evolutionary communication architectures. This section introduces a more revolutionary approach to on-chip
communication, known as Network-on-Chip [2,3].
The NoC architecture consists of a packet-switched interconnection network integrated onto a single
chip, and it is likely to better support the trend for SoC integration. The basic idea is borrowed from
the wide-area networks domain, and envisions router (or switch)-based networks of interconnects on
which on-chip packetized communication takes place. Cores access the network by means of proper
interfaces, and have their packets forwarded to destination through a certain number of hops. SoCs
differ from wide area networks in their local proximity and because they exhibit less nondeterminism.
Local, high-performance networks such as those developed for large-scale multiprocessors have
similar requirements and constraints. However, some distinctive features, such as energy constraints and
design-time specialization, are unique to SoC networks.
Topology selection for NoCs is a critical design issue. It is determined by how efficiently the communication
requirements of an application can be mapped onto a certain topology, and by physical-level considerations.
In fact, regular topologies can be designed with better control of electrical parameters, and therefore
of communication noise sources (such as crosstalk), although they might result in link under-utilization
or localized congestion from an application viewpoint. On the contrary, irregular topologies have to deal
with more complex physical design issues, but are more suitable for implementing customized, domain-specific
communication architectures. Two-dimensional mesh networks are a reference solution for regular NoC
topologies.
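As a concrete illustration of packet forwarding on such a regular topology, the sketch below implements dimension-ordered (XY) routing, a common deterministic scheme for 2D meshes; it is illustrative and not tied to any specific NoC discussed here:

```python
# Dimension-ordered (XY) routing on a 2D mesh: a packet first travels
# along the X axis, then along Y, visiting one switch per hop. The hop
# count therefore equals the Manhattan distance between the endpoints.

def xy_route(src, dst):
    """Return the list of (x, y) switches a packet traverses."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then the Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path

route = xy_route((0, 0), (2, 1))
print(route)             # [(0, 0), (1, 0), (2, 0), (2, 1)]
print(len(route) - 1)    # 3 hops = |dx| + |dy|
```

XY routing is popular in regular meshes precisely because it is simple, deadlock-free, and needs no routing tables in the switches, at the cost of ignoring congestion along the fixed path.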
The scalable and modular nature of NoCs and their support for efficient on-chip communication
potentially lead to NoC-based multiprocessor systems characterized by high structural complexity and
functional diversity. On one hand, these features need to be properly addressed by means of new design
methodologies; on the other hand, more effort has to be devoted to modeling on-chip communication
architectures and integrating them into a single modeling and simulation environment combining
both processing elements and communication architectures. The development of NoC architectures and
their integration into a complete MPSoC design flow is the main focus of an ongoing worldwide research
effort [30–33].
20.10 Conclusions
This chapter addresses the critical issue of on-chip communication for gigascale MPSoCs. An overview of
the most widely used on-chip communication architectures is provided, and evolution guidelines aiming
at overcoming scalability limitations are sketched. Advances concern both communication protocols and
topologies, although it is becoming clear that, in the long term, more aggressive approaches, namely
packet-switched interconnection networks, will be required to sustain system performance.
References
[1] R. Ho, K.W. Mai, and M.A. Horowitz. The future of wires. Proceedings of the IEEE, 89:
490–504, 2001.
[2] L. Benini and G. De Micheli. Networks on chips: a new SoC paradigm. IEEE Computer, 35:
70–78, 2002.
[3] J. Henkel, W. Wolf, and S. Chakradhar. On-chip networks: a scalable, communication-centric
embedded system design paradigm. In Proceedings of the International Conference on VLSI Design,
January 2004, pp. 845–851.
[4] ARM. AMBA Specication v2.0, 1999.
[5] ARM. AMBA Multi-Layer AHB Overview, 2001.
[6] ARM. AMBA AXI Protocol Specication, 2003.
[7] IBM Microelectronics. CoreConnect Bus Architecture Overview, 1999.
[8] G.W. Doerre and D.E. Lackey. The IBM ASIC/SoC methodology. A recipe for first-time success.
IBM Journal of Research & Development, 46: 649–660, 2002.
[9] IBM Microelectronics. The CoreConnect Bus Architecture White Paper, 1999.
[10] P. Wodey, G. Camarroque, F. Barray, R. Hersemeule, and J.P. Cousin. LOTOS code generation for
model checking of STBus based SoC: the STBus interconnection. In Proceedings of ACM and IEEE
International Conference on Formal Methods and Models for Co-Design, June 2003, pp. 204–213.
[11] Richard Herveille. Combining WISHBONE Interface Signals, Application Note, April 2001.
[12] Richard Herveille. WISHBONE System-on-Chip (SoC) Interconnection Architecture for Portable IP
Cores. Specication, 2002.
[13] Rudolf Usselmann. OpenCores SoC Bus Review, 2001.
[14] Sonics Inc. μNetworks. Technical Overview, 2002.
[15] Sonics Inc. SiliconBackplane III MicroNetwork IP. Product Brief, 2002.
[16] Philip de Nier. Property checking of PI-bus modules. In Proceedings of the Workshop on Circuits,
Systems and Signal Processing (ProRISC99), J.P. Veen, Ed. STW, Technology Foundation, Mierlo,
The Netherlands, 1999, pp. 343–354.
[17] ESPRIT, 1996, http://www.cordis.lu/esprit/src/results/res_area/omi/omi10.htm
[18] Altera. AHB to Avalon & Avalon to AHB Bridges, 2003.
[19] Altera. Avalon Bus Specication, 2003.
[20] Palmchip. Overview of the CoreFrame Architecture, 2001.
[21] IDT. IDT Peripheral Bus (IPBus). Intermodule Connection Technology Enables Broad Range of
System-Level Integration, 2002.
[22] Motorola. IP Interface. Semiconductor Reuse Standard, 2001.
[23] W.J. Bainbridge and S.B. Furber. MARBLE: an asynchronous on-chip macrocell bus. Microprocessors
and Microsystems, 24: 213–222, 2000.
[24] Altera. Atlantic Interface. Functional Specication, 2002.
[25] ClearSpeed. ClearConnect Bus. Scalable High Performance On-Chip Interconnect, 2003.
[26] Summary of SoC Interconnection Buses, 2004, http://www.silicore.net/uCbusum.htm
[27] Synopsys CoCentric, 2004, http://www.synopsys.com
[28] L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino. SystemC cosimulation and
emulation of multiprocessor SoC designs. IEEE Computer, 36: 53–59, 2003.
[29] F. Poletti, D. Bertozzi, A. Bogliolo, and L. Benini. Performance analysis of arbitration policies
for SoC communication architectures. Journal of Design Automation for Embedded Systems,
8: 189–210, 2003.
[30] M. Loghi, F. Angiolini, D. Bertozzi, L. Benini, and R. Zafalon. Analyzing on-chip communication in
an MPSoC environment. In Proceedings of the IEEE Design Automation and Test in Europe Conference
(DATE04), February 2004, pp. 752–757.
[31] E. Rijpkema, K. Goossens, and A. Radulescu. Trade-offs in the design of a router with both
guaranteed and best-effort services for networks on chip. In Proceedings of Design Automation and
Test in Europe, March 2003, pp. 350–355.
[32] K. Lee et al. A 51 mW 1.6 GHz on-chip network for low-power heterogeneous SoC platform.
In ISSCC Digest of Technical Papers, 2004, pp. 152–154.
[33] E. Bolotin, I. Cidon, R. Ginosar, and A. Kolodny. QNoC: QoS architecture and design process for
network on chip. The Journal of Systems Architecture, Special Issue on Networks on Chip, 50(2–3):
105–128, February 2004.
21 Network-on-Chip Design for Gigascale Systems-on-Chip
Davide Bertozzi and
Luca Benini
University of Bologna
Giovanni De Micheli
Stanford University
21.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-1
21.2 Design Challenges for On-Chip Communication
Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-3
21.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-4
21.4 NoC Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-5
Network Link • Switch • Network Interface
21.5 NoC Topology. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-13
Domain-Specic NoC Synthesis Flow
21.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-16
Acknowledgment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-17
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21-17
21.1 Introduction
The increasing integration densities made available by shrinking device geometries will have to be
exploited to meet the computational requirements of parallel applications, such as multimedia processing,
automotive, multiwindow TV, ambient intelligence, etc.
As an example, systems designed for ambient intelligence will be based on high-speed digital signal
processing, with computational loads ranging from 10 MOPS for lightweight audio processing, through 3 GOPS
for video processing and 20 GOPS for multilingual conversation interfaces, up to 1 TOPS for synthetic
video generation. This computational challenge will have to be addressed at manageable power levels and
affordable costs [1].
Such performance cannot be provided by a single processor, but requires a heterogeneous on-chip
multiprocessor system containing a mix of general-purpose programmable cores, application-specific
processors, and dedicated hardware accelerators.
In this context, performance of gigascale Systems-on-Chip (SoC) will be communication dominated,
and only an interconnect-centric system architecture will be able to cope with this problem. Current
on-chip interconnects consist of low-cost shared arbitrated buses, based on the serialization of bus access
requests; only one master at a time can be granted access to the bus. The main drawback of this solution
is its lack of scalability, which will result in unacceptable performance degradation for complex SoCs
21-1
FIGURE 21.1 Example of NoC architecture.
(more than a dozen integrated cores). Moreover, the connection of new blocks to a shared bus increases
its associated load capacitance, resulting in more energy-consuming bus transactions.
A scalable communication infrastructure that better supports the trend of SoC integration consists
of an on-chip micronetwork of interconnects, generally known as a Network-on-Chip (NoC) architecture
[2-4]. The basic idea is borrowed from the wide-area networks domain, and envisions router (or switch)-
based networks on which on-chip packetized communication takes place, as depicted in Figure 21.1. Cores
access the network by means of proper interfaces, and have their packets forwarded to destination through
a certain number of hops.
The scalable and modular nature of NoCs and their support for efficient on-chip communication
potentially leads to NoC-based multiprocessor systems characterized by high structural complexity and
functional diversity. On one hand, these features need to be properly addressed by means of new design
methodologies [5], while on the other hand more efforts have to be devoted to modeling on-chip
communication architectures and integrating them into a single modeling and simulation environment
combining both processing elements and communication infrastructures [6-8]. These efforts are needed
to include the on-chip communication architecture in any quantitative evaluation of system design during
design space exploration [9,10], so as to be able to assess the impact of the interconnect on achieving a
target system performance.
An important design decision for NoCs regards the choice of topology. Several researchers [4,5,11,12]
envision NoCs as regular tile-based topologies (such as mesh networks and fat trees), which are suitable
for interconnecting homogeneous cores in a chip multiprocessor. However, SoC component specialization
(used by designers to optimize performance at low power consumption and competitive cost) leads to
the on-chip integration of heterogeneous cores having varied functionality, size, and communication
requirements. If a regular interconnect is designed to match the requirements of a few communication-hungry
components, it is bound to be largely overdesigned with respect to the needs of the remaining
components. This is the main reason why most current SoCs use irregular topologies, such as
bridged buses and dedicated point-to-point links [13].
This chapter introduces basic principles and guidelines for NoC design. At first, the motivation for
the design paradigm shift of SoC communication architectures from shared buses to NoCs is examined.
Then, the chapter goes into the details of NoC building blocks (switch, network interface, and switch-to-switch
links), discussing the design guidelines and presenting a case study where some of the most advanced
concepts in NoC design have been applied to a real NoC architecture (called Xpipes and developed at the
University of Bologna [14]).
Finally, the challenging issue of heterogeneous NoC design will be addressed, and the effects of mapping
the communication requirements of an application onto a domain-specific NoC, instead of a network
with regular topology, will be detailed by means of an illustrative example.
21.2 Design Challenges for On-Chip Communication
Architectures
SoC design challenges that are driving the evolution of traditional bus architectures toward NoCs can be
outlined as follows:
Technology issues. While gate delays scale down with technology, global wire delays typically increase or
remain constant as repeaters are inserted. It is estimated that in 50 nm technology, at a clock frequency
of 10 GHz, a global wire delay might range from 6 to 10 clock cycles [2]. Therefore, limiting the on-chip
distance traveled by critical signals will be key to guaranteeing the performance of the overall system, and will
be a common design guideline for all kinds of system interconnects. In contrast, other challenges
posed by deep submicron technologies are leading to a paradigm shift in the design of SoC communication
architectures. For instance, global synchronization of cores on future SoCs will be unfeasible due to deep
submicron effects (clock skew, power associated with the clock distribution tree, etc.), and an alternative
scenario consists of self-synchronous cores that communicate with one another through a network-centric
architecture [15]. Finally, signal integrity issues (crosstalk, power supply noise, soft errors, etc.) will lead to
more transient and permanent failures of signals, logic values, devices, and interconnects, thus raising the
reliability concern for on-chip communication [16]. In many cases, on-chip networks can be designed as
regular structures, allowing electrical parameters of wires to be optimized and well controlled. This leads
to lower communication failure probabilities, thus enabling the use of low-swing signaling techniques [17],
and to the capability of exploiting performance optimization techniques, such as wavefront pipelining [18].
Performance issues. In traditional buses, all communication actors share the same bandwidth. As a
consequence, performance does not scale with the level of system integration, but degrades significantly.
On the other hand, once the bus is granted to a master, access occurs with no additional delay. NoCs, on
the contrary, can provide much better performance scalability. No delays are experienced in accessing the
communication infrastructure, since multiple outstanding transactions originated by multiple cores can
be handled at the same time, resulting in more efficient utilization of network resources. However, given a
certain network dimension (e.g., number of instantiated switches), large latency fluctuations for packet
delivery could be experienced as a consequence of network congestion. This is unacceptable when hard
real-time constraints of an application have to be met, and two solutions are viable: network overdimensioning
(for NoCs designed to support Best Effort [BE] traffic only) or implementation of dedicated mechanisms
to provide guarantees for timing-constrained traffic (e.g., loss-less data transport, minimal bandwidth,
bounded latency, minimal throughput, etc.) [19].
Design productivity issues. It is well known that synthesis and compiler technology development do not
keep up with IC manufacturing technology development [20]. Moreover, time-to-market needs to be kept
as low as possible. Reuse of complex preverified design blocks is an efficient means to increase productivity,
and regards both computation resources and the communication infrastructure [21]. It would be highly
desirable to have processing elements that could be employed in different platforms by means of a plug-and-play
design style. To this purpose, a scalable and modular on-chip network represents a more efficient
communication infrastructure compared with shared-bus-based architectures. However, the reuse of
processing elements is facilitated by the definition of standard network interfaces, which also make the
modularity property of the NoC effective. The Virtual Socket Interface Alliance (VSIA) has attempted to set
the characteristics of this interface industry-wide [22]. Open Core Protocol (OCP) [23] is another example
of a standard interface socket for cores. It is worth remarking that such network interfaces also decouple the
development of new cores from the evolution of new communication architectures. The core developer will
not have to make assumptions about the system into which the core will be plugged. Similarly, designers of
new on-chip interconnects will not be constrained by the knowledge of detailed interfacing requirements
for particular legacy SoC components. Finally, let us observe that NoC components (e.g., switches or
interfaces) can be instantiated multiple times in the same design (as opposed to the arbiter of traditional
shared buses, which is instance-specific) and reused in a large number of products targeting a specific
application domain.
The development of NoC architectures and protocols is fueled by the aforementioned arguments,
in spite of the challenges represented by the need for new design methodologies and an increased
complexity of system design.
21.3 Related Work
The need to progressively replace on-chip buses with micronetworks was extensively discussed in [2,4].
A number of NoC architectures have been proposed in the literature so far.
Sonics MicroNetwork [24] is an on-chip network making use of communication architecture-independent
interface sockets. The MicroNetwork is an example of the evolutionary solutions [25], which
start from a physical implementation as a shared bus and propose generalizations to support higher
bandwidth (such as partial and full crossbars).
The STBUS interconnect from STMicroelectronics is another example of an evolutionary architecture; it
provides designers with the capability to instantiate either shared bus or partial or full crossbar interconnect
configurations.
Even though these architectures provide higher bandwidth than simple buses, addressing the wiring
delay and scalability challenge in the long term requires more radical solutions.
One of the earliest contributions in this area is the Maia heterogeneous signal processing architecture,
proposed by Zhang et al. [26], based on a hierarchical mesh network. Unfortunately, Maia's interconnect
is fully instance-specific. Furthermore, routing is static at configuration time: network switches are
programmed once and for all for a given application (as in a Field Programmable Gate Array [FPGA]). Thus,
communication is based on circuit switching, as opposed to packet switching.
In this direction, Dally and Lacy [27] sketch the architecture of a VLSI multicomputer using 2009 technology.
A chip with 64 processor-memory tiles is envisioned. Communication is based on packet switching.
This seminal work draws upon past experiences in designing parallel computers and reconfigurable
architectures (FPGAs and their evolutions) [28-30].
Most proposed NoC platforms are packet switched and exhibit regular structure. An example is a
mesh interconnection, which can rely on a simple layout and on the independence of the switch design
from the network size. The NOSTRUM network described in Reference 5 takes this approach: the platform includes both
a mesh architecture and the design methodology. The Scalable Programmable Integrated Network (SPIN)
described in Reference 31 is another regular, fat-tree-based network architecture. It adopts cut-through
switching to minimize message latency and storage requirements in the design of network switches. The
Linköping SoCBUS [32] is a two-dimensional mesh network that uses a packet connected circuit (PCC)
to set up routes through the network: a packet is switched through the network, locking the circuit as it
goes. This notion of virtual circuit leads to deterministic communication behavior, but restricts routing
flexibility for the rest of the communication traffic.
The need to map communication requirements of heterogeneous cores may lead to the adoption of
irregular topologies. The motivation for such architectures lies in the fact that each block can be optimized
for a specific application (e.g., video or audio processing), and link characteristics can be adapted
to the communication requirements of the interconnected cores. Supporting heterogeneous architectures
requires a major design effort and leads to coarser-granularity control of physical parameters. Many recent
heterogeneous SoC implementations are still based on shared buses (such as the single-chip MPEG-2 codec
reported in Reference 33), but the growing complexity of customizable media embedded processor architectures
for digital media processing will soon require NoC-based communication architectures and proper
hardware/software development tools. The Aethereal NoC design framework presented in Reference 34
aims at providing a complete infrastructure for developing heterogeneous NoCs with end-to-end quality
of service guarantees. The network supports guaranteed throughput (GT) for real-time applications and
BE traffic for timing-unconstrained applications.
Support for heterogeneous architectures requires highly configurable network building blocks, customizable
at instantiation time for a specific application domain. For instance, the Proteo NoC [35] consists of
a small library of predefined, parameterized components that allow the implementation of a large range
of different topologies, protocols, and configurations.
The Xpipes interconnect [14] and its synthesizer XpipesCompiler [36] push this approach to the limit by
instantiating an application-specific NoC from a library of composable soft macros (network interface,
link, and switch). The components are highly parameterizable and provide reliable and latency-insensitive
operation.
21.4 NoC Architecture
Most of the terminology for on-chip packet-switched communication is adapted from the computer network
and multiprocessor domains. Messages that have to be transmitted across the network are usually
partitioned into fixed-length packets. Packets in turn are often broken into message flow control units called
flits. In the presence of channel width constraints, multiple physical channel cycles can be used to transfer
a single flit. A phit is the unit of information that can be transferred across a physical channel in a single
step. Flits represent logical units of information, as opposed to phits, which correspond to physical quantities.
In many implementations, a flit is set to be equal to a phit. The basic building blocks for packet-switched
communication across NoCs are:
1. Network link
2. Switch
3. Network interface
and will be described hereafter.
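The message/packet/flit/phit hierarchy above can be illustrated with a small sketch; the sizes used below are arbitrary choices for the example, not parameters prescribed by any particular NoC.

```python
# Sketch of the message -> packet -> flit -> phit decomposition.
# All sizes (bytes) are illustrative assumptions.

def packetize(message: bytes, packet_size: int, flit_size: int, phit_size: int):
    """Split a message into fixed-length packets, packets into flits,
    and report how many physical-channel cycles (phits) one flit needs."""
    packets = [message[i:i + packet_size]
               for i in range(0, len(message), packet_size)]
    flits = [[p[i:i + flit_size] for i in range(0, len(p), flit_size)]
             for p in packets]
    # Each flit needs ceil(flit_size / phit_size) channel cycles.
    cycles_per_flit = -(-flit_size // phit_size)
    return packets, flits, cycles_per_flit

packets, flits, cycles = packetize(b"x" * 64, packet_size=32,
                                   flit_size=8, phit_size=4)
```

With a channel half as wide as a flit, each flit takes two phit transfers; when the channel is as wide as a flit, a flit equals a phit and `cycles` would be 1.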
21.4.1 Network Link
The performance of interconnect is a major concern in scaled technologies. As geometries shrink, gate
delay improves much faster than the delay in long wires. Therefore, the long wires increasingly determine
the maximum clock rate, and hence performance, of the entire design. The problem becomes particularly
serious for domain-specic heterogeneous SoCs, where the wire structure is highly irregular and may
include both short and extremely long switch-to-switch links. Moreover, it has been estimated that only a
fraction of the chip area (between 0.4 and 1.4%) will be reachable in one clock cycle [37].
A solution to overcome the interconnect-delay problem consists of pipelining interconnects [38,39].
Wires can be partitioned into segments by means of relay stations (which have a function similar to that of
the latches on a pipelined data path) whose length satisfies predefined timing requirements (e.g., the desired
clock speed of the design). In this way, link delay is turned into latency, and the data introduction rate is no
longer bounded by the link delay. Now, the latency of a channel connecting two modules may end up
being more than one clock cycle. Therefore, if the functionality of the design is based on the sequencing of
the signals and not on their exact timing, then link pipelining does not change the functional correctness
of the design. This requires the system to be made of modules whose behavior does not depend on the
latency of the communication channels (latency-insensitive operation). As a consequence, the use of
interconnect pipelining can be seen as part of a new and more general methodology for deep submicron
(DSM) designs, which can be envisioned as synchronous distributed systems composed of functional
modules that exchange data on communication channels according to a latency-insensitive protocol. This
protocol ensures that functionally correct modules behave correctly independently of the channel latencies
[38]. The effectiveness of the latency-insensitive design methodology is strongly related to the ability of
maintaining a sufficient communication throughput in the presence of increased channel latencies.
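As a toy illustration of how link delay becomes latency without limiting the data introduction rate, a pipelined link can be modeled as a shift register; the class name and stage count below are invented for the example.

```python
# Minimal model of a pipelined link: a shift register of N repeater stages.
# Latency grows with N, but one flit can still be injected every cycle,
# i.e., the data introduction rate is decoupled from the wire delay.

class PipelinedLink:
    def __init__(self, stages: int):
        self.stages = [None] * stages  # one register per wire segment

    def cycle(self, flit=None):
        """Advance one clock: shift every stage; return the flit (if any)
        that reaches the far end of the link."""
        out = self.stages[-1]
        self.stages = [flit] + self.stages[:-1]
        return out

link = PipelinedLink(stages=3)
received = [link.cycle(f) for f in ["A", "B", "C", "D", "E", None, None, None]]
# The first flit arrives after 3 cycles; thereafter one flit per cycle.
```

A longer link only lengthens the initial run of empty cycles; the steady-state throughput of one flit per cycle is unchanged, which is the essence of trading delay for latency.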
The International Technology Roadmap for Semiconductors (ITRS) 2001 [15] assumes that interconnect
pipelining is the strategy of choice in its estimates of achievable clock speeds for MPUs. Some industrial
designs already make use of interconnect pipelining. For instance, the NetBurst microarchitecture of the
Pentium 4 contains instances of a stage dedicated exclusively to handling wire delays: in fact, a so-called drive
stage is used only to move signals across the chip without performing any computation and, therefore,
can be seen as a physical implementation of a relay station [40].
The Xpipes interconnect makes use of pipelined links and of latency-insensitive operation in the implementation
of its building blocks. Switch-to-switch links are subdivided into basic segments whose length
guarantees that the desired clock frequency (i.e., the maximum speed provided by a certain technology)
can be used. In this way, the system operating frequency is not bound by the delay of long links. According
to the link length, a certain number of clock cycles is needed by a flit to cross the interconnect. If network
switches are designed in such a way that their functional correctness depends on the flit arrival order and
not on flit timing, input links of the switches can be different and of any length. These design choices
are at the basis of the latency-insensitive operation of the NoC and allow the construction of an arbitrary
network topology, and hence support for heterogeneous architectures.
Figure 21.2 illustrates the link model, which is equivalent to a pipelined shift register. Pipelining has
been used both for data and control lines. The figure also illustrates how pipelined links are used to
support latency-insensitive link-level error control, ensuring robustness against communication errors.
The retransmission of a corrupted flit between two successive switches is represented. Multiple outstanding
flits propagate across the link during the same clock cycle. When flits are correctly received at the
destination switch, an ACK is propagated back to the source, and after N clock cycles (where N is the
length of the link expressed in number of repeater stages) the flit will be discarded from the buffer of
the source switch. On the contrary, a corrupted flit is NACKed and will be retransmitted in due time.
The implemented retransmission policy is go-back-N, to keep the switch complexity as low as possible.

FIGURE 21.2 Pipelined link model and latency-insensitive link-level error control.
21.4.2 Switch
The task of the switch is to carry packets injected into the network to their final destination, following a
statically defined or dynamically determined routing path. The switch transfers packets from one of its
input ports to one or more of its output ports.
Switch design is usually characterized by a power-performance trade-off: supporting high-performance
on-chip communication may require power-hungry switch memory resources. A specific
switch design may include both input and output buffers, or only one type of buffer. Input queuing uses
fewer buffers, but suffers from head-of-line blocking. Virtual output queuing has higher performance,
but at the cost of more buffers.
Network flow control (or routing mode) specifically addresses the limited amount of buffering resources
in switches. Three policies are feasible in this context [41].
In store-and-forward routing, an entire packet is received and entirely stored before being forwarded
to the next switch. This is the most demanding approach in terms of memory requirements and switch
latency. Virtual cut-through routing also requires buffer space for an entire packet, but allows lower latency
communication, in that a packet is forwarded as soon as the next switch guarantees that the complete
packet will be accepted. If this is not the case, the current router must be able to store the whole packet.
Finally, a wormhole routing scheme can be employed to reduce switch memory requirements and to
permit low-latency communication. The first flit of a packet contains routing information, and header
flit decoding enables the switches to establish the path; subsequent flits simply follow this path in a
pipelined fashion by means of switch output port reservation. A flit is passed to the next switch as soon as
enough space is available to store it, even though there is not enough space to store the whole packet. If a
certain flit faces a busy channel, subsequent flits have to wait at their current locations and are therefore
spread over multiple switches, thus blocking the intermediate links. This scheme avoids buffering the full
packet at one switch and keeps end-to-end latency low, although it is more sensitive to deadlock and may
result in low link utilization.
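The latency difference between the two extremes can be quantified with the standard zero-load formulas from the interconnection-network literature (no contention, one flit transferred per cycle per hop); the hop count and packet length below are example values.

```python
# Zero-load latency comparison: store-and-forward vs. wormhole switching.
# Assumes unit-width channels, no contention, one flit per cycle per hop.

def store_and_forward_latency(hops: int, packet_flits: int) -> int:
    # Each switch receives and stores the whole packet before forwarding.
    return hops * packet_flits

def wormhole_latency(hops: int, packet_flits: int) -> int:
    # The header flit pays one cycle per hop; body flits follow in pipeline.
    return hops + (packet_flits - 1)

sf = store_and_forward_latency(hops=5, packet_flits=16)
wh = wormhole_latency(hops=5, packet_flits=16)
```

For a 16-flit packet crossing 5 switches, store-and-forward costs 80 cycles against 20 for wormhole, which is why wormhole is attractive despite its blocking behavior under contention.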
Guaranteeing quality of service in switch operation is another important design issue, which needs
to be addressed when time-constrained (hard or soft real-time) traffic is to be supported. Throughput
guarantees or latency bounds are examples of time-related guarantees.
Contention-related delays are responsible for large fluctuations of performance metrics, and a fully
predictable system can be obtained only by means of contention-free routing schemes. With circuit
switching, a connection is set up over which all subsequent data is transported. Therefore, contention
resolution takes place at setup time at the granularity of connections, and time-related guarantees can be
given during data transport. In time-division circuit switching (see Reference 24 for an example), bandwidth is
shared by time-division multiplexing connections over circuits.
In packet switching, contention is unavoidable, since packet arrival cannot be predicted. Therefore,
arbitration mechanisms and buffering resources must be implemented at each switch, thus delaying data
in an unpredictable manner and making it difficult to provide guarantees. BE NoC architectures can
mainly rely on network overdimensioning to bound fluctuations of performance metrics.
The Aethereal NoC architecture makes use of a router that tries to combine GT and BE services [34].
The GT router subsystem is based on a time-division multiplexed circuit switching approach. A router
uses a slot table to (1) avoid contention on a link, (2) divide up bandwidth per link between connections,
and (3) switch data to the correct output. Every slot table T has S time slots (rows) and N router outputs
(columns). There is a logical notion of synchronicity: all routers in the network are in the same fixed-duration
slot. In a slot s, at most one block of data can be read/written per input/output port. In the next
slot, the read blocks are written to their appropriate output ports. Blocks thus propagate in a store-and-forward
fashion. The latency a block incurs per router is equal to the duration of a slot, and bandwidth
is guaranteed in multiples of the block size per S slots. The BE router uses packet switching, and it has been
shown that both input queuing with wormhole routing or virtual cut-through routing, and virtual output
queuing with wormhole routing, are feasible in terms of buffering cost. The BE and GT router subsystems
are combined in the Aethereal router architecture of Figure 21.3. The GT router offers a fixed end-to-end
latency for its traffic, which is given the highest priority by the arbiter. The BE router uses all the bandwidth
(slots) that has not been reserved or used by GT traffic. GT router slot tables are programmed by means
of BE packets (see the arrow labeled program in Figure 21.3). Negotiations, resulting in slot allocation, can be
done at compile time and configured deterministically at runtime. Alternatively, negotiations can be
done at runtime.
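A minimal sketch of such a slot table (S slots by N outputs) shows why the scheme is contention-free by construction; the connection names and table dimensions are invented for illustration and do not reproduce the actual Aethereal data structures.

```python
# Sketch of a GT slot table in the spirit of a TDM circuit-switched router:
# S time slots (rows) by N outputs (columns). At most one connection may
# own a given (slot, output) entry, so a link can never be contended.

S, N = 4, 3
table = [[None] * N for _ in range(S)]

def reserve(conn: str, slot: int, output: int) -> None:
    """Claim one (slot, output) entry for a connection, or fail loudly."""
    if table[slot][output] is not None:
        raise ValueError("slot already reserved: would cause contention")
    table[slot][output] = conn

def bandwidth_share(conn: str) -> float:
    # A connection's guaranteed bandwidth is its share of the S slots.
    return sum(row.count(conn) for row in table) / S

reserve("conn_a", slot=0, output=1)
reserve("conn_b", slot=1, output=1)   # same output, different slot: fine
# reserve("conn_c", slot=0, output=1) would raise, since that entry is taken.
```

Because reservations are resolved when connections are set up (at compile time or runtime), data transport itself never arbitrates, which is what makes the latency and bandwidth guarantees possible.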
A different perspective has been taken in the design of the switch for the BE Xpipes NoC. Figure 21.4
shows an example configuration with four inputs, four outputs, and two virtual channels multiplexed
across the same physical output link. A physical link is assigned to different virtual channels on a flit-by-flit
basis, thereby improving network throughput. Switch operation is latency-insensitive, in that
correct operation is guaranteed for arbitrary link pipeline depth. In fact, as explained above, network
links in the Xpipes interconnect are pipelined with a flexible number of stages, thereby decoupling the link
data introduction rate from its physical length.

FIGURE 21.3 A combined GT-BE router: (a) conceptual view; (b) hardware view.

FIGURE 21.4 Example of switch configuration with two virtual channels.
For latency-insensitive operation, the switch has virtual channel registers to store 2N + M flits, where
N is the link length (expressed as number of basic repeater stages) and M is a switch architecture-related
contribution (12 cycles in this design). The reason is that each transmitted flit has to be acknowledged
before being discarded from the buffer. Before an ACK is received, the flit has to travel across the link
(N cycles), an ACK/NACK decision has to be taken at the destination switch (a portion of the M cycles), the
ACK/NACK signal has to be propagated back (N cycles) and recognized by the source switch (the remaining
portion of the M cycles). During this time, another 2N + M flits are transmitted but not yet ACKed.
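The round-trip argument can be written out as a one-line calculation; the link length below is an example value, while M = 12 matches the figure quoted in the text.

```python
# Round-trip accounting behind the 2N + M virtual-channel buffer size:
# every flit sent before the first ACK returns must be retained for
# possible retransmission.

def required_buffer(link_stages: int, switch_cycles: int) -> int:
    """Flits in flight before an ACK can return: N cycles to cross the
    link, M cycles of ACK/NACK decision and recognition at the two
    switches, and N cycles for the ACK/NACK to propagate back."""
    n, m = link_stages, switch_cycles
    return n + m + n  # = 2N + M

buffer_flits = required_buffer(link_stages=4, switch_cycles=12)
```

For a 4-stage link this gives 20 flit registers per virtual channel; note how the cost grows linearly with link length, which is one price of latency-insensitive operation.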
Output buffering was chosen for the Xpipes switches, and the resulting architecture is reported in
Figure 21.5. It consists of a replication of the same output module, accepting all input ports as its
own inputs. Flow control signals generated by each output block are directed to a centralized module that
takes care of generating proper ACKs or NACKs for the incoming flits from the different input ports.
Each output module is deeply pipelined (seven pipeline stages) so as to maximize the operating clock
frequency of the switch. Architectural details on the pipelined output module are illustrated in Figure 21.6.
Forward flow control is used, and a flit is transmitted to the next switch only when adequate storage is
available. The CRC decoders for error detection work in parallel with switch operation, thereby
hiding their impact on switch latency.
FIGURE 21.5 Architecture of the output-buffered Xpipes switch.
FIGURE 21.6 Architecture of an Xpipes switch output module.
The first pipeline stage checks the header of incoming packets on the different input ports to determine
whether those packets have to be routed through the output port under consideration. Only matching
packets are forwarded to the second stage, which resolves contention based on a round-robin policy.
Arbitration is carried out upon receipt of the tail flits of preceding packets, so that all other flits of a
packet can be propagated without contention resolution at this stage. A NACK is generated for flits of
nonselected packets. The third stage is just a multiplexer, which selects the prioritized input port. The
following arbitration stage keeps the status of the virtual channel registers and decides whether the flits can
be stored into the registers or not. A header flit is sent to the register with more free locations, followed
by successive flits of the same packet. The fifth stage is the actual buffering stage, and the ACK/NACK
message at this stage indicates whether a flit has been successfully stored or not. The following stage takes
care of forward flow control, and finally a last arbitration stage multiplexes the virtual channels on the
physical output link.
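The round-robin contention resolution used in the second stage can be sketched behaviorally as follows; the port count and request pattern are example values, not the actual Xpipes configuration.

```python
# Minimal round-robin arbiter: the port after the previous winner gets
# the highest priority, so every persistent requester is eventually served.

class RoundRobinArbiter:
    def __init__(self, ports: int):
        self.ports = ports
        self.last = ports - 1  # so that port 0 has highest priority first

    def grant(self, requests):
        """Grant one requesting port, scanning from the port just after
        the previous winner; return None if nobody requests."""
        for offset in range(1, self.ports + 1):
            port = (self.last + offset) % self.ports
            if requests[port]:
                self.last = port
                return port
        return None

arb = RoundRobinArbiter(4)
# Ports 0, 2, and 3 keep requesting; port 1 is idle.
winners = [arb.grant([True, False, True, True]) for _ in range(3)]
```

Successive grants rotate over the active requesters (0, then 2, then 3 here), which is the fairness property that prevents any input port from being starved.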
Finally, the switch is highly parameterizable. Design parameters are: number of I/O ports, flit width,
number of virtual channels, length of switch-to-switch links, and size of output registers.
21.4.3 Network Interface
The most relevant tasks of the network interface are: (1) hiding the details of the network communication
protocol from the cores, so that they can be developed independently of the communication
infrastructure, (2) communication protocol conversion (from end-to-end to network protocol), and (3) data
packetization (packet assembly, delivery, and disassembly).
The first objective can be achieved by means of standard interfaces. For instance, the VSIA vision
[22] is to specify open standards and specifications that facilitate the integration of software and hardware
virtual components from multiple sources. Interfaces of different complexity are described in the standard,
from the Peripheral Virtual Component Interface (VCI) to the Basic VCI and the Advanced VCI.
Another example of a standard socket interfacing cores to networks is the Open Core Protocol
(OCP) [23]. Its main characteristics are a high degree of configurability, to adapt to the core's functionality,
and the independence of request and response phases, thus supporting multiple outstanding requests and
pipelining of transfers.
Data packetization is a critical task for the network interface, and has an impact on the communication
latency, besides the latency of the communication channel. The packet-preparation process consists of
building the packet header, the payload, and the packet tail. The header contains the necessary routing and network
control information (e.g., source and destination address). When source routing is used, the destination
address is ignored and replaced with a route field that specifies the route to the destination. This overhead
in terms of packet header is counterbalanced by simpler routing logic at the network switches: they
simply have to look at the route field and route the packet over the specified switch output port. The packet
tail indicates the end of a packet and usually contains parity bits for error-detecting or error-correcting
codes.
An insight into the Xpipes network interface implementation will provide an example of these concepts.
It provides a standardized OCP-based interface to network nodes. The network interface for cores that
initiate communication (initiators) needs to turn OCP-compliant transactions into packets to be transmitted
across the network. It represents the slave side of an OCP end-to-end connection, and is therefore
referred to as the network interface slave (NIS). Its architecture is shown in Figure 21.7.
FIGURE 21.7 Architecture of the Xpipes NIS.
The NIS has to build the packet header, which has to be spread over a variable number of flits depending
on the length of the path to the destination node. In fact, Xpipes relies on a static routing algorithm called
street-sign routing. Routes are derived by the network interface by accessing a look-up table based on the
destination address. The routing information consists of direction bits, read by each switch, indicating the
output port of the switch to which the flits belonging to a certain packet have to be directed.
The look-up table is accessed by the STATIC_PACKETING block, a finite state machine that forwards
the routing information numSB (number of hops to destination) and lutword (word read from the look-
up table), as well as the request-related information datastream from the initiator core, to the DP_FAST
block, provided the enable signal busy_dpfast is not asserted.
Based on the input data, module DP_FAST has the task of building the flits to be transmitted via the
output buffer BUFFER_OUT, according to the mechanism illustrated in Figure 21.8. Let us assume that a
packet requires numSB = 5 hops to get to destination, and that the direction to be taken at each switch
is expressed by DIR. Module DP_FAST builds the first flit by concatenating the flit type field with path
information. If there is some space left in the flit, it is filled with header information derived from the input
datastream. The unused part of the datastream is stored in a regpark register, so that a new datastream
can be read from the STATIC_PACKETING block. The following header and payload flits will be formed
by combining data stored in regpark and reg_datastream. No partially filled flits are transmitted, to make
transmission more efficient. Finally, module BUFFER_OUT stores flits to be sent across the network, and
allows the NIS to keep preparing successive flits when the network is congested. The size of this buffer is a
design parameter.
The response phase is carried out by means of two modules. SYNCHRO receives incoming flits and reads
out only the useful information (e.g., it discards route fields). At the same time, it contains buffering resources
FIGURE 21.8 Mechanism for building header flits.
to synchronize the network's requests to transmit the remaining packet flits with the core's consuming rate.
The RECEIVE_RESPONSE module translates the useful header and payload information into OCP-compliant
response fields.
When a read transaction is initiated by the master core, the STATIC_PACKETING block asserts a
start_receive_response signal that triggers the waiting phase of the RECEIVE_RESPONSE module for the
requested data. As a consequence, the NIS supports only one outstanding read operation, to keep interface
complexity low. Although no read-after-read transaction can be initiated until the previous one has
completed, an indefinite number of write transactions can be carried out after an outstanding read has
been initiated.
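A minimal sketch of this accept/reject policy, with an invented interface that only captures the rule (one read in flight, writes never blocked):

```cpp
// Models the outstanding-transaction rule described above: a second read is
// rejected until the first completes, while writes are always accepted.
class TransactionGate {
    bool read_outstanding_ = false;
public:
    bool try_read() {
        if (read_outstanding_) return false; // read-after-read not allowed
        read_outstanding_ = true;
        return true;
    }
    bool try_write() { return true; }        // writes are never blocked
    void read_completed() { read_outstanding_ = false; }
};
```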
The architecture of a network interface master is similar to the one just described, and is not reported
here for lack of space. At instantiation time, the main network-interface-related parameters to be set are:
total number of core blocks, flit width, and maximum number of hops across the network.
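In C++ terms, such design-time parameters resemble the template arguments of a soft macro. The class below is purely illustrative, a sketch of design-time configurability rather than the actual Xpipes code:

```cpp
#include <cstddef>

// Hypothetical parameterized network interface: the parameters named in the
// text (number of cores, flit width, maximum hop count) become compile-time
// template arguments, fixed when the soft macro is instantiated.
template <std::size_t FlitWidth, std::size_t MaxHops, std::size_t NumCores>
struct NetworkInterfaceSlave {
    static_assert(FlitWidth > 0 && MaxHops > 0, "parameters must be nonzero");
    static constexpr std::size_t flit_width = FlitWidth;
    static constexpr std::size_t max_hops = MaxHops;
    static constexpr std::size_t routing_table_entries = NumCores; // one route per core
};

// One concrete instantiation, with made-up values.
using SmallNIS = NetworkInterfaceSlave<32, 8, 12>;
```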
21.5 NoC Topology
The individual components of SoCs are inherently heterogeneous, with widely varying functionality and
communication requirements. The communication infrastructure should optimally match the communication
patterns among these components, accounting for the individual component needs.
As an example, consider the implementation of an MPEG4 decoder [42], depicted in Figure 21.9(b),
where blocks are drawn roughly to scale and links represent interblock communication. First, the embedded
memory (SDRAM) is much larger than all the other cores, and it is a critical communication bottleneck.
Block sizes are highly nonuniform, and the floorplan does not match the regular, tile-based floorplan
shown in Figure 21.9(a). Second, the total communication bandwidth to/from the embedded SDRAM
is much larger than that required for communication among the other cores. Third, many neighboring
blocks do not need to communicate. Even though it may be possible to map MPEG4 onto a homogeneous
fabric, there is a significant risk of either underutilizing many tiles and links or, at the opposite
extreme, of achieving poor performance because of localized congestion. These factors motivate the use
of an application-specific on-chip network [26].
With an application-specific network, the designer is faced with the additional task of designing network
components (e.g., switches) with different configurations (e.g., different I/Os, virtual channels, buffers)
and of interconnecting them with links of uneven length. These steps require significant design time, and
the network components and their communications must be verified for every design.
The library-based nature of network building blocks seems the most appropriate solution to support
domain-specific custom NoCs. Two relevant examples have been reported in the open literature: the Proteo
and Xpipes interconnects. Proteo consists of a fully reusable and scalable component library where the
FIGURE 21.9 Homogeneous versus heterogeneous architectural template: (a) tile-based on-chip multiprocessor; (b) MPEG4 SoC.
components can be used to implement networks ranging from very simple bus-emulation structures to complex
packet networks. It uses a standardized VCI interface between the functional cores and the communication
network. Proteo is described in synthesizable VHDL and relies on an interconnect node architecture
that targets flexible on-chip communication. It is used as a testing platform when the efficiency of network
topologies and routing schemes is investigated for on-chip environments. The node is constructed from
a collection of parameterized and reusable hardware blocks, including components such as FIFO (first in,
first out) buffers, routing controllers, and standardized interface wrappers. A node can be tuned to fulfill
the desired communication characteristics by properly selecting the internal architecture of the node
itself.
The Xpipes NoC takes a similar approach. As described throughout this chapter, its network building blocks
have been designed as highly configurable, design-time composable soft macros described in SystemC
at the cycle-accurate level.
An optimal system solution also requires an efficient mapping of high-level abstractions onto
the underlying platform. This mapping procedure involves optimizations and trade-offs among many
complex constraints, including quality of service, real-time response, power consumption, and area. Tools
are urgently needed to explore this mapping process, and to assist and automate optimization where possible.
The first challenge for these tools is to bridge the gap in building custom NoCs that optimally match
the communication requirements of the system. The network components they build should be highly
optimized for that particular NoC design, providing large savings in area, power, and latency with respect
to standard NoCs based on regular structures.
In the following section, an example of a design methodology for heterogeneous SoCs is briefly illustrated.
It concerns the Xpipes interconnect and relies on a tool, called XpipesCompiler [36], that automatically
instantiates an application-specific NoC for heterogeneous on-chip multiprocessors.
21.5.1 Domain-Specific NoC Synthesis Flow
The complete XpipesCompiler-based NoC design flow is depicted in Figure 21.10. From the specification
of an application, the designer (or a high-level analysis and exploration tool) creates a high-level view of
the SoC floorplan, including nodes (with their network interfaces), links, and switches. Based on the clock
FIGURE 21.10 NoC synthesis flow with XpipesCompiler.
FIGURE 21.11 Core graph representation of an example MPEG4 design with annotated average communication requirements.
speed target and link routing, the number of pipeline stages for each link is also specified. The information
on the network architecture is specified in an input file for the XpipesCompiler. Routing tables for the
network interfaces are also specified. The tool takes as additional input the SystemC library of soft network
components. The output is a SystemC hierarchical description, which includes all switches, links, network
nodes, and interfaces, and specifies their topological connectivity. The final description can then be
compiled and simulated at the cycle- and signal-accurate level. At this point, the description can
be fed to back-end register transfer level (RTL) synthesis tools for silicon implementation.
In a nutshell, the XpipesCompiler generates a set of network component instances that are custom-
tailored to the specification contained in its input network description file. This tool allows a very
instructive comparison of the effects (in terms of area, power, and performance) of mapping applications
on customized domain-specific NoCs and on regular mesh NoCs.
Let us focus on the MPEG4 decoder already introduced in this chapter. Its core graph representation,
together with its communication requirements, is reported in Figure 21.11. The edges are annotated
with the average bandwidth requirements of the cores in MB/sec. Customized application-specific NoCs
that closely match the application's communication characteristics have been manually developed and
compared to a regular mesh topology. The different NoC configurations are reported in Figure 21.12.
In the MPEG4 design considered, many of the cores communicate with each other through the shared
SDRAM, so a large switch is used for connecting the SDRAM with the other cores (Figure 21.12[b]), while
smaller switches are used for the other cores. An alternative custom NoC is also considered (Figure 21.12[c]): it
is an optimized mesh network, with superfluous switches and switch I/Os removed.
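A bandwidth-annotated core graph of this kind is easy to represent directly. The sketch below uses made-up edge weights in MB/sec (not the values of Figure 21.11) and computes the aggregate traffic per core, which is the quantity that singles out the SDRAM as the place for a single large switch:

```cpp
#include <map>
#include <string>
#include <vector>

// Hypothetical core-graph edge: average bandwidth demand between two cores,
// in MB/sec, as in Figure 21.11 (the concrete values below are invented).
struct Edge {
    std::string src, dst;
    double mbps;
};

// Sum the traffic entering and leaving each core. Cores with the largest
// totals are candidates for attachment to a large switch.
std::map<std::string, double> traffic_per_core(const std::vector<Edge>& graph) {
    std::map<std::string, double> total;
    for (const Edge& e : graph) {
        total[e.src] += e.mbps;
        total[e.dst] += e.mbps;
    }
    return total;
}
```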
Area (in 0.1 μm technology) and power estimates for the different NoC configurations are reported in
Table 21.1. Since all cores communicate with many other cores, many switches are needed, and therefore
the area savings of the custom NoCs are not extremely significant.
Based on the average traffic through each network component, the power dissipation for each NoC
design has been calculated. Power savings for the custom solutions are not very significant either, as most of the
traffic traverses the larger switches connected to the memories. As power dissipation in a switch increases
nonlinearly with switch size, there is more power dissipation in the switches of custom NoC1
(which has an 8 × 8 switch) than in the mesh NoC. However, most of the traffic traverses short links in this
custom NoC, thereby giving marginal power savings for the whole design.
FIGURE 21.12 NoC configurations for MPEG4 decoder: (a) mesh NoC, (b) application-specific NoC1, and (c) application-specific NoC2.
TABLE 21.1 Area and Power Estimates for the MPEG4-Related NoC Configurations

NoC configuration   Area (mm²)   Ratio mesh/cust   Power (mW)   Ratio mesh/cust
Mesh                1.31         --                114.36       --
Custom 1            0.86         1.52              110.66       1.03
Custom 2            0.71         1.85              93.66        1.22
Figure 21.13 reports the variation of the average packet latency (for 64-byte packets and 32-bit flits) with the link
bandwidth. Custom NoCs, as synthesized by the XpipesCompiler, have lower packet latencies, as the average
number of switch and link traversals is lower. At the minimum plotted bandwidth value, almost a 10%
latency saving is achieved. Moreover, the latency increases more rapidly with the mesh NoC as the link
bandwidth decreases. Custom NoCs also have better link utilization: around 1.5 times the link utilization
of a mesh topology.
Area, power, and performance optimizations by means of custom NoCs turn out to be more difficult
for MPEG4 than for other applications, such as Video Object Plane Decoders and MultiWindow
Displayers [36].
21.6 Conclusions
This chapter has described the motivation for packet-switched networks as a communication paradigm for
deep submicron SoCs. After an overview of NoC proposals from the open literature, the chapter has gone
into the details of the NoC architectural components (switch, network interface, and point-to-point links),
illustrating the Xpipes library of composable soft macros as a case study. Finally, the challenging issue of
heterogeneous NoC design has been addressed, showing an example NoC synthesis flow and detailing
area, power, and performance metrics of customized application-specific NoC architectures with respect
to regular mesh topologies. The chapter has aimed at highlighting the main guidelines and open issues for NoC
design on gigascale SoCs.
FIGURE 21.13 Average packet latency as a function of the link bandwidth.
Acknowledgment
This work was supported in part by MARCO/DARPA Gigascale Silicon Research Center.
References
[1] F. Boekhorst. Ambient Intelligence, the Next Paradigm for Consumer Electronics: How will it Affect Silicon? In ISSCC 2002, Vol. 1, February 2002, pp. 28–31.
[2] L. Benini and G. De Micheli. Networks on Chips: a New SoC Paradigm. IEEE Computer, 35, 2002, 70–78.
[3] P. Wielage and K. Goossens. Networks on Silicon: Blessing or Nightmare? In Proceedings of the Euromicro Symposium on Digital System Design DSD'02, September 2002, pp. 196–200.
[4] W.J. Dally and B. Towles. Route Packets, not Wires: On-Chip Interconnection Networks. In Proceedings of the Design Automation Conference DAC'01, June 2001, pp. 684–689.
[5] S. Kumar, A. Jantsch, J.P. Soininen, M. Forsell, M. Millberg, J. Oeberg, K. Tiensyrja, and A. Hemani. A Network on Chip Architecture and Design Methodology. In IEEE Symposium on VLSI ISVLSI'02, April 2002, pp. 105–112.
[6] L. Benini, D. Bertozzi, D. Bruni, N. Drago, F. Fummi, and M. Poncino. SystemC Cosimulation and Emulation of Multiprocessor SoC Designs. IEEE Computer, 36, 2003, 53–59.
[7] S. Nugent, D.S. Wills, and J.D. Meindl. A Hierarchical Block-Based Modeling Methodology for SOC in GENESYS. In Proceedings of the IEEE ASIC/SOC Conference, September 2002, pp. 239–243.
[8] P. Gerin, S. Yoo, G. Nicolescu, and A.A. Jerraya. Scalable and Flexible Cosimulation of SoC Designs with Heterogeneous Multi-Processor Target Architecture. In Proceedings of the ASP-DAC 2001, January/February 2001, pp. 63–68.
[9] H. Blume, H. Huebert, H.T. Feldkaemper, and T.G. Noll. Model-Based Exploration of the Design Space for Heterogeneous Systems on Chip. In Proceedings of the IEEE Conference on Application-Specific Systems, Architectures and Processors ASAP'02, 2002.
[10] P.G. Paulin, C. Pilkington, and E. Bensoudane. StepNP: a System-Level Exploration Platform for Network Processors. IEEE Design and Test of Computers, November–December 2002, pp. 17–26.
[11] P. Guerrier and A. Greiner. A Generic Architecture for On-Chip Packet Switched Interconnections. In Proceedings of Design, Automation and Test in Europe DATE'00, March 2000, pp. 250–256.
[12] S.J. Lee et al. An 800 MHz Star-Connected On-Chip Network for Application to Systems on a Chip. In ISSCC'03, February 2003.
[13] H. Yamauchi et al. A 0.8 W HDTV Video Processor with Simultaneous Decoding of Two MPEG2 MP@HL Streams and Capable of 30 Frames/s Reverse Playback. In ISSCC'02, Vol. 1, February 2002, pp. 473–474.
[14] M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi, and L. Benini. Xpipes: a Latency Insensitive Parameterized Network-on-Chip Architecture for Multi-Processor SoCs. In ICCD'03, October 2003.
[15] ITRS. 2001, http://public.itrs.net/Files/2001ITRS/Home.htm.
[16] D. Bertozzi, L. Benini, and G. De Micheli. Energy-Reliability Trade-Off for NoCs. In Networks on Chip, A. Jantsch and H. Tenhunen, Eds., Kluwer Academic Press, Boston, MA, 2003, pp. 107–129.
[17] H. Zhang, V. George, and J.M. Rabaey. Low-Swing On-Chip Signaling Techniques: Effectiveness and Robustness. IEEE Transactions on VLSI Systems, 8, 2000, 264–272.
[18] J. Xu and W. Wolf. Wave Pipelining for Application-Specific Networks-on-Chips. In CASES'02, October 2002, pp. 198–201.
[19] K. Goossens, J. Dielissen, J. van Meerbergen, P. Poplavko, A. Radulescu, E. Rijpkema, E. Waterlander, and P. Wielage. Guaranteeing the Quality of Services in Networks on Chip. In Networks on Chip, A. Jantsch and H. Tenhunen, Eds., Kluwer Academic Press, Boston, MA, 2003, pp. 61–82.
[20] ITRS, 1999, http://public.itrs.net/files/1999_SIA_Roadmap/.
[21] A. Jantsch and H. Tenhunen. Will Networks on Chip Close the Productivity Gap? In Networks on Chip, A. Jantsch and H. Tenhunen, Eds., Kluwer Academic Press, Boston, MA, 2003, pp. 3–18.
[22] VSI Alliance. Virtual Component Interface Standard, 2000.
[23] OCP International Partnership. Open Core Protocol Specification, 2001.
[24] D. Wingard. MicroNetwork-Based Integration for SoCs. In Design Automation Conference DAC'01, June 2001, pp. 673–677.
[25] D. Flynn. AMBA: Enabling Reusable On-Chip Designs. IEEE Micro, 17, 1997, 20–27.
[26] H. Zhang et al. A 1V Heterogeneous Reconfigurable DSP IC for Wireless Baseband Digital Signal Processing. IEEE Journal of Solid-State Circuits, 35, 2000, 1697–1704.
[27] W.J. Dally and S. Lacy. VLSI Architecture: Past, Present and Future. In Conference on Advanced Research in VLSI, 1999, pp. 232–241.
[28] D. Culler, J.P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, San Francisco, CA, 1999.
[29] K. Compton and S. Hauck. Reconfigurable Computing: a Survey of Systems and Software. ACM Computing Surveys, 34, 2002, 171–210.
[30] R. Tessier and W. Burleson. Reconfigurable Computing and Digital Signal Processing: a Survey. Journal of VLSI Signal Processing, 28, 2001, 7–27.
[31] J. Walrand and P. Varaiya. High Performance Communication Networks. Morgan Kaufmann, San Francisco, CA, 2000.
[32] D. Liu et al. SoCBUS: The Solution of High Communication Bandwidth on Chip and Short TTM. Invited paper in Real Time and Embedded Computing Conference, September 2002.
[33] S. Ishiwata et al. A Single Chip MPEG-2 Codec Based on Customizable Media Embedded Processor. IEEE Journal of Solid-State Circuits, 38, 2003, 530–540.
[34] E. Rijpkema, K. Goossens, A. Radulescu, J. van Meerbergen, P. Wielage, and E. Waterlander. Trade-Offs in the Design of a Router with both Guaranteed and Best-Effort Services for Networks on Chip. In Design, Automation and Test in Europe DATE'03, March 2003, pp. 350–355.
[35] I. Saastamoinen, D. Siguenza-Tortosa, and J. Nurmi. Interconnect IP Node for Future Systems-on-Chip Designs. In IEEE Workshop on Electronic Design, Test and Applications, January 2002, pp. 116–120.
[36] A. Jalabert, S. Murali, L. Benini, and G. De Micheli. XpipesCompiler: a Tool for Instantiating Application Specific Networks-on-Chip. In DATE 2004, pp. 884–889.
[37] V. Agarwal, M.S. Hrishikesh, S.W. Keckler, and D. Burger. Clock Rate Versus IPC: The End of the Road for Conventional Microarchitectures. In Proceedings of the 27th Annual International Symposium on Computer Architecture, June 2000, pp. 248–250.
[38] L.P. Carloni, K.L. McMillan, and A.L. Sangiovanni-Vincentelli. Theory of Latency-Insensitive Design. IEEE Transactions on CAD of ICs and Systems, 20, 2001, 1059–1076.
[39] L. Scheffer. Methodologies and Tools for Pipelined On-Chip Interconnects. In International Conference on Computer Design, 2002, pp. 152–157.
[40] P. Glaskowsky. Pentium 4 (Partially) Previewed. Microprocessor Report, 14, 2000, 10–13.
[41] J. Duato, S. Yalamanchili, and L. Ni. Interconnection Networks: an Engineering Approach. IEEE Computer Society Press, Washington, 1997.
[42] E.B. Van der Tol and E.G.T. Jaspers. Mapping of MPEG4 Decoding on a Flexible Architecture Platform. In SPIE 2002, January 2002, pp. 1–13.
22
Platform-Based Design
for Embedded Systems
Luca P. Carloni,
Fernando De Bernardinis,
Claudio Pinello,
Alberto L. Sangiovanni-Vincentelli,
and Marco Sgroi
University of California at Berkeley
22.1 Introduction 22-1
22.2 Platform-Based Design 22-3
22.3 Platforms at the Articulation Points of the Design Process 22-4
     (Micro-)Architecture Platforms • API Platform • System Platform Stack
22.4 Network Platforms 22-7
     Definitions • Quality of Service • Design of Network Platforms
22.5 Fault-Tolerant Platforms 22-9
     Types of Faults and Platform Redundancy • Fault-Tolerant Design Methodology • The API Platform (FTDF Primitives) • Fault-Tolerant Deployment • Replica Determinism
22.6 Analog Platforms 22-15
     Definitions • Building Performance Models • Mixed-Signal Design Flow with Platforms • Reconfigurable Platforms
22.7 Concluding Remarks 22-22
Acknowledgments 22-24
References 22-24
22.1 Introduction
Platform-Based Design (PBD) [1,2] has emerged as an important design style as the electronics industry
faced serious difficulties owing to three major factors:
1. The disaggregation (or horizontalization) of the electronics industry began about a decade ago
and has affected the structure of the industry, favoring the move from a vertically oriented business
model to a horizontally oriented one. In the past, electronic system companies used to maintain full
control of the product development cycle, from product definition to final manufacturing. Today,
the identification of a new market opportunity, the definition of the detailed system specifications,
the development and assembly of the components, and the manufacturing of the final product
are tasks performed more and more frequently by distinct organizations. In fact, the complexity
of electronic designs and the number of technologies that must be mastered to bring winning
products to market have forced electronic companies to focus on their core competence. In this
scenario, the integration of the design chain becomes a serious problem at the hand-off points from
one company to another.
2. The pressure for reducing the time-to-market of electronic products in the presence of exponentially
increasing complexity has forced designers to adopt methods that favor component reuse at all
levels of abstraction. Furthermore, each organization that contributes a component to the final
product naturally strives for the flexibility in its design approach that allows it to make continuous
adjustments and accommodate last-minute engineering changes.
3. The dramatic increase in Non-Recurring Engineering (NRE) costs owing to mask making at the
Integrated Circuit (IC) implementation level (a set of masks for the 90 nm technology node costs
more than two million US dollars), development of production plants (a new fab costs more
than two billion US dollars), and design cost (a new-generation microprocessor design requires
more than 500 designers, with all the associated costs in tools and infrastructure!) has created,
on the one hand, the necessity of correct-the-first-time designs and, on the other hand, the push for
consolidation of efforts in manufacturing.¹
The combination of these factors has caused several system companies to substantially reduce their
ASIC (Application-Specific Integrated Circuit) design efforts. Traditional paradigms in electronic system
and IC design have to be revisited and readjusted, or altogether abandoned. Along the same line of
reasoning, IC manufacturers are moving toward the development of parts that have guaranteed high-
volume production from a single mask set (or that are likely to have high-volume production, if successful),
thus moving differentiation and optimization to reconfigurability and programmability.
Platform-Based Design has emerged over the years as a way of coping with the problems listed earlier.
The term "platform" has been used in several domains: from service providers to system companies, from
tier-one suppliers to IC companies. In particular, IC companies have lately been very active in espousing
platforms. The TI OMAP platform for cellular phones, the Philips Viper and Nexperia platforms for
consumer electronics, and the Intel Centrino platform for laptops are a few examples. Recently, Intel has been
characterized by its CEO Otellini as a "platform company."
As is often the case for fairly radical new approaches, the methodology emerged as a sequence of
empirical rules and concepts, but we have reached a point where a rigorous design process is needed,
together with supporting EDA environments and tools. PBD:
• Sets the foundation for developing economically feasible design flows, because it is a structured
methodology that theoretically limits the space of exploration, yet still achieves superior results within the
fixed time constraints of the design.
• Provides a formal mechanism for identifying the most critical hand-off points in the design chain.
The hand-off point between system companies and IC design companies, and the one between
IC design companies (or divisions) and IC manufacturing companies (or divisions), represent the
articulation points of the overall design process.
• Eliminates expensive design iterations, because it fosters design reuse at all abstraction levels, thus
enabling the design of an electronic product by assembling and configuring platform components
in a rapid and reliable fashion.
• Provides an intellectual framework for the complete electronic design process.
This chapter presents the foundations of this discipline and outlines a variety of domains where the
PBD principles can be applied. In particular, Section 22.2 defines the main principles of PBD. Our goal
is to provide a precise reference that may be used as the basis for reaching a common understanding
in the electronic system and circuit design community. Then, we present the platforms that define the
articulation points between system definition and implementation (Section 22.3).

¹The cost of fabs has changed the landscape of IC manufacturing in a substantial way, forcing companies to team up
for developing new technology nodes (see, e.g., the recent agreement among Motorola, Philips, and STMicroelectronics,
and the creation of Renesas in Japan).

In the following sections
we show that the PBD paradigm can be applied to all levels of design: from very high levels of abstraction,
such as communication networks (Section 22.4) and fault-tolerant platforms for the design of safety-
critical feedback-control systems (Section 22.5), to low levels, such as analog parts (Section 22.6), where
performance is the main focus.
22.2 Platform-Based Design
The basic tenets of PBD are:
• The identification of design as a meeting-in-the-middle process, where successive refinements of
specifications meet with abstractions of potential implementations.
• The identification of precisely defined layers where the refinement and abstraction processes take
place. Each layer supports a design stage that provides an opaque abstraction of the lower layers and
allows accurate performance estimations. This information is incorporated in appropriate parameters
that annotate design choices at the present layer of abstraction. These layers of abstraction are
called platforms, to stress their role in the design process and their solidity.
A platform is a library of components that can be assembled to generate a design at that level of abstraction.
This library contains not only computational blocks that carry out the appropriate computation but also
communication components that are used to interconnect the functional components. Each element of the
library has a characterization in terms of performance parameters, together with the functionality it can
support. For every platform level, there is a set of methods used to map the upper layers of abstraction into
the platform, and a set of methods used to estimate the performance of lower-level abstractions. As illustrated
in Figure 22.1, the meeting-in-the-middle process is the combination of two efforts:
• Top-down: map an instance of the top platform into an instance of the lower platform and propagate
constraints.
• Bottom-up: build a platform by defining the library that characterizes it and a performance abstraction
(e.g., number of literals for technology-independent optimization; area and propagation delay
for a cell in a standard cell library).
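A toy rendition of these two efforts, with invented components and numbers: the library plays the role of the platform, the performance abstraction is an (area, delay) pair per component, and top-down mapping picks the smallest instance that meets a propagated delay constraint.

```cpp
#include <string>
#include <vector>

// Illustrative platform library element: a component annotated with the
// performance parameters of its characterization (area, propagation delay).
struct Component {
    std::string name;
    double area;
    double delay;
};

// Top-down mapping under a propagated constraint: among the library elements
// that meet the delay constraint, choose the one of minimum area.
// Returns nullptr if no platform instance satisfies the constraint.
const Component* map_to_platform(const std::vector<Component>& library,
                                 double delay_constraint) {
    const Component* best = nullptr;
    for (const Component& c : library)
        if (c.delay <= delay_constraint &&       // constraint propagation
            (!best || c.area < best->area))      // optimization at this layer
            best = &c;
    return best;
}
```

A nullptr result is the bottom-up feedback of the methodology: the performance abstraction reports that the current platform cannot meet the constraints, so the refinement must be revisited.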
A platform instance is a set of architecture components that are selected from the library and whose
parameters are set. Often the combination of two consecutive layers and their "filling" can be interpreted
as a unique abstraction layer with an upper view, the top abstraction layer, and a lower view, the
bottom layer. A platform stack is a pair of platforms, along with the tools and methods that are used to
map the upper layer of abstraction onto the lower layer. Note that we can allow a platform stack to include
several sub-stacks if we wish to span a large number of abstractions.
FIGURE 22.1 Interactions between abstraction layers.
Platforms should be defined to eliminate large loop iterations for affordable designs: they should restrict
the design space via new forms of regularity and structure that surrender some design potential for lower
cost and first-pass success. The library of function and communication components is the design space
that we can explore at the appropriate level of abstraction.
Establishing the number, location, and components of intermediate platforms is the essence of PBD.
In fact, designs with different requirements and specifications may use different intermediate platforms,
hence different layers of regularity and design-space constraints. A critical step of the PBD process
is the definition of intermediate platforms to support predictability, which enables the abstraction of
implementation detail to facilitate higher-level optimization, and verifiability, that is, the ability to formally
ensure correctness.
The trade-offs involved in the selection of the number and characteristics of platforms relate to the size of
the design space to be explored and to the accuracy of the estimation of the characteristics of the solution
adopted. Naturally, the larger the step across platforms, the more difficult it is to predict performance,
to optimize at the higher levels of abstraction, and to provide a tight lower bound. In fact, the design space
for this approach may actually be smaller than the one obtained with smaller steps because it becomes
harder to explore meaningful design alternatives and the restriction on search impedes complete design-space
exploration. Ultimately, predictions/abstractions may be so inaccurate that design optimizations
are misguided and the lower bounds are incorrect.
It is important to emphasize that the PBD paradigm applies to all levels of design. While it is rather easy
to grasp the notion of a programmable hardware platform, the concept is completely general and should
be exploited through the entire design flow to solve the design problem. In the following sections, we will
show that platforms can be applied to low levels of abstraction such as analog components, where flexibility
is minimal and performance is the main focus, as well as to very high levels of abstraction such as networks,
where platforms have to provide connectivity and services. In the former case platforms abstract hardware
to provide (physical) implementation, while in the latter communication services abstract software layers
(protocols) to provide global connectivity.
22.3 Platforms at the Articulation Points of the Design Process
As we mentioned in Section 22.2, the key to the application of the design principle is the careful definition of
the platform layers. Platforms can be defined at several points of the design process. Some levels of abstraction
are more important than others in the overall design trade-off space. In particular, the articulation
point between system definition and implementation is a critical one for design quality and time. Indeed,
the very notion of PBD originated at this point (see [1,3-5]). In References 1, 2, and 5, we have discovered
that at this level there are indeed two distinct platforms forming a system platform stack. These need to be
defined together with the methods and the tools necessary to link them: a (micro-)architecture platform
and an Application Programming Interface (API) platform. The API platform allows system designers to
use the services that a (micro-)architecture offers. In the world of Personal Computers (PCs), this concept
is well known and is the key to the development of application software on different hardware that shares
some commonality, allowing the definition of a unique API.
22.3.1 (Micro-)Architecture Platforms
Integrated circuits used for embedded systems will most likely be developed as an instance of a particular
(micro-)architecture platform. That is, rather than being assembled from a collection of independently
developed blocks of silicon functionalities, they will be derived from a specific family of micro-architectures,
possibly oriented toward a particular class of problems, that can be extended or reduced by the system
developer. The elements of this family are a sort of hardware denominator that could be shared across
multiple applications. Hence, an architecture platform is a family of micro-architectures that share some
commonality, the library of components that are used to define the micro-architecture. Every element
of the family can be obtained quickly through the personalization of an appropriate set of parameters
controlling the micro-architecture. Often, the family may have additional constraints on the components
of the library that can or should be used. For example, a particular micro-architecture platform may
be characterized by the same programmable processor and the same interconnection scheme, while
the peripherals and the memories of a specific implementation may be selected from the predesigned
library of components depending on the given application. Depending on the implementation platform
that is chosen, each element of the family may still need to go through the standard manufacturing
process, including mask making. This approach thus combines the need to save design time with the
optimization of each element of the family for the application at hand. Although it does not solve the mask
cost issue directly, it should be noted that the mask cost problem is primarily owing to the generation of
multiple mask sets for multiple design spins, which is addressed by the architecture platform methodology.
The less constrained the platform, the more freedom a designer has in selecting an instance and the
more potential there is for optimization, if time permits. However, more constraints mean stronger
standards and easier addition of components to the library that defines the architecture platform (as with
PC platforms). Note that the basic concept is similar to the cell-based design layout style, where regularity
and the reuse of library elements allow faster design time at the expense of some optimality. The trade-off
between design time and design quality needs to be kept in mind. The economics of the design
problem must dictate the choice of the design style. The higher the granularity of the library, the more
leverage we have in shortening the design time. Given that the elements of the library are reused, there is a
strong incentive to optimize them. In fact, we argue that the macro-cells should be designed with great
care and attention given to area and performance. It also makes sense to offer a variation of cells with
the same functionality but with implementations that differ in performance, area, and power dissipation.
Architecture platforms are, in general, characterized by (but not limited to) the presence of programmable
components. Then, each of the platform instances that can be derived from the architecture platform
maintains enough flexibility to support an application space that guarantees the production volumes
required for economically viable manufacturing.
The library that defines the architecture platform may also contain reconfigurable components, which
come in two flavors. With runtime reconfigurability, FPGA (Field-Programmable Gate Array) blocks
can be customized by the user without the need of changing the mask set, thus saving both design cost and
fabrication cost. With design-time reconfigurability, the silicon is still application specific, and only
design time is reduced.
An architecture platform instance is derived from an architecture platform by choosing a set of components
from its library and by setting the parameters of the reconfigurable components of the library. The flexibility,
or the capability of supporting different applications, of a platform instance is guaranteed by programmable
components. Programmability will ultimately be of various forms. One is software programmability,
to indicate the presence of a microprocessor, Digital Signal Processor (DSP), or any other software-programmable
component. Another is hardware programmability, to indicate the presence of reconfigurable
logic blocks such as FPGAs, whereby the logic function can be changed by software tools without requiring
a custom set of masks. Some of the new architecture and/or implementation platforms being offered in
the market mix the two types into a single chip. For example, Triscend, Altera, and Xilinx are offering
FPGA fabrics with embedded hard processors. Software programmability yields a more flexible solution,
since modifying software is, in general, faster and cheaper than modifying FPGA personalities. On the
other hand, logic functions mapped on FPGAs execute orders of magnitude faster and with much less
power than the corresponding implementation as a software program. Thus, the trade-off here is between
flexibility and performance.
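The derivation of a platform instance described above can be sketched as follows. The component names, parameter ranges, and base configuration are all invented for illustration; the point is only that a family member is obtained by selecting library components and setting the parameters of the reconfigurable ones, subject to the constraints the family imposes:

```python
# Hypothetical architecture platform: a fixed base shared by the whole
# family, plus a library of optional components, some reconfigurable.
BASE = {"processor": "risc32", "interconnect": "shared-bus"}
LIBRARY = {
    "uart": {"reconfigurable": False},
    "fpga_fabric": {"reconfigurable": True, "params": {"luts": (1000, 8000)}},
    "sram": {"reconfigurable": True, "params": {"kib": (32, 512)}},
}

def derive_instance(choices, params):
    """Build one member of the micro-architecture family."""
    inst = dict(BASE)
    for name in choices:
        blk = LIBRARY[name]
        inst[name] = {}
        if blk["reconfigurable"]:
            for p, (lo, hi) in blk["params"].items():
                v = params[name][p]
                if not lo <= v <= hi:  # constraint imposed by the family
                    raise ValueError(f"{name}.{p}={v} outside [{lo}, {hi}]")
                inst[name][p] = v
    return inst

inst = derive_instance(["uart", "sram"], {"sram": {"kib": 128}})
print(inst["sram"])  # {'kib': 128}
```

Requesting a parameter outside the allowed range raises an error, mirroring the constraints a platform places on its instances.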
22.3.2 API Platform
The concept of architecture platform by itself is not enough to achieve the level of application software
reuse we require. The architecture platform has to be abstracted at a level where the application
software sees a high-level interface with the hardware that we call the API or Programmer's Model. A software
layer is used to perform this abstraction. This layer wraps the essential parts of the architecture
platform:
The programmable cores and the memory subsystem, via a Real-Time Operating System (RTOS).
The I/O subsystem, via the device drivers.
The network connection, via the network communication subsystem.
In our framework, the API is a unique abstract representation of the architecture platform via the software
layer. Therefore, the application software can be reused for every platform instance. Indeed, the API is a
platform itself that we can call the API platform. Of course, the higher the abstraction level at which a
platform is defined, the more instances it contains. For example, to share the source code, we need to have
the same operating system but not necessarily the same instruction set, while to share the binary code, we
need to add the architectural constraints that force us to use the same ISA (Instruction Set Architecture),
thus greatly restricting the range of architectural choices.
The RTOS is responsible for the scheduling of the available computing resources and of the
communication between them and the memory subsystem. Note that, in several embedded system
applications, the available computing resources consist of a single microprocessor. In others, such as
wireless handsets, the combination of a Reduced Instruction Set Computer (RISC) microprocessor or
controller and DSP has been used widely in 2G, and now for 2.5G and 3G, and beyond. In set-top boxes,
a RISC for control and a media processor have also been used. In general, we can imagine a multiple core
architecture platform where the RTOS schedules software processes across different computing engines.
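The reuse enabled by the API platform can be sketched in a few lines. The interface and the control law below are toy inventions, not the chapter's tooling; they show only the structural idea that application software is written against an abstract service layer and runs unchanged on any platform instance implementing it:

```python
# Sketch of an API platform: application code depends only on the
# abstract services, never on a concrete platform instance.
from abc import ABC, abstractmethod

class PlatformAPI(ABC):
    """Abstract services wrapped by the software layer
    (RTOS scheduling, device drivers, network stack)."""
    @abstractmethod
    def read_sensor(self) -> float: ...
    @abstractmethod
    def actuate(self, value: float) -> None: ...

class SingleCPUInstance(PlatformAPI):
    """One (hypothetical) platform instance implementing the API."""
    def __init__(self):
        self.log = []
    def read_sensor(self) -> float:
        return 1.0
    def actuate(self, value: float) -> None:
        self.log.append(value)

def control_app(api: PlatformAPI):
    """Application software written only against the API platform."""
    x = api.read_sensor()
    api.actuate(-0.5 * x)  # toy control law

hw = SingleCPUInstance()
control_app(hw)  # the same source runs on every instance of the API
print(hw.log)    # [-0.5]
```

A multi-core instance with an RTOS scheduling processes across engines would implement the same interface, leaving `control_app` untouched.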
22.3.3 System Platform Stack
The basic idea of the system platform stack is captured in Figure 22.2. The vertex of the two cones represents
the combination of the API and the architecture platform. A system designer maps an application
onto the abstract representation that includes a family of architectures that can be chosen to optimize
cost, efficiency, energy consumption, and flexibility. The mapping of the application onto the actual
architecture in the family specified by the API can be carried out, at least in part, automatically if a set of
appropriate software tools (e.g., software synthesis, RTOS synthesis, device-driver synthesis) is available.
It is clear that the synthesis tools have to be aware of the architecture features as well as of the API. This
set of tools makes use of the software layer to go from the API platform to the architecture platform. Note
that the system platform effectively decouples the application development process (the upper triangle)
from the architecture implementation process (the lower triangle). Note also that, once we use the abstract
definition of the API as described earlier, we may obtain extreme cases such as traditional PC platforms on
[Figure: two cones meeting at the system platform vertex; application instances in the application space are mapped down (platform mapping), while the architectural space below exports platform instances upward (platform design-space export).]
FIGURE 22.2 System platform stack.
one side and full hardware implementation on the other. Of course, the programmer's model for a full
custom hardware solution is trivial since there is a one-to-one map between the functions to be implemented
and the physical blocks that implement them. In the latter case, PBD amounts to adding some higher
levels of abstraction to traditional design methodologies.
22.4 Network Platforms
In distributed systems, the design of the protocols and channels that support the communication among
the system components is a difficult task owing to the tight constraints on performance and cost. To make
the communication design problem more manageable, designers usually decompose the communication
function into distinct protocol layers and design each layer separately. According to this approach, of
which the Open Systems Interconnection (OSI) Reference Model is a particular instance, each protocol
layer together with the lower layers defines a platform that provides communication services (CSs) to the
upper layers and to the application-level components. Identifying the most effective layered architecture
for a given application requires one to solve a trade-off between performance, which increases by minimizing
the number of layers, and design manageability, which improves with the number of intermediate steps. Present
embedded system applications, owing to their tight constraints, increasingly demand the codesign of
protocol functions that in less-constrained applications are assigned to different layers and considered
separately (e.g., cross-layer design of MAC and routing protocols in sensor networks). The
definition of an optimal layered architecture, the design of the correct functionality for each protocol
layer, and the design-space exploration for the choice of the physical implementation must be supported
by tools and methodologies that allow designers to evaluate performance and guarantee the satisfaction of the
constraints after each step. For these reasons, we believe that the PBD principles and methodology provide
the right framework to design communication networks. In this section, first, we formalize the concept of
a Network Platform (NP). Then, we outline a methodology for selecting, composing, and refining NPs [6].
22.4.1 Definitions
A Network Platform is a library of resources that can be selected and composed together to form a Network
Platform Instance (NPI) and support the interaction among a group of interacting components.
The structure of a NPI is defined by abstracting computation resources as nodes and communication
resources as links. Ports interface nodes with links or with the environment of the NPI. The structure of
a node or a link is defined by its input and output ports; the structure of a NPI is defined by a set of nodes
and the links connecting them.
The behaviors and the performance of a NPI are defined in terms of the type and the quality of the
CSs it offers. We formalize the behaviors of a NPI using the Tagged Signal Model [7]. NPI components
are modeled as processes, and events model the instances of the send and receive actions of the processes.
An event is associated with a message that has a type and a value, and with tags that specify attributes of
the corresponding action instance (e.g., when it occurs in time). The set of behaviors of a NPI is defined
by the intersection of the behaviors of the component processes.
A NPI is defined as a tuple, NPI = (L, N, P, S), where:
L = {L_1, L_2, ..., L_Nl} is a set of directed links.
N = {N_1, N_2, ..., N_Nn} is a set of nodes.
P = {P_1, P_2, ..., P_Np} is a set of ports. A port P_i is a triple (N_i, L_i, d), where N_i ∈ N is a node,
L_i ∈ L ∪ {Env} is a link or the NPI environment, and d = in if it is an input port, d = out
if it is an output port. The ports that interface the NPI with the environment define the sets
P_in = {(N_i, Env, in)} ⊆ P and P_out = {(N_i, Env, out)} ⊆ P.
S = ∩_{i=1..Nn+Nl} R_i is the set of behaviors, where R_i indicates the set of behaviors of a resource that
can be a link in L or a node in N.
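The structural part of this definition is straightforward to encode. As a sketch (the node and link names are invented, and the environment is represented by a sentinel value), the nodes, directed links, and port triples can be modeled as:

```python
# Sketch of the NPI structure: nodes, directed links, and ports as
# triples (node, link-or-environment, direction).
from dataclasses import dataclass
from typing import Optional

ENV = None  # stands for the NPI environment

@dataclass(frozen=True)
class Port:
    node: str
    link: Optional[str]  # a link name, or ENV for the environment
    direction: str       # "in" or "out"

@dataclass
class NPI:
    nodes: set
    links: dict          # link name -> (source node, destination node)
    ports: set

    def boundary_ports(self, direction):
        """P_in / P_out: ports interfacing the NPI with its environment."""
        return {p for p in self.ports
                if p.link is ENV and p.direction == direction}

npi = NPI(
    nodes={"N1", "N2"},
    links={"L1": ("N1", "N2")},
    ports={Port("N1", ENV, "in"), Port("N1", "L1", "out"),
           Port("N2", "L1", "in"), Port("N2", ENV, "out")},
)
print(len(npi.boundary_ports("in")), len(npi.boundary_ports("out")))  # 1 1
```

The behavior set S is omitted here; in the formalization it would be the intersection of the behaviors of the node and link processes.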
The basic services provided by a NPI are called Communication Services (CSs). A CS consists of a sequence
of message exchanges through the NPI from its input to its output ports. A CS can be accessed by NPI
users through the invocation of send and receive primitives whose instances are modeled as events. A NPI
API consists of the set of methods that are invoked by the NPI users to access the CSs. For the definition
of a NPI API it is essential to specify not only the service primitives but also the type of CS they provide
access to (e.g., reliable send, out-of-order delivery, etc.). Formally, a CS is a tuple (P'_in, P'_out, M, E, h, g, <_t),
where P'_in ⊆ P_in is a nonempty set of NPI input ports, P'_out ⊆ P_out is a nonempty set of NPI output ports,
M is a nonempty set of messages, E is a nonempty set of events, h is a mapping h : E → (P'_in ∪ P'_out) that
associates each event with a port, g is a mapping g : E → M associating each event with a message, and <_t is
a total order on the events in E.
A CS is defined in terms of: the number of ports, which determines, for example, whether it is a unicast, multicast,
or broadcast CS; the set M of messages representing the exchanged information; and the set E including the
events that are associated with the messages in M and that model the instances of the send and receive method
invocations. The CS concept is useful to express the correlation among events, and to make explicit, for example,
whether two events are from the same source or are associated with the same message.
22.4.2 Quality of Service
NPIs can be classified according to the number, the type, the quality, and the cost of the CSs they offer.
Rather than in terms of event sequences, a CS is more conveniently described using Quality of Service
(QoS) parameters such as error rate, latency, throughput, and jitter, and cost parameters such as the consumed
power and manufacturing cost of the NPI components. QoS parameters can be simply defined by using
annotation functions that associate individual events with quantities, such as the time when an event
occurs and the power consumed by an action. Hence, one can compare the values of pairs of input and
output events associated with the same message to quantify the error rate, or compare the timestamps of
events observed at the same port to compute the jitter. The most relevant QoS parameters are defined
using a notation where e_{i,j} ∈ E indicates an event carrying the i-th message and observed at
the j-th port, and v(e) and t(e) represent, respectively, the value of the message carried by event e and the
timestamp of the action modeled by event e.
Delay. The communication delay of a message is given by the difference between the timestamps of
the input and output events carrying that message. Assuming that the i-th message is transferred from
input port j_1 to output port j_2, the delay Δ_i of the i-th message, the average delay Δ_Av, and the peak delay
Δ_Peak are defined, respectively, as

Δ_i = t(e_{i,j_2}) − t(e_{i,j_1}),
Δ_Av = (Σ_{i=1}^{|M|} (t(e_{i,j_2}) − t(e_{i,j_1}))) / |M|,
Δ_Peak = max_i {t(e_{i,j_2}) − t(e_{i,j_1})}.
Throughput. The throughput is given by the number of output events in an interval (t_0, t_1), that is, the
cardinality of the set Θ = {e_i ∈ E | h(e_i) ∈ P'_out, t(e_i) ∈ (t_0, t_1)}.
Error rate. The Message Error Rate (MER) is given by the ratio between the number of lost or corrupted
output events and the total number of input events. Given LostM = {e_i ∈ E | h(e_i) ∈ P'_in, ∄ e_j ∈ E
s.t. h(e_j) ∈ P'_out ∧ g(e_j) = g(e_i)}, CorrM = {e_i ∈ E | h(e_i) ∈ P'_in, ∃ e_j ∈ E s.t. h(e_j) ∈ P'_out, g(e_j) =
g(e_i), v(e_j) ≠ v(e_i)}, and InM = {e_i ∈ E | h(e_i) ∈ P'_in}, the MER = (|LostM| + |CorrM|)/|InM|. Using
information on the message encoding, the MER can be converted to packet and bit error rates.
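The QoS parameters defined above are easy to compute from an event trace. In this sketch an event is a tuple (message id, port, value, timestamp), and ports whose names start with "in"/"out" stand for the input and output port sets (the trace and naming convention are illustrative):

```python
# Compute delay, average/peak delay, and MER from a toy event trace.
def delays(events):
    """Per-message delay: t(output event) - t(input event)."""
    t_in, t_out = {}, {}
    for msg, port, _val, t in events:
        (t_in if port.startswith("in") else t_out)[msg] = t
    return {m: t_out[m] - t_in[m] for m in t_out if m in t_in}

def peak_and_average_delay(events):
    d = list(delays(events).values())
    return max(d), sum(d) / len(d)

def message_error_rate(events):
    """MER = (|LostM| + |CorrM|) / |InM|."""
    vin = {m: v for m, p, v, _ in events if p.startswith("in")}
    vout = {m: v for m, p, v, _ in events if p.startswith("out")}
    lost = [m for m in vin if m not in vout]                    # LostM
    corr = [m for m in vin if m in vout and vout[m] != vin[m]]  # CorrM
    return (len(lost) + len(corr)) / len(vin)

trace = [(1, "in0", "a", 0.0), (1, "out0", "a", 2.0),  # delivered
         (2, "in0", "b", 1.0), (2, "out0", "x", 3.5),  # corrupted
         (3, "in0", "c", 2.0)]                         # lost
peak, avg = peak_and_average_delay(trace)
print(peak, avg, message_error_rate(trace))  # peak 2.5, avg 2.25, MER 2/3
```

Jitter could be computed analogously by comparing timestamps of successive events at the same port.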
The number of CSs that a NPI can offer is large, so the concept of Classes of Communication Services
(CCSs) is introduced to simplify the description of a NPI. A CCS defines a new abstraction (and therefore
a platform) that groups together CSs of similar type and quality. For example, a CCS may include all the
CSs that transfer a periodic stream of messages with no errors; another CCS may include all the CSs that
transfer a stream of input messages arriving at a bursty rate with a 1% error rate. CCSs can be identified
based on the type of messages (e.g., packets, audio samples, video pixels, etc.), the input arrival pattern
(e.g., periodic, bursty, etc.), and the range of QoS parameters. For each NPI supporting multiple CSs, there
are several ways to group them into CCSs. It is the task of the NPI designer to identify the CCSs and provide
the proper abstractions to facilitate the use of the NPI.
Platform-Based Design for Embedded Systems 22-9
22.4.3 Design of Network Platforms
The design methodology for NPs derives a NPI implementation by successive refinement from the specification
of the behaviors of the interacting components and the declaration of the constraints that a NPI
implementation must satisfy. The most abstract NPI is defined by a set of end-to-end direct logical links
connecting pairs of interacting components. Communication refinement of a NPI defines at each step a
more detailed NPI' by replacing one or multiple links in the original NPI with a set of components or
NPIs. During this process, another NPI can be used as a resource to build other NPIs. A correct refinement
procedure generates a NPI' that provides CSs equivalent to those offered by the original NPI with respect
to the constraints defined at the upper level. A typical communication refinement step requires one to define
both the structure of the refined NPI', that is, its components and topology, and the behavior of these
components, that is, the protocols deployed at each node. One or more NP components (or predefined
NPIs) are selected from a library and composed to create CSs of better quality. Two types of composition
are possible. One consists of choosing a NPI and extending it with a protocol layer to create CSs at a higher
level of abstraction (vertical composition). The other is based on the concatenation of NPIs using an
intermediate component called an adapter (or gateway) that maps sequences of events between the ports
being connected (horizontal composition).
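The two composition types can be sketched as follows. The lossy link, the retransmission layer, and the port-renaming adapter are all toy stand-ins (not the chapter's components), chosen only to show how vertical composition builds a better CS on top of an NPI while horizontal composition glues two NPIs through an adapter:

```python
# Vertical composition: wrap an NPI with a protocol layer that
# improves its CS. Horizontal composition: map events across NPIs.
def unreliable_npi(msgs):
    """A lossy link (toy behavior): drops every third message."""
    return [m for i, m in enumerate(msgs) if i % 3 != 2]

def with_retransmission(npi, msgs, retries=2):
    """Vertical composition: a retransmission layer on top of `npi`
    creates a more reliable CS at a higher level of abstraction."""
    delivered = set(npi(msgs))
    for _ in range(retries):
        missing = [m for m in msgs if m not in delivered]
        delivered |= set(npi(missing))
    return [m for m in msgs if m in delivered]

def adapter(event):
    """Horizontal composition: an adapter (gateway) maps events on the
    output ports of NPI A onto the input ports of NPI B."""
    msg, port = event
    return (msg, port.replace("A.out", "B.in"))

print(with_retransmission(unreliable_npi, list(range(6))))
print(adapter((7, "A.out0")))  # (7, 'B.in0')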
22.5 Fault-Tolerant Platforms
The increasing role of embedded software in real-time feedback-control systems drives the demand for
fault-tolerant design methodologies [8]. The aerospace and automotive industries offer many examples of
systems whose failure may have unacceptable costs (financial, human, or both). Designing cost-sensitive
real-time control systems for safety-critical applications requires a careful analysis of the cost/coverage
trade-offs of fault-tolerant solutions. This further complicates the task of deploying the embedded software
that implements the control algorithms on the execution platform. The latter is often distributed
around the plant, as is typical, for instance, in automotive applications. In this section, we present a
synthesis-based design methodology that relieves the designers from the burden of specifying detailed
mechanisms for addressing the execution platform faults, while involving them in the definition of the
overall fault-tolerance strategy. Thus, they can focus on addressing plant faults within their control
algorithms, selecting the best components for the execution platform, and defining an accurate fault
model. Our approach is centered on a new model of computation, Fault-Tolerant Data Flows (FTDF),
that enables the integration of formal validation techniques.
22.5.1 Types of Faults and Platform Redundancy
In a real-time feedback-control system, like the one in Figure 22.3, the controller interacts with the plant by
means of sensors and actuators. A controller is a hardware-software system where the software algorithms
that implement the control law run on an execution platform. An execution platform is a distributed
system that is typically made of a software layer (RTOS, middleware services, . . .) and a hardware layer
(a set of processing elements, called Electronic Control Units or ECUs, connected via communication
channels such as buses, crossbars, or rings). The design of these heterogeneous reactive distributed systems
is made even more challenging by the requirement of making them resilient to faults. Technically, a fault
is the cause of an error, an error is the part of the system state that may cause a failure, and a failure is
the deviation of the system from the specification [9]. A deviation from the specification may be owing to
the designers' mistakes (bugs) or to accidents occurring while the system is operating. The latter can
be classified into two categories that are relevant for feedback-control systems: plant faults and execution
platform faults. Theoretically, all bugs can be eliminated before the system is deployed. In practice, they are
minimized by using design environments that are based on precise Models of Computation (MoCs), whose
well-defined semantics enable formal validation techniques [10-12] (e.g., synchronous languages [13]).
[Figure: the controller consists of embedded software (control law algorithms over an RTOS and middleware) running on a hardware architecture of interconnected ECUs with sensor and actuator drivers; it closes the loop with the plant through sensors and actuators.]
FIGURE 22.3 A real-time control system.
Instead, plant faults and execution platform faults must be dealt with online. Hence, they must be included
in the specification of the system to be designed.
Plant faults, including faults in sensors and actuators, must be handled at the algorithmic level using
estimation techniques and adaptive control methods. For instance, a drive-by-wire system [14,15] might
need to handle properly a tire puncture or the loss of one of the four brakes. Faults in the execution
platform affect the computation, storage, and communication elements. For instance, a loss of power may
turn off an ECU, momentarily or forever. System operation can be preserved in spite of platform faults if
alternative resources supplying the essential functionality of the faulty one are available. Hence, the process
of making the platform fault-tolerant usually involves the introduction of redundancy, with an obvious impact
on the final cost. While the replication of a bus or the choice of a faster microprocessor may not sensibly
affect the overall cost of a new airplane, their impact is quite significant for high-volume products
like the ones of the automotive industry. The analysis of the trade-offs between higher redundancy and
lower costs is a challenging hardware-software codesign task that designers of fault-tolerant systems for
cost-sensitive applications must face, in addition to the following two: (1) how to introduce redundancy,
and (2) how to deploy the redundant design on a distributed execution platform. Since these activities
are both tedious and error prone, designers often rely on off-the-shelf solutions to address fault tolerance,
such as the Time-Triggered Architecture (TTA) [16]. One of the main advantages of off-the-shelf solutions
is that the application does not need to be aware of the fault-tolerance mechanisms that are transparently
provided by the architecture to cover the execution platform faults. Instead, designers may focus their
attention on avoiding design bugs and tuning the control algorithms to address the plant faults. However,
the rigidity of off-the-shelf solutions may lead to suboptimal results from a design cost viewpoint.
22.5.2 Fault-Tolerant Design Methodology
We present an interactive design methodology that involves designers in the exploration of the
redundancy/cost trade-off [17]. To do so efficiently, we need automatic tools to bridge the different
platforms in the system platform stack. In particular, we introduce automatic synthesis techniques that
process simultaneously the algorithm specification, the characteristics of the chosen execution platform,
and the corresponding fault model. Using this methodology, the designers focus on the control algorithms
and the selection of the components and architecture for the execution platform. In particular, they also
specify the relative criticality of each algorithm process. Based on a statistical analysis of the failure rates,
which should be part of the characterization of the execution platform's library, designers specify the
expected set of platform faults, that is, the fault model. Then, we use this information to (1) automatically
deduce the necessary software process replication, (2) distribute each process on the execution platform,
and (3) derive an optimal scheduling of the processes on each ECU to satisfy the overall timing constraints.
Together, the three steps (replication, mapping, and scheduling) result in the automatic deployment of the
embedded software on the distributed execution platform. Platforms export performance estimates, and
we can determine for each control process its worst-case execution time (WCET) on a given component.²
Then, we can use a set of verification tools to assess the quality of the deployment; most notably, we have
a static timing analysis tool to predict the worst-case latency from sensors to actuators. When the final
results do not satisfy the timing constraints for the control application, precise guidelines are returned
to the designers, who may use them to refine the control algorithms, modify the execution platform, and
revisit the fault model. While being centered on a synthesis step, our approach does not exclude the use of
predesigned components, such as TTA modules, communication protocols such as TTP [19], and fault-tolerant
operating systems. These components can be part of a library of building blocks that the designer
uses to further explore the fault-coverage/cost trade-off. Finally, the proposed methodology is founded
on a new MoC, FTDF, thus making it amenable to the integration of formal validation techniques. The
corresponding API platform consists primarily of the FTDF MoC.
22.5.2.1 Fault Model
For the sake of simplicity we assume fail silence: components either provide correct results or do not
provide any result at all. Recent work shows that fail-silent platforms can be realized with limited
area overhead and virtually no performance penalty [20]. The fail-silence assumption can be relaxed
if invalid results are detected otherwise, as in the case of CRC-protected communication and voted
computation [21]. However, it is important to note that the proposed API platform (FTDF) is fault-model
independent. For instance, the presence of value errors, where majority voting is needed, can be accounted
for in the implementation of the FTDF communication media (see Section 22.5.3). The same is true for
Byzantine failures, where components can have any behavior, including malicious ones like coordinating
to bring the system down to a failure [22]. In addition to the type of faults, a fault model also specifies the
number (or even the mix) of faults to be tolerated [23]. A statistical analysis of the various components'
MTBFs (Mean Time Between Faults), their interactions, and MTBRs (Mean Time Between Repairs) should
determine which subsystems have a compound MTBF that is so short as to be of concern, and this analysis should be
part of the platform component characterization. The use of failure patterns to capture these
interactions effectively was proposed in Reference 24, which is the basis of our approach [17].
22.5.2.2 Setup
Consider the feedback-control system in Figure 22.3. The control system repeats the following sequence at
each period Tmax: (1) sensors are sampled, (2) software routines are executed, and (3) actuators are updated
with the newly processed data. The actuator updates are applied to the plant at the end of the period to
help minimize jitter, a well-known technique in the real-time control community [25,26]. In order to
guarantee correct operation, the WCET among all possible iterations, that is, the worst-case latency from
sensors to actuators, must be smaller than the given period Tmax (the real-time constraint), which is
determined by the designers of the controller based on the characteristics of the application. Moreover,
the critical subset of the control algorithms must be executed in spite of the specified platform faults.
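The sample-compute-actuate pattern above can be sketched in code. This is a minimal illustration with invented names (`run_iteration`, simulated timing), not the chapter's runtime: the key point it shows is that actuator updates are packaged into a deferred step applied only at the period boundary, and that the real-time constraint WCET <= Tmax is checked up front.

```python
# Sketch (hypothetical names): one iteration of the sample-compute-actuate
# pattern, with actuator updates deferred to the period boundary to
# minimize jitter. Timing is simulated rather than real.
def run_iteration(sensors, routines, actuators, t_max, wcet):
    """Execute one control period; the real-time constraint is wcet <= t_max."""
    assert wcet <= t_max, "WCET exceeds the period: real-time constraint violated"
    samples = [s() for s in sensors]          # (1) sample sensors
    outputs = samples
    for routine in routines:                  # (2) run software routines
        outputs = routine(outputs)
    # (3) return a closure to be invoked at the end of the period T_max
    return lambda: [a(v) for a, v in zip(actuators, outputs)]

# Usage: the returned closure is invoked only at the period boundary.
sensed = lambda: 1.0
double = lambda xs: [2 * x for x in xs]
log = []
apply_at_boundary = run_iteration([sensed], [double], [log.append],
                                  t_max=0.010, wcet=0.004)
apply_at_boundary()
print(log)  # [2.0]
```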
22.5.2.3 Example
Figure 22.4 illustrates an FTDF graph for a paradigmatic feedback-control application, the inverted
pendulum control system. The controller is described as a bipartite directed graph G where the
vertices, called actors and communication media, represent software processes and data communication.
²See Reference 18 for some issues and techniques to estimate WCETs.
FIGURE 22.4 Controlling an inverted pendulum. (Three sensor actors feed an input actor; coarse and fine control tasks feed an arbiter; an output actor drives two actuators acting on the plant, with communication media m between all stages.)
FIGURE 22.5 A simple platform graph. (Three ECUs, ECU0 to ECU2, attached to two channels, CH0 and CH1.)
Figure 22.5 illustrates a possible platform graph (PG), where vertices represent ECUs and communication
channels and edges describe their interconnections.
22.5.2.4 Platform Characteristics
Each vertex of PG is characterized by its failure rate and by its timing performance. A failure pattern is
a subset of vertices of PG that may fail together during the same iteration, with a probability high enough
to be of concern. A set of failure patterns identifies the fault scenarios to be tolerated. Based on the timing
performance, we can determine the WCET of actors on the different ECUs and the worst-case transmission
time of data on channels. Graphs G and PG are related in two ways:
Fault-tolerance binding: for each failure pattern, the execution of a corresponding subset of the
actors of G must be guaranteed. This subset is identified a priori based on the relative criticality
assignment.
Functional binding: a set of mapping constraints and performance estimates indicates where on PG
each vertex of G may be mapped and the corresponding WCET.
These bindings are the basis for deriving a fault-tolerant deployment of G on PG. We use software replication
to achieve fault tolerance: critical routines are replicated statically (at compile time) and executed on
separate ECUs, and the processed data are routed over multiple communication paths to withstand channel
failures. In particular, for a deployment to be correct in the absence of faults, all actors and
data communications must be mapped onto ECUs and channels in PG. Then, for a correct fault-tolerant
deployment, critical elements of G must be mapped onto additional PG vertices to guarantee their correct
and timely execution under any possible failure pattern in the fault model.
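The fault-tolerance requirement on a deployment can be checked mechanically. The sketch below is illustrative (the function name and data layout are invented, and it ignores channel failures, which the full binding would also cover): for every failure pattern, each critical actor must retain at least one replica on a surviving ECU.

```python
# Sketch (illustrative names, not the chapter's tool): checking the
# fault-tolerance binding of a deployment -- for every failure pattern,
# each critical actor must keep at least one replica on a surviving ECU.
# Channel failures are ignored here for brevity.
def deployment_tolerates(mapping, critical_actors, failure_patterns):
    """mapping: actor -> set of ECUs hosting its replicas.
    Returns (ok, offending_actor, offending_pattern)."""
    for pattern in failure_patterns:
        for actor in critical_actors:
            if mapping[actor] <= pattern:   # all replicas lost together
                return False, actor, pattern
    return True, None, None

# coarse_ctrl is replicated on ECU0 and ECU1, so it survives any
# single-ECU failure; fine_ctrl (non-critical) is not replicated.
mapping = {"coarse_ctrl": {"ECU0", "ECU1"}, "fine_ctrl": {"ECU2"}}
ok, actor, pattern = deployment_tolerates(
    mapping, ["coarse_ctrl"], [{"ECU0"}, {"ECU1"}, {"ECU2"}])
print(ok)  # True
```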
22.5.2.5 Design Flow
Using the interactive design flow of Figure 22.6, designers:
Specify the controller (the top-left FTDF graph)
Assemble the execution platform (the top-right PG)
Specify a set of failure patterns (subsets of PG)
Specify the fault-tolerance binding (fault behavior)
Specify the functional binding
All this information contributes to specifying what the system should do and how it should be
implemented. A synthesis tool automatically:
Introduces redundancy in the FTDF graph
Maps actors and their replicas onto PG
Schedules their execution
Finally, a verification tool checks whether the fault-tolerant behavior and the timing constraints are met.
If no solution is found, the tool returns a violation witness that can be used to revisit the specification and
to provide hints to the synthesis tool.
FIGURE 22.6 Interactive design flow. (The fault behavior and mapping relate the FTDF graph of sensors, input, coarse/fine control tasks, arbiter, output, and actuators to the platform graph of ECU0 to ECU2 and channels CH0 and CH1; the lower graph shows the controller after replication of the critical actors.)
22.5.3 The API Platform (FTDF Primitives)
In this section we present the structure and general semantics of the FTDF MoC. The basic building
blocks are actors and communication media. FTDF actors exchange data tokens at each iteration with
synchronous semantics [13].
An actor belongs to one of six possible classes: sensors, actuators, inputs, outputs, tasks, or arbiters.
Sensor and actuator actors read and update, respectively, the sensor and actuator devices interacting
with the plant. Input actors perform sensor fusion, output actors are used to balance the load on the
actuators, while task actors are responsible for the computation workload. Arbiter actors mix values
that come from actors of different criticality and reach the same output actor (e.g., the braking command
and the Antilock Braking System [ABS]).³ Finally, state memories are connected to actors and operate as
one-iteration delays. With a slight abuse of terminology, the terms state memory and memory actor are
used interchangeably in this section.
22.5.3.1 Tokens
Each token consists of two fields: Data, the actual data being communicated, and Valid, a boolean flag
indicating the outcome of fault detection on this token. When Valid is false, either no data is available
for this iteration or the available data is not correct. In both cases the Data field should be ignored. The
Valid flag is just an abstraction of more concrete and robust fault detection implementations.
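The two-field token can be sketched directly. The class below is illustrative, not FTDF's actual API; it only mirrors the Data/Valid structure described above.

```python
# Sketch of the token abstraction described above (field names follow the
# text; the class itself is illustrative, not FTDF's actual API).
from dataclasses import dataclass
from typing import Any

@dataclass
class Token:
    data: Any = None
    valid: bool = False   # outcome of fault detection for this iteration

INVALID = Token()         # no data available, or data detected as incorrect

t = Token(data=3.14, valid=True)
print(t.valid, INVALID.valid)  # True False
```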
22.5.3.2 Communication Media
Communication occurs via unidirectional (possibly many-to-many) communication media. All replicas
of the same source actor write to the same medium, and all destination actors read from it. Media act
as both mergers and repeaters, sending the single merged result to all destinations. More formally, the
medium provides the correct merged result, or an invalid token if no correct result is determined.
Assuming fail silence, merging amounts to selecting any of the valid results; assuming value errors,
majority voting is necessary; assuming Byzantine faults requires rounds of voting (see the consensus
problem [27]). Communication media must be distributed to withstand platform faults. Typically, this
means having a repeater on each source ECU and a merger on each destination ECU (broadcasting
communication channels greatly help reduce message traffic). Using communication media, actors
always receive exactly one token per input, and the application behavior is independent of the type of platform
faults. The transmission of tokens is initiated by the active elements: regular actors and memory actors.
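The two simpler merge policies named above can be sketched as follows (Byzantine voting rounds are omitted). This is an illustrative fragment with invented helper names, not the chapter's implementation: under fail silence any valid token may be selected, while under value errors a strict majority over valid tokens is required.

```python
# Sketch (illustrative): the two merge policies mentioned above. Under
# fail silence any valid token can be selected; under value errors a
# strict majority vote over the valid tokens is required.
from collections import Counter

def merge_fail_silent(tokens):
    for t in tokens:
        if t["valid"]:
            return t                             # any valid result will do
    return {"valid": False, "data": None}

def merge_majority(tokens):
    votes = Counter(t["data"] for t in tokens if t["valid"])
    if votes:
        data, count = votes.most_common(1)[0]
        if count > sum(votes.values()) // 2:     # strict majority
            return {"valid": True, "data": data}
    return {"valid": False, "data": None}        # no majority: invalid token

replicas = [{"valid": True, "data": 7}, {"valid": True, "data": 7},
            {"valid": True, "data": 9}]          # one value error
print(merge_majority(replicas))  # {'valid': True, 'data': 7}
```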
22.5.3.2.1 Regular Actors
When an actor fires, its sequential code is executed. This code is: stateless (state must be stored in memory
actors), deterministic (identical inputs generate identical outputs), nonblocking (once fired, it does not
wait for further tokens, data, or signals from other actors), and terminating (bounded WCET). The firing
rule specifies which subsets of input tokens must be valid to fire the actor, typically all of them (AND
firing rule). However, the designer may need to specify partial firing rules for input and arbiter actors. For
example, an input actor reading data from three sensors may produce a valid result even when one of the
sensors cannot deliver data (e.g., when the ECU where the sensor is mapped is faulty).
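The AND rule and a partial firing rule can be contrasted in two lines each. This is a minimal sketch (names are illustrative): the partial rule here is "at least k of n valid inputs", matching the three-sensor example above.

```python
# Sketch: an AND firing rule versus a partial (k-of-n) firing rule, as
# described for input/arbiter actors. Names are illustrative.
def and_rule(valids):
    return all(valids)

def k_of_n_rule(valids, k):
    return sum(valids) >= k

sensor_valid = [True, True, False]   # one sensor's ECU is faulty
print(and_rule(sensor_valid), k_of_n_rule(sensor_valid, 2))  # False True
```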
22.5.3.2.2 Memory Actors (State Memories)
A memory provides its state at the beginning of an iteration and has a source actor, possibly replicated,
that updates its state at every iteration. State memories are analogous to latches in a sequential digital
circuit: they store the results produced during the current iteration for use in the next one.
Finally, FTDF graphs can express redundancy, that is, one or more actors may be replicated. All the
replicas of an actor v ∈ A are denoted by R(v) ⊆ A. Note that any two actors in R(v) are of the same type
and must compute the same function. This basic condition is motivated in Section 22.5.5, where replica
³We advocate running non-safety-critical tasks, for example, door controllers, on separate hardware. However, some
performance-enhancement tasks, for example, side-wind compensation, may share sensors and actuators with critical
tasks (steer-by-wire). It may be profitable to have them share the execution platform as well.
determinism is discussed. Note that the replication of sensors and actuators is not performed automatically
because it may have a major impact on cost; we discuss the implications of this choice in Reference 17.
22.5.4 Fault-Tolerant Deployment
The result of the synthesis is a redundant mapping L, that is, an association of elements of the FTDF
network with multiple elements of the execution platform, and, for each element in the execution platform,
a schedule S, that is, a total order in which actors should be executed and data should be transmitted.
A pair (L, S) is called a deployment. To avoid deadlocks, the total orders defined by S must be compatible
with the partial order in L, which in turn derives directly from the partial order in which the FTDF actors
in the application must be executed. To avoid causality problems, memory actors are scheduled before
any other actor, thus using the results of the previous iteration. Schedules based on total orders are called
static: there are no runtime decisions to make; each ECU and each channel controller simply follows the
schedule. However, in the context of a faulty execution platform, an actor may not receive enough valid
inputs to fire, and this may lead to starvation. This problem is solved by skipping an actor if it cannot fire
and by skipping a communication if no data is available [24].
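The skip-if-cannot-fire policy can be sketched for a single ECU's static schedule. This is an illustrative fragment (invented names, AND firing rule assumed): each actor either fires, producing a valid token, or is skipped, producing an invalid one, so the schedule always runs to completion with no runtime decisions beyond the validity test.

```python
# Sketch (illustrative, AND firing rule): executing a static schedule in
# which an actor that cannot fire is skipped rather than awaited, so a
# faulty platform cannot starve the schedule.
def run_schedule(schedule, inputs, valid):
    """schedule: total order of actors; inputs: actor -> input token names;
    valid: token name -> bool. Each actor produces a token named after itself."""
    for actor in schedule:
        if all(valid.get(tok, False) for tok in inputs[actor]):
            valid[actor] = True     # actor fires, producing a valid token
        else:
            valid[actor] = False    # skip: not enough valid inputs
    return valid

valid = {"s0": True, "s1": False}   # sensor s1 lost to an ECU fault
inputs = {"input": ["s0"], "ctrl": ["input"], "out": ["ctrl", "s1"]}
result = run_schedule(["input", "ctrl", "out"], inputs, valid)
print(result)  # "out" is skipped because its "s1" input is invalid
```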
22.5.5 Replica Determinism
Given a mapping L, it is important to preserve replica determinism: if two replicas of the same actor fire,
they produce identical results. For general MoCs, the order of arrival of results must also be the same for all
replicas; the synchrony of FTDF makes this check unnecessary. Clearly, the execution platform must contain
the implementation of a synchronization algorithm [28].
Replica determinism in FTDF can be achieved by enforcing two conditions: (1) all replicas compute the
same function, and (2) for any failure pattern, if two replicas get a firing subset of inputs, they get the
same subset of inputs. Condition (1) is enforced by construction by allowing only identical replicas.
Condition (2) amounts to a consensus problem, and it can either be checked at runtime (as for Byzantine
agreement rounds of voting) or analyzed statically at compile time (if the fault model is milder).
Our interest in detectably faulty execution platforms makes the latter approach appear more promising
and economical. Condition (2) is trivially true for all actors with the AND firing rule. For input and
arbiter actors the condition must be checked and enforced [17].
22.6 Analog Platforms
Emerging applications such as multimedia devices (video cell phones, digital cameras, and wireless PDAs, to
mention but a few) are driving the SoC market towards the integration of analog components in almost
every system. Today, system-level analog design is a process dominated by heuristics. Given a set of
specifications/requirements that describes the system to be realized, the selection of a feasible (let alone
optimal) implementation architecture comes mainly out of experience. Usually, what is achieved is just
a feasible point at the system level, while optimality is sought locally at the circuit level. This practice is
caused by the number of second-order effects that are very hard to deal with at a high level without actually
designing the circuit. Platform-based design can provide the necessary insight to develop a methodology
for analog components that takes into consideration system-level specifications and can choose among a
set of possible solutions, including digital approaches wherever it is feasible to do so. If the productivity
gap between analog and digital components is not overcome, the time-to-market and design quality of SoCs
will be seriously affected by the small analog sections required to interface with the real world. Moreover,
SoC designs will expose system-level explorations that would be severely limited if the analog section is
not provided with a proper abstraction level that allows system performance estimation in an efficient way
and across the analog/digital boundary. Therefore, there is a strong need to develop more abstract design
techniques that can encapsulate analog design into a methodology that could shorten design time without
compromising the quality of the solutions, leading to a hardware/software/analog co-design paradigm for
embedded systems.
22.6.1 Definitions
The platform abstraction process can be extended to analog components in a very natural way. Deriving
behavioral and performance models, however, is more involved due to the tight dependency of analog
components on device physics, which requires the use of continuous mathematics for modeling the relations
among design variables. Formally, an Analog Platform (AP) consists of a set of components, each
decorated with:
a set of input variables u ∈ U, a set of output (performance) variables y ∈ Y, a set of internal
variables (including state variables) x ∈ X, and a set of configuration parameters κ ∈ K; some parameters
take values in a continuous space, some take values in a discrete set, for example, when they encode
the selection of a particular alternative.
a behavioral model that expresses the behavior of the component, represented implicitly as
F(u, y, x, κ) = 0, where F(·) may include integro-differential components; in general, this set
determines uniquely x and y given u and κ. Note that the variables considered here can be functions
of time and that the functional F includes constraints on the set of variables (for example, the
initial conditions on the state variables).
a feasible performance model. Let φ_y(u, κ) denote the map that computes the performance y
corresponding to a particular value of u and κ by solving the behavioral model. The set of feasible
analog performances (such as gain, distortion, power) is the set described by the relation
P(y(u, κ)) = 1, where y(u, κ) = φ_y(κ, u).
validity laws L(u, y, x, κ) ≤ 0, i.e., constraints (or assumptions) on the variables and parameters
of the component that define the range of the variables for which the behavioral and performance
models are valid.
Note that there is no real need to define the feasible performance model, since the necessary information
is all contained in the behavioral model. We prefer to keep them separate because of the use we make of
them in explaining our approach.
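The four ingredients of an AP component can be sketched as a data structure. This is a hypothetical structure (the class and the toy amplifier are invented for illustration, with κ written as `k`): the behavioral model is an implicit residual F(u, y, x, κ), the performance model P a predicate on y, and the validity laws L(u, y, x, κ) ≤ 0 a signed constraint function.

```python
# Sketch (hypothetical structure): an Analog Platform component as the
# ingredients listed above -- behavioral model F, performance model P,
# and validity laws L, over variables (u, y, x) and parameters kappa (k).
from dataclasses import dataclass
from typing import Callable

@dataclass
class APComponent:
    behavioral: Callable   # F(u, y, x, k): residual, implicit form F(...) == 0
    performance: Callable  # P(y) -> bool: is y a feasible performance point?
    validity: Callable     # L(u, y, x, k): models valid where L(...) <= 0

# Toy linear amplifier: y = k * u, feasible gain in [1, 10], valid |u| < 0.1.
amp = APComponent(
    behavioral=lambda u, y, x, k: y - k * u,
    performance=lambda y: 1.0 <= y <= 10.0,
    validity=lambda u, y, x, k: abs(u) - 0.1,
)
print(amp.behavioral(0.05, 0.25, None, 5.0) == 0.0,  # model satisfied
      amp.validity(0.05, 0.25, None, 5.0) <= 0.0)    # inside validity range
```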
At the circuit level of abstraction, the behavioral models are the circuit equations, with x being the
voltages, currents, and charges, and y being a subset of x and/or a function of x and κ when they express
performance figures such as power or gain. To compute performance models, we need to solve the
behavioral models, which implies solving ordinary differential equations, a time-consuming task. In the past,
methods were proposed to approximate the relation between y and κ (the design variables) with an explicit
function. In general, to compute this approximation, a number of evaluations of the behavioral model
for a number of parameter values is performed (by simulation, for example) and then an interpolation or
approximation scheme is used to derive the approximation to the map φ_y. We see in Section 22.6.2 how
to compute an approximation to the feasible performance set directly.
Example 22.1 Considering an OTA for an arbitrary application, we can start building a platform from the
circuit level by defining:
U as the set of all possible input voltages V_in(t) s.t. |V_in| < 100 mV and bandwidth of V_in < 3 MHz;
Y as the space of vectors {V_out(t), gain, IIP3, r_out} (IIP3 is the third-order intermodulation intercept
point referred to the input, r_out is the output resistance), X the set of all internal currents and voltages,
and K the set of transistor sizings.
for a transistor-level component, the behavioral model F consists of the solution of the circuit
equations, e.g., through a circuit simulator.
φ_y(u, κ) as the set of all possible y.
validity laws L are obtained from Kirchhoff's laws when composing individual transistors and other
constraints, e.g., maximum power ratings or breakdown voltages.
We can build a higher-level (level 1) OpAmp platform where:
U¹ is the same; Y¹ is the output voltage of the OpAmp; X is empty; K¹ consists of the possible {gain,
IIP3, r_out} triples (thus it is a projection of Y⁰);
F¹ can be expressed in explicit form:
y_1(t) = h(t) ∗ (a_1 u(t) + a_3 u(t)³) + noise,  y_2 = a_1,  y_3 = √((4/3)|a_1/a_3|)
φ_y is the set of possible y;
there are no validity constraints; L < 0 always.
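The explicit level-1 model can be exercised numerically. A caution on the code below: the IIP3 expression is reconstructed here as y_3 = √((4/3)|a_1/a_3|), the standard input-referred intercept amplitude for a memoryless cubic nonlinearity, since the printed formula in this edition is garbled; the helper name is invented.

```python
# Sketch: evaluating the level-1 OpAmp performance figures from the
# polynomial coefficients a1 (linear gain) and a3 (cubic term).
# y3 = sqrt((4/3) * |a1/a3|) is the usual IIP3 amplitude for a memoryless
# cubic nonlinearity; the chapter's formula is assumed equivalent.
import math

def level1_performance(a1: float, a3: float):
    gain = a1                                     # y2
    iip3 = math.sqrt((4.0 / 3.0) * abs(a1 / a3))  # y3 (input amplitude)
    return gain, iip3

gain, iip3 = level1_performance(a1=10.0, a3=-1.0)
print(gain, round(iip3, 3))  # 10.0 3.651
```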
When a platform instance is considered, we have to compose the models of the components to obtain
the corresponding models for the instance. The platform instance is then characterized by
a set of internal variables of the platform ξ = [ξ_1, ξ_2, ..., ξ_n],
a set of inputs of the platform, h ∈ H,
a set of performances π ∈ Π,
a set of parameters ζ ∈ Z.
The variable names are different from the names used to denote the variables of the components to
stress that there may be situations where some of the component variables change roles (for example, an
input variable of one component may become an internal variable; a new parameter can be identified in the
platform instance that is not visible or useful at the component level). To compose the models, we have to
include in the platform the composition rules. The legal compositions are characterized by the interconnect
equations, which specify which variables are shared when composing components, and by constraints that
define when the composition is indeed possible. These constraints may involve ranges of variables as well
as nonlinear relations among variables. Formally, a connection establishes a pairwise equality between
internal variables (for example, ξ_i = ξ_j), inputs, and performances; we denote the set of interconnect relations
by c(h, ξ, π, ζ) = 0, which are in general a set of linear equalities. The composition constraints are denoted
by L(h, ξ, π, ζ) ≤ 0, which are, in general, nonlinear inequalities. Note that in the platform instance all
internal variables of the components are present, as well as all input variables. In addition, there is no
internal or input variable of the platform instance that is not an internal or input variable of one of the
components. The behavioral model of the platform instance is the union of all behavioral models of
the components conjoined with the interconnect relations. The validity laws are the conjunction of the
validity laws of the components and of the composition constraints. The feasible performance model may
be defined anew on the platform instance, but it may also be obtained by composition of the performance
models of the components. There is an important and interesting case when the composition may be
done considering only the feasible performance models of the components, obtained by appropriate
approximation techniques. In this case, the composition constraints assume the semantics of defining
when the performance models may be composed. For example, if we indicate with λ the parameters
related to internal nodes that characterize the interface in Figure 22.7(a) (e.g., input/output impedance
in the linear case), then matching between the λs has to be enforced during composition. In fact, both P_A
and P_B were characterized with specific λs (Figure 22.7[b]), so L has to constrain the A-B composition
consistently with the performance models. In this case, an architectural exploration step, consisting of forming
different platform instances out of the component library and evaluating them, can be performed very
quickly, albeit possibly with restrictions on the space of the considered instances caused by the composition
constraints.
FIGURE 22.7 Interface parameter λ during composition A-B and characterization of A and B. (Panel [a]: platform composition, A driving B through the interface parameter λ; panel [b]: characterization setup, where each platform is characterized against an equivalent model of the other through source and load parameters λ_S and λ_L.)
Example 22.2 We can build a level 2 platform consisting of an OpAmp (OA) and a unity-gain buffer
following it (UB; the reader can easily find a proper definition for it), and then define a higher-level
OpAmp platform component so that:
ξ_1 = V_in^OA, ξ_2 = V_out^OA, ξ_3 = V_in^UB, ξ_4 = V_out^UB, connected in series by specifying ξ_2 = ξ_3;
h, connected to ξ_1, is the set of input voltages V_in(t);
Π is the space of π_1(t), the cascade response in time, π_2 = gain, and π_3 = IIP3. In this case π_2
immediately equals y_2^OA, while π_3 is a nonlinear function of y^OA and y^UB;
Z consists of all parameters specifying a platform instance; in this case we may have Z = Y^OA × Y^UB.
a platform-instance composability law L requires that the load impedance Z_L > 100 r_out both at
the output of the OpAmp and of the unity buffer.
22.6.2 Building Performance Models
An important part of the methodology is obtaining performance models. We already mentioned that we
need to approximate the set Ȳ explicitly, eliminating the dependence on the internal variables x. To do so,
a simulation-based approach is proposed.
22.6.2.1 Performance Model Approximation
In general terms, simulation maps a configuration set (typically connected) K into a performance set in
Y, thus establishing a relation among points belonging to the mapped set. Classic regression schemes
provide an efficient approximation to the mapping function φ(·); however, our approach requires dealing
with performance data in two different ways. The first one, referred to as the performance model P, allows
discriminating between points in Ȳ and points in Y\Ȳ. The second one, φ⁻¹(·), implements the
inverse mapping from Ȳ into K, used to map down from a higher-level platform layer to a lower one.
However, fundamental issues (i.e., φ(·) being an invertible function) and accuracy issues (a regression from
R^m into R^n) suggest a table-lookup implementation for φ⁻¹(·), possibly followed by a local optimization
phase to improve the mapping. Therefore, we will mainly focus on basic performance models P.
The set Ȳ ⊆ Y defines a relation in Y denoted by P. We use Support Vector Machines (SVMs) as
a way of approximating the performance relation P [29]. SVMs provide approximating functions of the
form

f(x) = sign( Σ_i α_i e^(−γ|x−x_i|²) − ρ )    (22.1)
where x is the vector to be classified, the x_i are observed vectors, the α_i are weighting multipliers, ρ is a biasing
constant, and γ is a parameter controlling the fit of the approximation. More specifically, SVMs exploit
mappings to Hilbert spaces so that hyperplanes can be used to perform classification. Mapping to
high-dimensional spaces is achieved through kernel functions, so that a kernel k(·, ·) is associated with each
point. Since the only general assumption we can make on φ(·) is continuity, and on K connectivity⁴,
we can only deduce that Ȳ is connected as well. Therefore, the radial basis function (Gaussian) kernel is
chosen, k(x, x′) = e^(−γ|x−x′|²), where γ is a parameter of the kernel and controls the width of the kernel
function around each point. We resort to a particular formulation of SVMs known as one-class SVM, where an
optimal hyperplane is determined to separate the data from the origin. The optimal hyperplane can be computed
very efficiently through a quadratic program, as detailed in [30].
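A decision function of the form of Equation (22.1) can be evaluated directly. The sketch below is illustrative: in a real flow the α_i and ρ would come from one-class SVM training (e.g., scikit-learn's `OneClassSVM`), whereas here uniform weights and a hand-picked ρ are assumed just to show how the classifier separates points inside the feasible performance set from points far outside it.

```python
# Sketch (uniform weights, hand-picked rho -- not a trained model):
# evaluating a classifier of the form of Equation (22.1),
#   f(x) = sign( sum_i alpha_i * exp(-gamma * |x - x_i|^2) - rho ).
import math

def decision(x, support, alphas, rho, gamma):
    s = sum(a * math.exp(-gamma * sum((xi - si) ** 2 for xi, si in zip(x, sv)))
            for a, sv in zip(alphas, support))
    return 1 if s - rho >= 0 else -1

# Observed feasible performance points (e.g., gain, power) and two queries.
support = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1)]
alphas = [1.0 / len(support)] * len(support)
print(decision((1.0, 1.0), support, alphas, rho=0.5, gamma=1.0),  # inside: 1
      decision((5.0, 5.0), support, alphas, rho=0.5, gamma=1.0))  # outside: -1
```

With the Gaussian kernel, points near the observed samples accumulate weight above ρ and classify as feasible; distant points decay to zero and classify as infeasible, which is exactly the Ȳ versus Y\Ȳ discrimination required of P.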
22.6.2.2 Optimizing the Approximation Process
Sampling schemes for approximating unknown functions are exponentially dependent on the size of the
function support. In the case of circuits, none but the very simplest could realistically be characterized
in this way. Fortunately, there is no need to sample the entire space K, since we can use additional
information obtained from design considerations to exclude parts of the parameter space. The set of
interesting parameters is delimited by a set of constraints of three types:
topological constraints derived from the use of particular circuit structures, such as two stacked
transistors sharing the same current or a set of V_DS summing to zero;
physical constraints induced by device physics, such as the V_GS-V_DS relation that enforces saturation,
or g_m-I_D relations;
performance constraints on circuit performances, such as the minimum gain or minimum phase
margin that can be achieved.
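Constraint-driven pruning of the configuration space can be sketched as a rejection sampler. This is a toy illustration (invented names and invented numeric bounds, with 0.4 V standing in for a threshold voltage): one physical constraint (saturation) and one performance constraint (a current budget) discard uninteresting configurations before any expensive simulation is run.

```python
# Sketch (toy bounds, illustrative names): rejection sampling of a circuit
# configuration space, keeping only configurations that satisfy a physical
# constraint (saturation: Vds >= Vgs - Vth, with Vth = 0.4 V assumed) and
# a performance constraint (drain current within a power budget).
import random

def sample_configs(n, seed=0):
    rng = random.Random(seed)
    kept = []
    while len(kept) < n:
        vgs = rng.uniform(0.0, 1.2)
        vds = rng.uniform(0.0, 1.2)
        i_d = rng.uniform(0.0, 1e-3)
        if vds < vgs - 0.4:        # physical: device out of saturation
            continue
        if i_d > 5e-4:             # performance: exceeds current budget
            continue
        kept.append((vgs, vds, i_d))
    return kept

configs = sample_configs(100)
print(len(configs))  # 100
```

Only the kept configurations would then be simulated to build the performance model, which is how the constraints shrink the effective sampling space K.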
Additional constraints can be added as the designer's understanding of the circuit improves. The more constraints
we add, the smaller the interesting configuration space K. However, if a constraint is tight, i.e., it either
defines lower-dimensional manifolds (for example, when the constraint is an equality) or the measure of
the manifold is small, it is more likely to introduce some bias in the sampling mechanism because of
the difficulty of selecting points in these manifolds. To eliminate this ill-conditioning effect, we relax
these constraints to include a larger set of interesting parameters. We adopt a statistical means of relaxing
constraints by introducing random errors with the aim of dithering systematic errors and recovering
accuracy in a statistical sense. Given an equality constraint f(κ) = 0 and its approximation
f̂(κ) = 0, we derive a relaxation [...] Ȳ_top that defines the set
of achievable performances at the top level. The intersection of the two sets defines the feasible set for the
optimization process. The result of the process is a y_opt^top. Then the process maps the selected
point back to the lower levels of the hierarchy. If the abstractions are conservative, the top-down process is
straightforward. Otherwise, at each level of the hierarchy, we have to verify it using the performance models,
the behavioral models, and the validity laws. In some cases, a better design may be obtained by introducing
in the top-down phase cost functions and constraints that are defined only at a particular abstraction
level. In this case, the space of achievable performances intersected with this new set of constraints defines
the search space for the optimization process. At times, it is more convenient to project the cost
function and the constraints of the higher-level abstraction down to the next level. In this case, the
search space is the result of the intersection of three sets in the performance space, and the cost function is
a combination of the projected cost function and the one defined at this level. A flow chart summarizing
the top-down flow with platforms is shown in Figure 22.10. Figure 22.11 reports the set of configurations
evaluated during an optimization run for the UMTS front-end in [32], visualizing how multiple
topologies are exploited in selecting optimal points.
FIGURE 22.8 Bottom-up phase for generating an AP. (For each topology: derive the ACG and a nominal configuration, generate P, define the behavioral model and the performance model P, then select a new topology and repeat.)
FIGURE 22.9 Sample model hierarchy for an LNA platform. The root node provides performance constraints for
a generic LNA, which is then refined by more detailed P for specific classes of LNAs. (Root: P(G, NF, P, IP3); refinements: tuned P_T(G, NF, P, IP3, f_0, Q), np-input P_np(G, NF, P, IP3, f_0, Q, IP2), active-L P_L(G, NF, P, IP3, f_0, Q), wideband P_W(G, NF, P, IP3, f_3dB).)
The peculiarity of a platform approach to mixed-signal design resides in the accurate performance model
constraints P that propagate to the top-level, architecture-related constraints. For example, a platform
stack can be built where multiple analog implementation architectures are presented at a common level
of abstraction together with digital enhancement platforms (possibly including several algorithms and
hardware architectures), each component being annotated with feasible performance spaces. Solving the
system design problem at the top level, where the platforms contain both analog and digital components,
allows selecting optimal platform instances in terms of analog and digital solutions, comparing how
different digital solutions interact with different analog topologies, and finally selecting the best tradeoff.
The final verification step is also greatly simplified by the platform approach since, in the end, the models
and performances used in the top-down phase were obtained with a bottom-up scheme. Therefore, a
consistency check of models, performances, and composition effects is all that is required at a hierarchical
FIGURE 22.10 Top-down phase for analog design-space exploration. (Build the system with APs; define a formal set of feasibility conditions; define an objective function for optimization; optimize the system constraining behavioral models to their P; refine/add platforms as needed; return optimal performances and candidate solutions.)
level, followed by more costly, low-level simulations that check for possibly important effects that were
neglected when characterizing the platform.
22.6.4 Reconfigurable Platforms
Analog platforms can also be used to model programmable fabrics. In the digital implementation platform
domain, FPGAs provide a very intuitive example of a platform, for example, when including microprocessors on
chip. The appearance of Field Programmable Analog Arrays (FPAAs) [33] constitutes a new attempt to build
reconfigurable Analog Platforms. A platform stack can be built by exploiting the software tools that allow
mapping complex functionalities (filters, amplifiers, triggers, and so on) directly onto the array. The top-level
platform, then, provides an API to map and configure analog functionalities, exposing analog hardware
at the software level. By exploiting this abstraction, not only is design exploration greatly simplified,
but new synergies between higher layers and analog components can be leveraged to further increase
flexibility/reconfigurability and optimize the system. From this abstraction level, implementing a
functionality with digital signal processing (FPGA) or analog processing (FPAA) becomes subject to
system-level optimization while exposing the same abstract interface. Moreover, very interesting tradeoffs
can be explored by exploiting different partitionings between analog and digital components and leveraging
the reconfigurability of the FPAA. For example, limited analog performance can be mitigated by proper
reconfiguration of the FPAA, so that a tight interaction between analog and digital subsystems can provide
a new optimum from the system-level perspective.
22.7 Concluding Remarks
We defined PBD as an all-encompassing intellectual framework in which scientific research, design tool
development, and design practices can be embedded and justified. In our definition, a platform is simply
FIGURE 22.11 Example of architecture selection during the top-down phase (optimization trace; axes: NF versus Pd). In the picture, an LNA is being selected.
Circles correspond to architecture 1 instances, crosses to architecture 2 instances. The black circle is the optimal LNA
configuration. It can be inferred that, after an initial exploration phase alternating both topologies, simulated annealing
finally focuses on architecture 1 to converge.
an abstraction layer that hides the details of the several possible implementation refinements of the
underlying layer. PBD allows designers to trade off the various components of manufacturing, NRE, and
design costs, while sacrificing as little potential design performance as possible. We presented examples
of these concepts at different key articulation points of the design process, including system platforms as
composed of two platforms (micro-architecture and API), NPs, and APs.
This concept can be used to interpret traditional design steps in ASIC development such as synthesis
and layout. In fact, logic synthesis takes a level of abstraction consisting of an HDL representation (the HDL
platform) and maps it onto a set of gates that are defined in a library. The library itself is the gate-level
platform. The logic synthesis tools are the mapping methods that select a platform instance (a particular
netlist of gates that implements the functionality described at the HDL platform level) according to a
cost function defined on the parameters that characterize the quality of the elements of the library in
view of the overall design goals. The present difficulties in achieving timing closure in this flow indicate
the need for a different set of characterization parameters for the implementation platform. In fact, in
the gate-level platform the cost associated with the selection of a particular interconnection among gates is
not reflected, a major problem since the performance of the final implementation depends critically on
this. The present solution of making a larger step across platforms by mixing mapping tools such as logic
synthesis, placement, and routing may not be the right one. Instead, a larger pay-off could be had by
changing levels of abstraction and including better parametrization of the implementation platform.
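The cost-driven selection of a platform instance can be illustrated with a small sketch (Python; the library cells, their area and delay figures, and the cost weights are invented for the example and are not library data from the chapter). It also shows how shifting the weight toward delay changes which instance is chosen, which is exactly where a poor parametrization of the implementation platform hurts timing closure.

```python
# Hypothetical sketch of platform mapping as cost-driven selection: each
# function node at the HDL level is mapped to a gate-level platform
# instance (a library cell) minimizing a cost that mixes area and delay.
# All names and numbers below are illustrative, not taken from the text.

LIBRARY = {  # gate-level platform: cell -> (area, delay in ns)
    "NAND2_X1": (1.0, 0.30),
    "NAND2_X4": (2.5, 0.12),
    "NOR2_X1":  (1.1, 0.35),
}

def map_node(candidates, area_weight=1.0, delay_weight=5.0):
    """Select the platform instance (cell) minimizing the cost function."""
    def cost(cell):
        area, delay = LIBRARY[cell]
        return area_weight * area + delay_weight * delay
    return min(candidates, key=cost)

# A NAND function can be implemented by either NAND2 variant; the choice
# flips once the cost function weighs delay heavily enough.
best = map_node(["NAND2_X1", "NAND2_X4"])
```

With the default weights the smaller cell wins; raising `delay_weight` makes the faster cell win, mirroring how the characterization parameters exposed by the platform drive the mapping outcome.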
We argued in this chapter that the value of PBD can be multiplied by providing an appropriate set of
tools and a general framework where platforms can be formally defined in terms of rigorous semantics,
manipulated by appropriate synthesis and optimization tools, and verified. Examples of platforms have
been given using the concepts that we have developed. We conclude by mentioning that the Metropolis
design environment [34], a federation of integrated analysis, verification, and synthesis tools supported
by a rigorous mathematical theory of meta-models and agents, has been designed to provide a general
open-domain PBD framework.
Acknowledgments
We gratefully acknowledge the support of the Gigascale Silicon Research Center (GSRC), the Center for
Hybrid and Embedded Software Systems (CHESS) supported by an NSF ITR grant, the Columbus Project
of the European Community, and the Network of Excellence ARTIST. Alberto Sangiovanni-Vincentelli
would like to thank Alberto Ferrari, Luciano Lavagno, Richard Newton, Jan Rabaey, and Grant Martin for
their continuous support in this research.
We also thank the members of the DOP Center of the University of California at Berkeley for their
support and for the atmosphere they created for our work. The Berkeley Wireless Research Center
and our industrial partners (in particular, Cadence, Cypress Semiconductors, General Motors, Intel,
Xilinx, and ST Microelectronics) have contributed designs and continuous feedback to make this
approach more solid. Felice Balarin, Jerry Burch, Roberto Passerone, Yoshi Watanabe, and the Cadence
Berkeley Labs team have been invaluable in contributing to the theory of meta-models and the Metropolis
framework.
References
[1] K. Keutzer, S. Malik, A.R. Newton, J. Rabaey, and A. Sangiovanni-Vincentelli. System level design:
orthogonalization of concerns and platform-based design. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 19(12), 2000.
[2] A.L. Sangiovanni-Vincentelli. Defining platform-based design. In EEDesign, February 2002.
Available at www.eedesign.com/story/OEG20020204S0062.
[3] Felice Balarin, Massimiliano Chiodo, Paolo Giusto, Harry Hsieh, Attila Jurecska, Luciano Lavagno,
Claudio Passerone, Alberto Sangiovanni-Vincentelli, Ellen Sentovich, Kei Suzuki, and Bassam
Tabbara. Hardware–Software Co-Design of Embedded Systems: The POLIS Approach. Kluwer
Academic Publishers, Boston/Dordrecht/London, 1997.
[4] Henry Chang, Larry Cooke, Merrill Hunt, Grant Martin, Andrew McNelly, and Lee Todd.
Surviving the SOC Revolution: A Guide to Platform Based Design. Kluwer Academic Publishers,
Boston/Dordrecht/London, 1999.
[5] A. Ferrari and A.L. Sangiovanni-Vincentelli. System design: traditional concepts and new
paradigms. In Proceedings of the International Conference on Computer Design, October 1999,
pp. 1–12.
[6] Marco Sgroi. Platform-based design methodologies for communication networks. PhD thesis,
Electronics Research Laboratory, University of California, Berkeley, CA, December 2002.
[7] E.A. Lee and A. Sangiovanni-Vincentelli. A framework for comparing models of computation.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 17: 1217–1229,
1998.
[8] E.A. Lee. What's ahead for embedded software? Computer, 33: 18–26, 2000.
[9] J.C. Laprie, (Ed.). Dependability: Basic Concepts and Terminology in English, French, German,
Italian and Japanese, Vol. 5 Series Title: Dependable Computing and Fault-Tolerant Systems.
Springer-Verlag, New York, 1992.
[10] R. Alur, T. Dang, J. Esposito, Y. Hur, F. Ivancic, V. Kumar, I. Lee, P. Mishra, G.J. Pappas, and
O. Sokolsky. Hierarchical modeling and analysis of embedded systems. Proceedings of the IEEE,
91: 11–28, 2003.
[11] S. Edwards, L. Lavagno, E. Lee, and A.L. Sangiovanni-Vincentelli. Design of embedded systems:
formal methods, validation and synthesis. Proceedings of the IEEE, 85: 266–290, 1997.
[12] J. Eker, J.W. Janneck, E.A. Lee, J. Liu, J. Ludwig, S. Neuendorffer, S. Sachs, and Y. Xiong. Taming
heterogeneity — the Ptolemy approach. Proceedings of the IEEE, 91: 127–144, 2003.
[13] A. Benveniste, P. Caspi, S. Edwards, N. Halbwachs, P. Le Guernic, and R. de Simone. The
synchronous languages 12 years later. Proceedings of the IEEE, 91: 64–83, 2003.
[14] R. Bannatyne. Time triggered protocol — fault tolerant serial communications for real-time
embedded systems. In Wescon/98 Conference Proceedings, 1998.
[15] R. Schwarz and P. Rieth. Global chassis control — integration of chassis systems.
Automatisierungstechnik, 51: 300–312, 2003.
[16] H. Kopetz and D. Millinger. The transparent implementation of fault tolerance in the time-
triggered architecture. In Dependable Computing for Critical Applications. San Jose, CA, 1999.
[17] C. Pinello, L.P. Carloni, and A.L. Sangiovanni-Vincentelli. Fault-tolerant deployment of embedded
software for cost-sensitive real-time feedback-control applications. In Proceedings of the European
Design and Test Conference. ACM Press, 2004.
[18] C. Ferdinand, R. Heckmann, M. Langenbach, F. Martin, M. Schmidt, H. Theiling, S. Thesing, and
R. Wilhelm. Reliable and precise WCET determination for a real-life processor. Lecture Notes in
Computer Science, 2211: 469–485, 2001.
[19] H. Kopetz and G. Grünsteidl. TTP — a protocol for fault-tolerant real-time systems. IEEE
Computer, 27: 14–23, 1994.
[20] M. Baleani, A. Ferrari, L. Mangeruca, A. Sangiovanni-Vincentelli, M. Peri, and S. Pezzini. Fault-
tolerant platforms for automotive safety-critical applications. In Proceedings of the International
Conference on Compilers, Architectures and Synthesis for Embedded Systems. ACM Press, 2003,
pp. 170–177.
[21] F.V. Brasileiro, P.D. Ezhilchelvan, S.K. Shrivastava, N.A. Speirs, and S. Tao. Implementing fail-silent
nodes for distributed systems. IEEE Transactions on Computers, 45: 1226–1238, 1996.
[22] L. Lamport, R. Shostak, and M. Pease. The Byzantine generals problem. ACM Transactions on
Programming Languages and Systems, 4: 382–401, 1982.
[23] H.S. Siu, Y.H. Chin, and W.P. Yang. Reaching strong consensus in the presence of mixed failure
types. IEEE Transactions on Parallel and Distributed Systems, 9, 1998.
[24] C. Dima, A. Girault, C. Lavarenne, and Y. Sorel. Off-line real-time fault-tolerant scheduling.
In Proceedings of the Euromicro 2001, Mantova, Italy, February 2001.
[25] T.A. Henzinger, B. Horowitz, and C.M. Kirsch. Embedded control systems development with
Giotto. In Proceedings of the Languages, Compilers, and Tools for Embedded Systems. ACM Press,
2001, pp. 64–72.
[26] A.J. Wellings, L. Beus-Dukic, and D. Powell. Real-time scheduling in a generic fault-tolerant
architecture. In Proceedings of the RTSS'98. Madrid, Spain, December 1998.
[27] M. Barborak, M. Malek, and A. Dahbura. The consensus problem in fault-tolerant computing.
ACM Computing Surveys, 25: 171–220, 1993.
[28] L. Lamport and P. Melliar-Smith. Byzantine clock synchronization. In Proceedings of the Third
ACM Symposium on Principles of Distributed Computing. ACM Press, New York, 1984, pp. 68–74.
[29] F. De Bernardinis, M.I. Jordan, and A.L. Sangiovanni-Vincentelli. Support vector machines for
analog circuit performance representation. In Proceedings of the Design Automation Conference,
June 2003.
[30] J. Platt. Sequential minimal optimization: a fast algorithm for training support vector machines.
Microsoft Research, MSR-TR-98-14, 1998.
[31] P. Bunus and P. Fritzson. A debugging scheme for declarative equation-based modeling languages.
In Practical Aspects of Declarative Languages: 4th International Symposium, p. 280, 2002.
[32] F. De Bernardinis, S. Gambini, F. Vinci, F. Svelto, R. Castello, and A. Sangiovanni-Vincentelli.
Design space exploration for a UMTS front-end exploiting analog platforms. In Proceedings of the
International Conference on Computer-Aided Design, 2004.
[33] I. Macbeth. Programmable analog systems: the missing link. In EDA Vision (www.edavision.com),
July 2001.
[34] F. Balarin, Y. Watanabe, H. Hsieh, L. Lavagno, C. Passerone, and A. Sangiovanni-Vincentelli.
Metropolis: an integrated electronic system design environment. IEEE Computer, 36: 45–52,
2003.
23
Interface Specification
and Converter
Synthesis
Roberto Passerone
Cadence Design Systems, Inc.
23.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23-1
23.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23-2
23.3 Automata-Based Converter Synthesis . . . . . . . . . . . . . . . . . . 23-4
Interface Specification • Requirements Specification • Synthesis
23.4 Algebraic Formulation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23-10
Trace-Based Solution • End-to-End Specification
23.5 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23-18
Acknowledgments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23-19
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23-19
23.1 Introduction
Reuse is an established technique in modern design methodologies to reduce the complexity of designing
a system. Design reuse complements a design methodology by providing precharacterized components
that can be put together to perform the desired function. Together with abstraction and refinement
techniques, design reuse is at the basis of such methodologies as platform-based design [1–3]. A platform
consists of a set of library elements, or resources, that can be assembled and interconnected according
to predetermined rules to form a platform instance. One step in a platform-based design flow involves
mapping a function or a specification onto different platform instances, and evaluating its performance.
By employing existing components and interconnection structures, reuse in a platform-based design flow
shifts the functional verification problem from the verification of the individual elements to the verification
of their interaction [4, 5]. This technique can be used at all levels of abstraction in a design in order to
come to a complete implementation.
A design process can therefore be simplified by using a methodology that promotes the reuse of existing
components, also known as intellectual property, or IPs.¹ However, despite the advantages of precharacterization,
the correct deployment of these blocks when the IPs have been developed by different groups
inside the same company, or by different companies, is notoriously difficult. Unforeseen interactions may

¹The term intellectual property is used to highlight the intangible nature of virtual components, which essentially
consist of a set of property rights that are licensed, rather than of a physical entity that is sold.
often make the behavior of the resulting design unpredictable. Design rules have been proposed that
try to alleviate the problem by forcing the designers to be precise about the behavior of the individual
components and to verify this behavior under a number of assumptions about the environment in which
they have to operate. While this is certainly a step in the right direction, it is by no means sufficient to
guarantee correctness: extensive simulation and prototyping are still needed on the compositions. Several
methods have been proposed for hardware and software components that encapsulate the IPs so that their
behavior is protected from the interaction with other components. Interfaces are then used to ensure the
compatibility between components. Roughly speaking, two interfaces are compatible if they fit together
as they are.
Simple interfaces, typically specified in the type system of a system description language, may describe
the types of values that are exchanged between the components. This is the case, for example, of high-
level programming languages and hardware description languages. More expressive interfaces, typically
specified informally in design documents, may describe the protocol for the component interaction [6–11].
Several formal methodologies have been proposed for specifying the protocol aspects of interfaces in a way
that supports automatic compatibility checks [7, 8, 12]. The key elements of these approaches are the
interpretation of an interface in the context of its environment, a model-independent formalism, and the
use of automata and game-theoretic algorithms for compatibility checking. With these approaches, given
interfaces for different IPs, one can check whether these IPs can be composed.
When components are taken from legacy systems or from third-party vendors, interface protocols are
unlikely to be compatible. However, this does not necessarily mean that components cannot be combined
together: approaches have been proposed that adapt the components by constructing a converter among
the incompatible communication protocols [10, 13]. We refer to these techniques collectively as interface
synthesis or converter synthesis. Thus, informally, two interfaces are adaptable if they fit together by
communicating through a third component, the adapter. If interfaces specify only value types, then
adapters are simply type converters. However, if interfaces specify interaction protocols, then adapters
are protocol converters. For instance, a protocol may be defined as a formal language (a set of strings
from an alphabet) and can be finitely represented using automata [10]. The problem of converting one
protocol into another can then be addressed by considering their conjunction in terms of the product of
the corresponding automata and by removing the states and transitions that lead to a violation of one
of the two protocols. The converter uses state information to rearrange the communication between the
original interfaces, in order to ensure compatibility. A specification in the form of a third component can
be used to define which rearrangements are appropriate in a given communication context. For instance,
it is possible to specify that the converter can change the timing of messages, but not their order, using an
n-bounded buffer, or that some messages may or may not be duplicated. In this work we initially review
this methodology, and then introduce a mathematically sound interpretation and generalization that can
be applied in several different contexts.
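The product-and-prune scheme just outlined can be sketched as follows (a simplified Python rendering, not the chapter's actual algorithm: both protocols are deterministic automata over one shared alphabet, each given as a transition dictionary, and a "violation" is modeled as reaching a state from which no good state remains reachable).

```python
# Simplified sketch of converter-skeleton construction: the product runs
# the two protocols in lockstep on each symbol; pruning then removes
# states from which no "good" (e.g., quiescent) state is reachable,
# i.e., states that would force a violation of one of the protocols.

def automaton_product(delta1, delta2, init1, init2):
    """Reachable product automaton; delta: (state, symbol) -> next state."""
    states, trans = {(init1, init2)}, {}
    frontier = [(init1, init2)]
    while frontier:
        s1, s2 = frontier.pop()
        for (q, sym), nxt1 in delta1.items():
            if q != s1 or (s2, sym) not in delta2:
                continue  # symbol not accepted by both protocols here
            nxt = (nxt1, delta2[(s2, sym)])
            trans[((s1, s2), sym)] = nxt
            if nxt not in states:
                states.add(nxt)
                frontier.append(nxt)
    return states, trans

def prune(states, trans, good):
    """Keep only states from which some good state is still reachable."""
    keep = set(good) & set(states)
    changed = True
    while changed:
        changed = False
        for (s, _), t in trans.items():
            if t in keep and s not in keep:
                keep.add(s)
                changed = True
    live = {k: v for k, v in trans.items() if k[0] in keep and v in keep}
    return keep, live
```

For two toy protocols that both cycle 0 –a→ 1 –b→ 0 but also share an escape transition 1 –a→ 2 into a dead state, pruning with {(0, 0)} as the good set keeps only the states and transitions on the a·b cycle.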
This chapter is organized as follows. First we review some related work in Section 23.2. Then, in
Section 23.3, we illustrate with an example the automata-based approach to the synthesis of protocol
converters. We then introduce more general frameworks in Section 23.4 and discuss the solution of the
protocol conversion problem in Section 23.4.1.
23.2 Related Work
One of the first approaches to interface synthesis was proposed by Borriello [14, 15], who introduces the
event graph to establish correct synchronization of the operations and to determine the data sequencing.
The event graph is constructed at a very low level of abstraction (waveforms), and can be derived from
the timing diagrams of a protocol. In this approach, the two protocols should be made compatible by
manually assigning labels to the data on both sides in order to establish the correct correspondence.
Because the specification is expressed in terms of the real timing of the signals, this approach can handle
both synchronous and asynchronous protocols. Sun and Brodersen [16] extend the approach by providing
a library of components that frees the user from considering lower-level details, without, however, lifting
the requirement of manually identifying the data correspondence.
Another approach is that of Akella and McMillan [17]: the protocols are described as two finite state
machines, while a third finite state machine represents the valid transfer of data. The correspondence
between the protocols is therefore embedded in this last specification. The synthesis procedure consists
of taking the product machine of the two protocols, which is then pruned of the invalid/useless states,
according to the specification. In the form proposed by the authors, the procedure, however, does not
account for data explicitly, so that the converter is unable to handle data width mismatches in the protocols.
A different approach is that taken by Narayan and Gajski [18]: first, the protocol specification is reduced
to the combination of five basic operations (data read/write, control read/write, time delay); the protocol
description is then broken into blocks (called relations) whose execution is guarded by a condition on
one of the control wires or by a time delay; finally, the relations of the two protocols are matched into
sets that transfer the same amount of data. Because the data is broken up into sets, this algorithm is
able to account for data width mismatch between the communicating parties. However, the procedural
specification of the protocols makes it difficult to adapt different sequencing (order) of the data, so that
only the synchronization problem is solved.
Some of the limitations above are addressed by the procedure proposed by Passerone et al. [10].
The specification is simplified by describing the protocols as regular expressions, which more closely
match the structure of a protocol, rather than as finite state machines (of course, the two formalisms
carry the same expressive power). In addition, typing information is used to automatically deduce the
correspondence of data between the communicating parties, so that a third specification for the valid
transfers is not necessary. The synthesis procedure then follows the approach proposed by Akella by
first translating the regular expressions into automata, then constructing a product machine, and finally
pruning it of the illegal states. This approach was then extended to also include a specification of the valid
transactions, and was cast in the framework of game theory to account for more complex properties, such
as liveness [13].
Recently, Siegmund and Müller [19] have proposed a similar approach where the regular expressions
are embedded in the description language, in this case SystemC, through the use of appropriate supporting
classes. The advantage is that the interface description can be simulated directly with the existing application.
However, in this approach the user is required to describe the converter itself, instead of having it
be generated automatically from a description of the communicating protocols. In other words, issues of
synchronization and data sequencing must be solved upfront. Register transfer level code for the interface
can then be generated automatically from the SystemC specification.
More recent work has focused on studying the above issues in a more general setting, generalizing
the approach to modeling interfaces and to synthesis by abstracting away from the particular model of
computation. De Alfaro and Henzinger propose to use block algebras to describe the relation between
components and interfaces [8]. Block algebras are mathematical structures that are used to model a system
as a hierarchical interconnection of blocks. Blocks are further classified as components and interfaces.
Informally, components are descriptions of blocks that say what the block does. Conversely, interfaces
are descriptions of blocks that state the expectations that the block has with respect to its environment.
This distinction is based upon the observation that physical components do something in any possible
environment, whether they behave well or misbehave. In contrast, interfaces describe for each block the
environments that can correctly work with the block. Several different kinds of block algebras have been
developed for synchronous models, real-time models, and resource models, each carrying a particular
notion of compatibility [7, 20–22]. The authors, however, limit their study to questions of compatibility,
and do not address the problem of synthesizing adapters.
The solution to the problem of protocol synthesis in an abstract setting will be discussed in more
detail in Section 23.4, along with the presentation of the relevant related work. Informally, the problem
is formulated as an equation of the form P1 | C | P2 ⊑ G, where P1 and P2 are the incompatible
protocols, C the protocol converter, and G a global specification that defines the terms of the transactions.
The operator | represents the operation of composition, while the relation ⊑ expresses the notion of
conformance to the specification. This problem was first addressed by Larsen and Xinxin in the framework
of process algebra [23]. The solution is derived constructively by building a special form of transition
system. More recently, Yevtushenko et al. [24] presented a formulation of the problem in terms of languages
(sets of sequences of actions) under various kinds of composition operators. By working directly with
languages, the solution can then be specialized to different specific representations, including automata
and finite state machines. Finally, Passerone generalizes the solution by representing the models as abstract
algebras, and derives the conditions that guarantee the existence of a solution [12].
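On finite trace sets this equation admits a toy reading (our illustration only: composition is taken as plain trace intersection over one shared alphabet and conformance as containment, whereas the formulations cited above use richer composition operators).

```python
# Toy reading of the equation P1 | C | P2 ⊑ G over finite sets of traces
# (tuples of actions). With a single shared alphabet, composition
# degenerates to set intersection and conformance to containment; both
# choices are simplifications made for this illustration.

def conforms(p1, c, p2, g):
    """Check P1 | C | P2 ⊑ G: every composed trace must lie in G."""
    return (p1 & c & p2) <= g

def largest_converter(p1, p2, g, universe):
    """Largest C solving the equation: keep every trace except those that,
    composed with P1 and P2, would fall outside G."""
    return {t for t in universe if t not in (p1 & p2) or t in g}
```

Any subset of `largest_converter(...)` also conforms, mirroring the fact that the language-level solution is the largest one.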
23.3 Automata-Based Converter Synthesis
We introduce the problem of interface specification and protocol conversion by way of an example. We first
set up the conversion problem for send–receive protocols, where the sender and the receiver are specified as
automata. A third automaton, the requirement, is also introduced to specify constraints on the converter,
such as buffer size and the possibility of message loss. We then solve the protocol conversion by manually
(of course, the procedure is easy to automate!) deriving an adapter that conforms to both the protocols
and the requirements. Section 23.4.1 will discuss an algebraic solution to the same problem.
23.3.1 Interface Specification
A producer and a consumer component wish to communicate some complex data across a communication
channel. They both partition the data into two parts. The interface of the producer is defined so that it
can wait an unbounded amount of time between the two parts. Because the sender has only outputs, this
is equivalent to saying that the interface does not guarantee to its environment that the second part will
follow the first within a fixed finite time. On the other hand, the interface of the consumer is defined so that
it requires that once the first part has been received, the second is also received during the state transition
that immediately follows the first. Because the receiver has only inputs, this specification corresponds to
an assumption that the receiver makes on the set of possible environments that it can work with. Clearly,
the two protocols are incompatible. In fact, the sender may elect to send the first part of the data and then
wait for some time before sending the second part. Upon receiving the first part, the receiver will, however,
assume that the second part will be delivered right away. Since this is not the case, a protocol violation
will occur. In other words, the guarantees of the sender are not sufficient to prove that the assumptions
of the receiver are always satisfied. Thus a direct composition would result in a possible violation of the
protocols. Because no external environment can prevent this violation (the system has no inputs after the
composition), an intermediate converter must be inserted to make the communication possible. Below,
we illustrate how to synthesize a converter that enables sender and receiver to communicate correctly.
The two protocols can be represented by the automata shown in Figure 23.1. There, the symbols a and b
(and their primed counterparts) are used to denote the first and the second part of the data, respectively.
The symbol ⊥ denotes instead the absence or irrelevance of the data. In other words, it acts as a don't care.
Figure 23.1(a) shows the producer protocol. The self-loop in state 1 indicates that the transmission of
a can be followed by any number of cycles before b is also transmitted. We call this protocol handshake
because it could negotiate when to send the second part of the data. After b is transmitted, the protocol
returns to its initial state, and is ready for a new transaction.
Figure 23.1(b) shows the receiver protocol. Here state 1 does not have a self-loop. Hence, once a has
been received, the protocol assumes that b is transmitted in the cycle that immediately follows. This
protocol is called serial because it requires a and b to be transferred back-to-back. Similarly to the sender
protocol, once b is received the automaton returns to its initial state, ready for a new transaction.
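The two protocols can be transcribed directly as transition tables. In the Python sketch below, '-' stands in for the don't-care symbol, and the tables are our reading of Figure 23.1 (in particular, we also let both automata idle in their initial state); a word with a wait cycle between a and b then separates the two behaviors.

```python
# Handshake (sender) and serial (receiver) protocols of Figure 23.1 as
# deterministic automata: dict (state, symbol) -> next state, with '-'
# as the don't-care symbol. The tables are our transcription of the
# figure; we also allow both automata to idle in state 0.

HANDSHAKE = {(0, '-'): 0, (0, 'a'): 1, (1, '-'): 1, (1, 'b'): 0}
SERIAL    = {(0, '-'): 0, (0, 'a'): 1, (1, 'b'): 0}  # no self-loop in state 1

def accepts(delta, word, state=0):
    """Run the automaton on the word; reject on a missing transition."""
    for sym in word:
        if (state, sym) not in delta:
            return False
        state = delta[(state, sym)]
    return True

# The handshake sender may insert a wait cycle between a and b; the serial
# receiver has no transition for that cycle, so a direct connection fails.
```

Concretely, `accepts(HANDSHAKE, ['a', '-', 'b'])` holds while `accepts(SERIAL, ['a', '-', 'b'])` does not, which is exactly the incompatibility the converter must resolve.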
We have used nonprimed and primed versions of the symbols in the alphabet of the automata to
emphasize that the two sets of signals are different and should be connected through a converter. It is
the specification (below) that defines the exact relationships that must hold between the elements of
the two alphabets. Note that in the definition of the two protocols nothing relates the quantities of one
[Figure 23.1 diagrams: two two-state automata (states 0 and 1) over the symbols a and b.]
FIGURE 23.1 (a) Handshake and (b) serial protocols. (From Roberto Passerone, Luca de Alfaro, Thomas A.
Henzinger, and Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on
Computer-Aided Design (ICCAD'02), November 2002. With permission. Copyright 2002 IEEE.)
(a and b) to those of the other (a′ and b′). The symbol a could represent the toggling of a signal, or
could symbolically represent the value of, for instance, an 8-bit variable. It is only in the interpretation of
the designer that a and a′ actually hold the same value. The specification that we are about to describe
does not enforce this interpretation, but merely defines the (partial) order in which the symbols can be
presented to and produced by the converter. It is possible to explicitly represent the values passed; this is
necessary when the behavior of the protocols depends on the data, or when the data values provided by
one protocol must be modified (translated) before being forwarded to the other protocol. The synthesis of
a protocol converter would then yield a converter capable of both translating data values, and of modifying
their timing and order. However, the price to pay for the ability to synthesize data translators is the state
explosion in the automata that describe the interfaces and the specification.
Observe also that if a and b are symbolic representations of data, some other means must be available
in the implementation to distinguish when the actual data corresponds to a or to b. At this level of the
description we do not need to be specific: we simply assume that the sender has a way to distinguish
whether the symbol a or the symbol b is being produced, and the receiver has a way to distinguish whether
a′ or b′ is being provided. Examples of methods include toggling bits, or using data fields to specify
message types. However, we do not want to be tied to any particular method at this time.
23.3.2 Requirements Specification
What constitutes a correct transaction? Or, in other words, what properties do we want the communication
to have? In the context of this particular example the answer seems straightforward. Nonetheless,
different criteria could be enforced depending on the application. Each criterion is embodied by a different
specification.
One example of a specification is shown in Figure 23.2. The alphabet of the automaton is derived from
the Cartesian product of the alphabets of the two protocols for which we want to build a converter. This
specification states that no symbols should be discarded or duplicated by the converter, and symbols must
be delivered in the same order in which they were received; moreover, the converter can store at most one
undelivered symbol at any time. The three states in the specification correspond to three distinct cases:
State 0 denotes the case in which all received symbols have been delivered (or that no symbol has
been received, yet).
[Figure 23.2 diagram: three-state automaton (states 0, a, b) over pairs of sent/delivered symbols.]
FIGURE 23.2 Specification automaton. (From Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and
Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on Computer-Aided
Design (ICCAD'02), November 2002. With permission. Copyright 2002 IEEE.)
State a denotes the case in which symbol a has been received, but it has not been output yet.
Similarly, state b denotes the case in which symbol b has been received, but not yet output.
Note that this specification is not concerned with the particular form of the protocols being considered
(or else it would itself function as the converter); for example, it does not require that the symbols a or b
be received in any particular order (other than the one in which they are sent). On the other hand, the
specification makes precise what the converter can, and cannot, do, ruling out, for instance, converters
that simply discard all input symbols from one protocol, never producing any output for the destination
protocol. In fact, the specification admits the case in which a and b are transferred in the reversed order.
It also does not enforce that a and b always occur in pairs, and admits a sequence of a's without intervening
b's (or vice versa). The specification merely asserts that a′ should occur no earlier than a (an ordering
relation), and that a′ must occur whenever a new a or b occurs. In fact, we can view the specification as
an observer that specifies what can happen (a transition on some symbol is available) and what should not
happen (a transition on some symbol is not available). As such, it is possible to decompose the specification
into several automata, each one of which specifies a particular property that the synthesized converter
should exhibit. This is similar to the monitor-based property specification proposed by Shimizu et al. [11]
for the verification of communication protocols. In our work, however, we use the monitors to drive the
synthesis so that the converter is guaranteed to exhibit the desired properties (correct-by-construction).
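Rather than transcribing the automaton of Figure 23.2 edge by edge, the sketch below (Python; the pair encoding and the same-cycle pass-through convention are our assumptions) implements the property the specification encodes: the converter must behave as a one-place FIFO. Each step of a trace is a pair (received, delivered), with '-' meaning no symbol.

```python
# Observer for the one-place-buffer specification: the trace is rejected
# if the converter would drop, duplicate, reorder, or over-buffer symbols.
# Convention (our assumption): deliveries empty the buffer slot before the
# symbol received in the same cycle occupies it, and a symbol may pass
# straight through when the buffer is empty.

def observe(trace):
    """Return True iff the trace satisfies the one-place-buffer spec."""
    buf = None  # the single undelivered symbol, if any
    for received, delivered in trace:
        if delivered != '-':
            if buf == delivered:
                buf = None            # deliver the buffered symbol
            elif buf is None and received == delivered:
                received = '-'        # same-cycle pass-through
            else:
                return False          # wrong order, duplicate, or nothing to send
        if received != '-':
            if buf is not None:
                return False          # would need more than one buffer slot
            buf = received
    return True
```

For instance, receiving a and delivering it a cycle later is accepted, as is delivering a while b arrives in the same cycle; receiving b while a is still undelivered, or delivering a symbol that was never received, is rejected.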
A high-level view of the relationship between the protocols and the specification is presented in
Figure 23.3. The protocol handshake produces outputs a and b; the protocol serial accepts inputs a′ and b′.
The specification accepts inputs a, b, a′, b′, and acts as a global observer that states what properties the
converter should have. Once we compose the two protocols and the specification, we obtain a system
with outputs a, b, and inputs a′, b′ (Figure 23.3). The converter will have inputs and outputs exchanged:
a and b are the converter inputs, and a′, b′ its outputs.
23.3.3 Synthesis
The synthesis of the converter begins with the composition (product machine) of the two protocols, shown in Figure 23.4. Here the direction of the signals is reversed: the inputs to the protocols become the outputs
FIGURE 23.3 Inputs and outputs of protocols, specification, and converter. (From Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD'02), November 2002. With permission. © 2002 IEEE.)
FIGURE 23.4 Composition between handshake and serial. (From Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD'02), November 2002. With permission. © 2002 IEEE.)
of the converter, and vice versa. This composition is also a specification for the converter, since on both sides the converter must comply with the protocols that are being interfaced. However, this specification does not have the notion of synchronization (partial order, or causality constraint) that the specification discussed above dictates.

We can ensure that the converter satisfies both specifications by taking the converter to be the composition of the product machine with the specification, and by removing transitions that violate either protocol or the correctness specification. Figure 23.5 through Figure 23.7 explicitly show the steps that we go through to compute this product. The position of the state reflects the position of the corresponding
FIGURE 23.5 Converter computation, phase 1. (From Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD'02), November 2002. With permission. © 2002 IEEE.)
state in the protocol composition, while the label inside the state represents the corresponding state in the specification. Observe that the bottom-right state is reached when the specification goes back to state 0. This procedure corresponds to the synthesis algorithm proposed in Reference 10. The approach here is, however, fundamentally different: the illegal states are defined by the specification, and not by the particular algorithm employed.
The initial step is shown in Figure 23.5. The composition with the specification makes the transitions depicted in dotted lines illegal (if taken, the specification would be violated). However, transitions can be removed from the composition only if doing so does not result in an assumption on the behavior of the sender. In Figure 23.5, the transition labeled τ/a′ leaving state 0 can be removed because the machine can still respond to a τ input by taking the self-loop, which is legal. The same applies to the transition labeled b/b′ leaving state a, which is replaced by the transition labeled b/a′. However, removing the transition labeled τ/b′ leaving the bottom-right state would make the machine unreceptive to input τ. Equivalently, the converter is imposing an assumption on the producer that τ will not occur in that state. Because this assumption is not verified, and because we cannot change the producer, we can only avoid the problem by making the bottom-right state unreachable, and removing it from the composition.
The result is shown in Figure 23.6. The transitions that are left dangling because of the removal of the
state should also be removed, and are now shown in dotted lines. The same reasoning as before applies,
and we can only remove transitions that can be replaced by others with the same input symbol. In this
case, all illegal transitions can be safely removed.
The resulting machine shown in Figure 23.7 now has no illegal transitions. This machine complies both with the specification and with the two protocols, and thus represents the correct conversion (correct relative to the specification). Notice how the machine at first stores the symbol a without sending it (transition a/τ). Then, when b is received, the machine sends a′, immediately followed in the next cycle by b′, as required by the serial protocol.
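The pruning loop described above can be sketched in a few lines of Python. The tuple encoding of states and transitions is an assumption of this sketch, not the chapter's data structures: an illegal transition is dropped only when another transition on the same input survives; otherwise the source state is made unreachable, and transitions left dangling by a state removal are handled the same way on the next pass.

```python
def prune(states, transitions, illegal):
    """states: set of names; transitions: set of (src, inp, out, dst) tuples;
    illegal: the transitions that violate the correctness specification."""
    states, trans, illegal = set(states), set(transitions), set(illegal)
    changed = True
    while changed:
        changed = False
        for t in sorted(trans & illegal):
            src, inp, out, dst = t
            # is there a surviving legal transition from src on the same input?
            has_alternative = any(
                s == src and i == inp and (s, i, o, d) not in illegal
                for (s, i, o, d) in trans)
            if has_alternative:
                trans.discard(t)
                changed = True
            elif src in states:
                # removing t would make src unreceptive on inp:
                # make the state unreachable instead
                states.discard(src)
                changed = True
        # transitions left dangling by a state removal must also go,
        # subject to the same receptiveness check on the next pass
        dangling = {t for t in trans
                    if t[0] not in states or t[3] not in states}
        if dangling - illegal:
            illegal |= dangling
            changed = True
    return states, trans - illegal

# toy example: state B's only transition on input x is illegal, so B is dropped
states, trans = prune(
    {"A", "B"},
    {("A", "x", "1", "A"), ("A", "x", "2", "B"), ("B", "x", "1", "B")},
    {("A", "x", "2", "B"), ("B", "x", "1", "B")})
assert states == {"A"}
assert trans == {("A", "x", "1", "A")}
```

The fixed-point iteration mirrors the figures: phase 1 removes replaceable illegal transitions and unreceptive states, phase 2 cleans up the dangling transitions, and phase 3 is the result.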
FIGURE 23.6 Converter computation, phase 2. (From Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD'02), November 2002. With permission. © 2002 IEEE.)
FIGURE 23.7 Converter computation, phase 3. (From Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli. In Proceedings of the IEEE/ACM International Conference on Computer-Aided Design (ICCAD'02), November 2002. With permission. © 2002 IEEE.)
23.4 Algebraic Formulation
The problem of converter synthesis can be seen as a special case of the more general problem of the synthesis of a local specification, shown in Figure 23.8 (also known as the unknown component problem). Here, we are given a global specification G and a partial implementation, called a context, which consists of the composition of several modules, such as P1 and P2. The implementation is only partially specified, and is completed by inserting an additional module X to be composed with the rest of the context. The problem consists of finding a local specification L for X, such that if X implements L, then the full implementation I implements the global specification G. If we denote with ⊑ the implementation relation, then the local specification synthesis problem can be expressed as solving the following inequality for the variable X:

P1 ∥ X ∥ P2 ⊑ G
The problem of local specification synthesis is very general and can be applied to a variety of situations. One area of application is, for example, that of supervisory control synthesis [25]. Here a plant is used as the context, and a control relation as the global specification. The problem consists of deriving the appropriate control law to be applied in order for the plant to follow the specification. Engineering changes is another area, where modifications must be applied to part of a system in order for the entire system to satisfy a new specification. This procedure is also known as rectification. Note that the same rectification procedure could be used to optimize a design. Here, however, the global specification is unchanged, while the local specification represents all the possible admissible implementations of an individual component of the system, thus exposing its full flexibility [26].

In the case of converter synthesis, the context consists of the protocols that must be connected, while the specification may simply insist that data be passed from one side to the other within a set of requirements. In this case the local specification describes the additional element in the implementation required to make the communication possible, that is, the converter.

The literature on techniques to solve the local specification synthesis problem is vast. Here we focus on three of the proposed techniques and highlight in particular their differences in scope and aim.

Larsen and Xinxin [23] solve the problem of synthesizing the local specification for a system of equations in a process algebra. In order to represent the flexibility in the implementation, the authors introduce the Disjunctive Modal Transition System (DMTS). Unlike traditional labeled transition systems, the DMTS model includes two kinds of transitions: transitions that may exist and transitions that must exist. The transitions that must exist are grouped into sets, of which only one is required in the implementation. In other words, the DMTS is a transition system that admits several possible implementations in terms of traditional transition systems.
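The may/must reading of a DMTS can be sketched as follows. This is a deliberately simplified, global interpretation of the conditions (actual DMTS refinement is defined state by state over reachable states); the names and encodings are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class DMTS:
    may: set    # allowed transitions: {(state, label, state'), ...}
    must: list  # disjunctive requirements: [(state, {(label, state'), ...}), ...]

def implements(lts, d):
    """lts: a plain labeled transition system as a set of (state, label, state').
    Every transition taken must be allowed, and at least one option out of
    every disjunctive must-set has to be present."""
    return (lts <= d.may and
            all(any((s, l, t) in lts for (l, t) in options)
                for (s, options) in d.must))

d = DMTS(may={("p", "a", "q"), ("p", "b", "q"), ("q", "a", "p")},
         must=[("p", {("a", "q"), ("b", "q")})])

assert implements({("p", "a", "q")}, d)      # one option of the must-set is taken
assert not implements({("q", "a", "p")}, d)  # the must-set at p is left unsatisfied
```

The disjunctive must-set is what distinguishes the DMTS from an ordinary modal transition system: the implementation must provide at least one, but not necessarily all, of the listed transitions.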
FIGURE 23.8 Local specification synthesis.
The system is solved constructively. Given a context and a specification, the authors construct a DMTS whose implementations include all and only the solutions to the equation. To do so, the context is first translated from its original equational form into an operational form where a transition includes both the consumption of an event from the unknown component, and the production of an event. The transitions of the context and of the specification are then considered in pairs to deduce whether the implementation may or may not take certain actions. A transition is possible, but not required, in the solution whenever the context does not activate such a transition. In that case, the behavior of the solution may be arbitrary afterwards. A transition is required whenever the context activates the transition, and the transition is used to match a corresponding transition in the specification. A transition is not allowed in the solution (thus it is neither possible, nor required) whenever the context activates it, and the transition is contrary to the specification.
The solution proposed by Larsen and Xinxin has the advantage that it provides a direct way of computing the set of possible implementations. On the other hand, it is specific to one model of computation (transition systems). Yevtushenko et al. [24] present a more general solution where the local specification is obtained by solving abstract equations over languages under various kinds of composition operators. By working directly with languages, the solution can then be specialized to different kinds of representations, including automata and finite state machines.

In the formalism introduced by Yevtushenko et al., a language is a set of finite strings over a fixed alphabet. The particular notion of refinement (or implementation) proposed in this work corresponds to language containment: a language P refines a language Q if and only if P ⊆ Q. If we denote with ¬P the complementation of the language P (i.e., ¬P is the language that includes all the finite strings over the alphabet that are not in P), then the most general solution to the equation in the variable X

A • X ⊆ C

is given by the formula

S = ¬(A • ¬C)

The language S is called the most general solution because a language P is a solution of the equation if and only if P ⊆ S. In the formulas above, the operator • can be replaced by different flavors of parallel composition, including synchronous and asynchronous composition. These operators are both constructed as a series of an expansion of the alphabet of the languages, followed by a restriction. For the synchronous composition, the expansion and the restriction do not alter the length of the strings of the languages to which they are applied. Conversely, expansion in the asynchronous composition inserts arbitrary substrings of additional symbols, thus increasing the length of the sequence, while the restriction discards the unwanted symbols while shrinking the string.
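The formula can be exercised on a toy instance in which composition is plain intersection over a shared alphabet (the simplest synchronous case); the languages chosen here are arbitrary illustrations, not from the chapter.

```python
from itertools import product

def universe(alphabet, n):
    """All strings over `alphabet` of length at most n (including the empty one)."""
    return {"".join(w) for k in range(n + 1)
            for w in product(alphabet, repeat=k)}

def most_general_solution(A, C, U):
    """S = not(A . not(C)), with composition taken to be set intersection."""
    complement = lambda L: U - L
    return complement(A & complement(C))

U = universe("ab", 3)
A = {w for w in U if w.startswith("a")}        # the context
C = {w for w in U if w.count("b") <= 1}        # the specification
S = most_general_solution(A, C, U)

assert A & S <= C                              # S is a solution
for X in (set(), C, U - A, (U - A) | C):       # and it contains every solution
    if A & X <= C:
        assert X <= S
```

With intersection as composition the formula collapses to S = ¬A ∪ C: a string is admissible in the unknown component either because the context rules it out anyway, or because it already satisfies the specification.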
The language equations are then specialized to various classes of automata, including finite automata and finite state machines. This provides an algorithmic way of solving the equation for restricted classes of languages (i.e., those that can be represented by the automaton). The problem in this case consists of proving certain closure properties that ensure that the solution can be expressed in the same finite representation as the elements of the equation. In particular, the authors consider the problem of receptiveness (there called I-progression) and prefix closure.
A similar solution is proposed in the framework of Agent Algebra by Passerone et al. [12, 27]. The
approach is, however, more general, and does not make any particular assumption about the form that
the protocols or the specification can take. In other words, the solution is not limited to protocols represented as languages over an alphabet, or as transition systems. This is similar to the block algebras proposed by de Alfaro and Henzinger (see Section 23.2). There is, however, a fundamental difference in the
way interfaces and components interact. In de Alfaro and Henzinger, the distinction between interfaces
and components seems to ultimately arise from the fact that components, by making no assumptions,
are unable to constrain their environment. For this reason, components are often called input-enabled,
or receptive. Interfaces, on the other hand, constrain the environment by failing to respond to some
of their possible inputs. Receptiveness and environment constraints are not, however, mutually exclusive. The two notions coexist, and are particularly well behaved, in the so-called trace-based models such as Dill's trace structures [9] and Negulescu's Process Spaces [28, 29]. We refer to these models as two-set trace models. In two-set trace models, traces, which are individual executions of a component, are classified as either successes or failures. In order for a system to be failure-free, the environment of each component must not exercise the failure traces. Failure traces therefore represent the assumptions that a component makes relative to its environment. However, the combination of failure and success traces makes the component receptive. Agent Algebras generalize these concepts by shifting the notion of compatibility from the individual executions to the components themselves. The interface models proposed by de Alfaro and Henzinger can easily be seen in these terms. For example, interface automata [7] can be explained almost exactly in terms of the prefix closed trace structures of Dill [9]. In particular, the composition operator in interface automata is an implementation of Dill's autofailure manifestation and failure exclusion. Therefore, Agent Algebras do not distinguish between the notion of an interface and a component. Or, to be more precise, the distinction between a component and its interface has only to do with a difference in the level of abstraction, rather than with a difference in their nature.
In Agent Algebra, the problem of local specification synthesis, and therefore of protocol conversion, is set up as usual as the equation

proj(A)(P1 ∥ P2 ∥ X) ⊑ G

Note that here the operation of restriction on the alphabet is not part of the composition and is made explicit by the operator proj(A), whose effect is to retain only the elements of the alphabet that are contained in the set A. The solution to the equation is expressed in the form

C ⊑ mirror(proj(A)(P1 ∥ P2 ∥ mirror(G)))

where mirror is a generalized complementation operation whose form depends on the particular model of computation and on its notion of compatibility. The details of the derivation of this solution are outside the scope of this chapter [12]. Instead, we only concentrate on protocols represented as two-set trace structures.
23.4.1 Trace-Based Solution
Two-set trace structures are particularly well suited to modeling behavioral interfaces and protocols. The set of failure traces, in fact, states the conditions of correct operation of a component. They can therefore be interpreted as assumptions that components make relative to their environment. Two components are compatible whenever they respect those assumptions, that is, they do not engage in behaviors that make the other component fail. Interface protocols can often be described in this way. The transactions that do not comply with the protocol specification are considered illegal, and therefore result in an incorrect operation of the component that implements the protocol. The solution to the protocol conversion problem described in Section 23.3 requires that we develop a trace-based model of a synchronous system. The model that we have in mind is essentially identical to the synchronous models proposed by Burch [30] and Wolf [31]. For our simple case, an individual execution of a component (a trace) is a sequence of actions from the alphabet A = {τ, a, b, a′, b′}, where τ denotes the absence of an action. Each component T consists of two sets of traces S and F, corresponding to the successes and the failures, respectively. A projection, or hiding of signals, in a trace can be obtained by replacing everywhere in the trace the actions to be hidden by the special value τ, denoting the absence of any action. In this way, while we abstract away the information about the signal, we do retain the cycle count, ensuring that the model is synchronous. For instance,

proj({a})(⟨a, b, a, τ, b, a, b, b, a, …⟩) = ⟨a, τ, a, τ, τ, a, τ, τ, a, …⟩
where the argument of the projection lists the signals that must be retained. The operation of projection
is applied to all success and all failure traces of a component.
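Such a projection amounts to a per-position substitution, which can be sketched in a few lines (the tuple encoding of traces is an assumption of this sketch):

```python
TAU = "τ"  # the silent action, standing in for the absence of an action

def proj(keep, trace):
    """Retain the actions in `keep`; hide everything else as the silent action."""
    return tuple(x if x in keep else TAU for x in trace)

# mirrors the example above: hiding b preserves the cycle count
assert proj({"a"}, ("a", "b", "a", TAU, "b", "a")) == \
       ("a", TAU, "a", TAU, TAU, "a")
```

Because every hidden action is replaced rather than deleted, the projected trace has the same length as the original, which is exactly what keeps the model synchronous.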
Parallel composition is more complex. A trace is a possible execution of a component whether it is a success or a failure. It is not a possible execution if it is neither a success nor a failure. If T1 and T2 are two components, then their parallel composition should contain all and only those traces that are possible executions of both T1 and T2. One such trace will be a success of the composition if it is a success of both T1 and T2. However, a trace is a failure of the composite if it is a possible trace of one component, and it is also a failure of the other component. Note that if a trace is a failure of one component, but it is not a possible trace of the other component (i.e., it is neither a success nor a failure of the other component), then the trace does not appear as a failure of the composite (in fact, it is not a trace of the composition at all). This is because, in the interaction, the particular behavior that results in that failure will never be exercised, as it is ruled out by the other component. Formally, if T1 = (S1, F1) and T2 = (S2, F2), then the parallel composition T = T1 ∥ T2 is given by

T = (S1 ∩ S2, (F1 ∩ (S2 ∪ F2)) ∪ (F2 ∩ (S1 ∪ F1)))

If the two components do not share the same alphabet, parallel composition must also include an intermediate step of projection or inverse projection to equalize the signals. Because the length of the sequence is retained during a projection, parallel composition results in a lock step execution of the components.
Because components consist of two sets of executions, the relation of implementation cannot be reduced to a simple set containment. Instead, a component T implements another component T′ if all the possible behaviors of T are also possible behaviors of T′, and if T fails less often than T′. This ensures that replacing T for T′ does not produce any additional failure in the system. Formally, T ⊑ T′ whenever

S ∪ F ⊆ S′ ∪ F′ and F ⊆ F′

The operation of complementation, or mirroring, must also take successes and failures into account. The complement of T is defined as the most general component that can be composed with T without generating any failure. Given the definitions of composition and the implementation relation, the mirror of T is defined as

mirror(T) = (S ∩ ¬F, ¬(S ∪ F))

In other words, the possible behaviors of mirror(T) include all behaviors that are not failures of T. Of those, the successes of T are also successes of its mirror. It is easy therefore to verify that the composition of a component with its complement always has an empty set of failures.
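Over a fixed finite universe of candidate traces, these operations can be sketched directly on pairs of sets. The encoding and the example traces are hypothetical illustrations, not taken from the chapter.

```python
def compose(t1, t2):
    """Parallel composition of two-set trace structures (S, F)."""
    (s1, f1), (s2, f2) = t1, t2
    successes = s1 & s2
    failures = (f1 & (s2 | f2)) | (f2 & (s1 | f1))
    return (successes, failures)

def refines(t, tp):
    """T implements T': no new possible behaviors, no new failures."""
    (s, f), (sp, fp) = t, tp
    return (s | f) <= (sp | fp) and f <= fp

def mirror(t, universe):
    """The most general component composable with T without failures."""
    s, f = t
    return (s - f, universe - (s | f))

U = {"x", "y", "z", "w"}
T = ({"x"}, {"y"})          # x succeeds, y fails, z and w are not possible
M = mirror(T, U)
assert compose(T, M)[1] == set()   # composing with the mirror never fails
assert refines(T, T)
```

The failure set of the mirror is exactly the set of traces that are not possible for T, so any behavior the mirror assumes away is one T could never exercise.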
The two protocols and the correctness specification of the example of Section 23.3 are easily represented as two-set models. In fact, sets of traces can be represented using automata as recognizers. However, for each component, we must represent two sets. This can be accomplished in the automaton by adding failure states that accept the failure traces. For the particular example presented in Section 23.3, we can still use the automata shown in Figure 23.1 and Figure 23.2. Note that we do not need to add failures to either the sender protocol or to the specification, since they have only outputs and therefore do not constrain the environment in any way. The receiver, on the other hand, must be augmented with a state representing the failure traces. A transition to this additional state is taken from each state on all the inputs for which an action is not already present. In this case, if P1 is the sender protocol, P2 the receiver, C the converter, and G the specification, we may compute the converter by setting up the following local specification synthesis problem:
P1 ∥ P2 ∥ C ⊑ G

The solution is therefore

C ⊑ mirror(P1 ∥ P2 ∥ mirror(G))
Note that projections are not needed in this case, since the alphabet is always A = {τ, a, b, a′, b′}, which is also the alphabet of C. The solution to the problem thus consists of taking the complement of the global specification, composing it with the context (i.e., the two protocols), and complementing the result. After taking the complementation, the resulting component may not be receptive. This can be avoided by applying the operations of autofailure manifestation and failure exclusion, similarly to the synchronous trace structure algebra of Wolf [31], before computing the mirror. A state is an autofailure if all its outgoing transitions are failures. In that case, the state can be bypassed by directing its incoming transitions to the outgoing failure state. Failure exclusion, instead, results in the removal of successful transitions whenever they are matched by a corresponding failure transition on the same input in the same state. The complementation can then be most easily done by first making the automaton deterministic (note, however, that this is a potentially expensive computation). For a deterministic and receptive automaton the mirror can be computed by removing the existing outgoing failure transitions of each state and by adding transitions to a new failure state for each of the input actions that does not already result in a success. When doing so in the example above, we obtain exactly the result depicted in Figure 23.7, with additional failure transitions that stand to represent the flexibility in the implementation. In particular, the state labeled 0 in Figure 23.7 has failure transitions on input b, the state labeled 1 on input a, and the state labeled 2 on input b. This procedure is explained in more detail below.
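For a deterministic and receptive automaton, the two rewrites of the mirror step can be sketched as below. The dictionary encoding of the transition relation and the state names are assumptions of this sketch.

```python
def mirror_automaton(delta, inputs, fail_states):
    """delta: dict mapping (state, input) -> next_state.
    Drop the transitions that enter a failure state, then restore
    receptiveness by sending every uncovered input to a fresh failure state."""
    fresh = "F'"
    mirrored = {k: v for k, v in delta.items() if v not in fail_states}
    live = {s for (s, _) in delta if s not in fail_states}
    for s in live | {fresh}:
        for i in inputs:
            mirrored.setdefault((s, i), fresh)
    return mirrored, {fresh}

# toy two-state protocol: failures on b in state 0 and on a in state 1
delta = {("0", "a"): "1", ("0", "b"): "F",
         ("1", "a"): "F", ("1", "b"): "0"}
m, fails = mirror_automaton(delta, {"a", "b"}, {"F"})
assert m[("0", "b")] == "F'" and m[("1", "a")] == "F'"
assert m[("0", "a")] == "1" and m[("1", "b")] == "0"
```

The swap captures the change of perspective: what were failure traces of the component (assumptions on its environment) become inputs that the mirror itself must be prepared to reject.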
23.4.2 End-to-End Specication
A potentially better approach to protocol conversion consists of changing the topology of the local specification problem, by providing a global specification that extends end to end from the sender to the receiver, as shown in Figure 23.9. The global specification in this case may be limited to talking about the behavior of the communication channel as a whole, and would be independent of the particular signals employed internally by each protocol. In addition, in a scenario where the sender and the receiver function as layers of two communicating protocol stacks, the end-to-end behavior is likely to be more abstract, and therefore simpler to specify, than the inner information exchange.

We illustrate this case by modifying the previous example. In order to change the topology, the sender and receiver protocols must be modified to include inputs from (for the sender) and outputs to (for the receiver) the environment. This is necessary to let the protocols receive and deliver the data transmitted over the communication channel, and to make it possible to specify a global behavior. In addition to adding connections to the environment, in this example we also explicitly model the data. Thus, unlike the previous example where the specification only required that a certain ordering relationship on the data be satisfied, we can here express true correctness by specifying that if a value is input to the system, the same value is output by the system at the end of the transaction. Since the size of the state space of the automata increases exponentially with the size of the data, we will limit the example to the communication of a two-bit integer value. Abstraction techniques must be used to handle larger problems. To make the example more interesting, we modify the protocols so that the sender serializes the least significant bit
FIGURE 23.9 End-to-end specification.
FIGURE 23.10 The sender protocol.
first, while the receiver expects the most significant bit first. In this case, the converter will also need to reorder the sequence of the bits received from the sender.

All signals in the system are binary valued. The protocols are simple variations of the ones depicted in Figure 23.1. The inputs to the sender protocol include a signal ft that is set to 1 when data is available, and two additional signals that encode the two-bit integer to be transmitted. The outputs also include a signal st that clocks the serial delivery of the data, and one signal sd for the data itself. The sender protocol is depicted in Figure 23.10. We adopt the convention that a signal is true in the label of a transition when it appears with its original name, and it is false when its name is preceded by an n. Hence, for example, ft implies that ft = 1, and nft that ft = 0. The shaded state labeled F in the automaton accepts the failure traces, while the rest of the states accept the successful traces. Note that the protocol assumes that the environment refrains from sending new data while in the middle of a transfer. In addition, the protocol may wish to delay the transmission of the second bit of the data for as many cycles as desired.
Similarly, the receiver protocol has inputs rt and rd, where rt is used to synchronize the start of the serial transfer with the other protocol; the output tt finally informs the environment when new data is available. The receiver protocol is depicted in Figure 23.11. The receiver fails if the second bit of the data is not received within the clock cycle that follows the delivery of the first bit.

The automaton for the global specification is shown in Figure 23.12. The global specification has the same inputs as the sender protocol, and the same outputs as the receiver protocol. A trace is successful if a certain value is received on the sender side, and the same value is emitted immediately or after an arbitrary delay on the receiver side. Analogously to the sender protocol, the specification fails if a new data value is received while the old value has not been delivered yet.
Following the same notation as the previous example, the solution to the conversion problem can be stated as

C ⊑ mirror(proj({st, sd, rt, rd})(P1 ∥ P2 ∥ mirror(G)))

The projection is now essential to scope down the solution to only the signals that concern the conversion algorithm. The components must again be receptive, therefore similar considerations as those expressed before for the computation of the mirror apply. In particular, autofailure manifestation and failure exclusion are applied before computing the mirror. The automaton is also made deterministic if necessary.
FIGURE 23.11 The receiver protocol.
FIGURE 23.12 The global specification.
FIGURE 23.13 The local converter specification.
The result of the computation is shown in Figure 23.13, where, for readability, the transitions that lead to
the failure states have been displayed in dotted lines. The form of the result is essentially identical to that of
Figure 23.7. Note how the converter switches the position of the most and the least signicant bit of the data
during the transfer. In this way the converter makes sure that the correct data is transferred from one end
to the other. Note, however, that the new global specication (Figure 23.12) had no knowledge whatsoever
of how the protocols were supposed to exchange data. Failure traces again express the exibility in the
implementation, and at the same time represent assumptions on the environment. These assumption
are guaranteed to be satised (modulo a failure in the global specication), since the environment is
composed of the sender and the receiver protocol, which are known variables in the system.
The solution excludes certain states that lead to a deadlock situation. This is in fact an important side
effect of our specic choice of synchronous model, and has to do with the possibility of combinational
loops that may arise as a result of a parallel composition. When this is the case, the mirror of an otherwise
receptive component may not be receptive. This is because it is perfectly admissible in the model to avoid
a failure by withholding an input, that is, by constraining the environment not to generate an input. But
since the environment is not constrained, this can only be achieved by stopping time before reaching
the deadlock state. Since this would be infeasible in any reasonable physical model, we consider deadlock
states tantamount to an autofailure, and remove them from the nal result. This problem can be solved
by employing a synchronous model that deals with combinational loops directly. This is an aspect of
the implementation that has been extensively studied by Wolf [31], who proposes to use a three-valued
model that includes the usual binary values 0 and 1, and one additional value to represent the oscillating,
or unknown, behavior that results from the combinational loops. Exploring the use of this model in the
context of protocol specication and converter synthesis is part of our future work.
A similar condition may occur when a component tries to guess the future, by speculating on the sequence
of inputs that will be received in the following steps. If the sequence is not received, the component will
find itself in a deadlock situation, unable to roll back to a consistent state. This is again admissible in
2006 by Taylor & Francis Group, LLC
23-18 Embedded Systems Handbook
FIGURE 23.14 The optimized converter (state labels 0, 1, *0, *1, and F; transition labels not reproduced).
our model, but would be ruled out if the right notion of receptiveness were adopted. These states and
transitions are also pruned as autofailures.
The procedure outlined above has been implemented in a prototype application of approximately
2400 lines of C++ code. In the code, we explicitly represent the states and their transitions, while the
formulas in the transitions are represented implicitly using BDDs (obtained from a separate package).
This representation obviously suffers from the problem of state explosion. This is particularly true when
the value of the data is explicitly handled by the protocols and the specification, as already discussed.
A better solution can be achieved if the state space and the transition relation are also represented implicitly
using BDDs. Note, in fact, that most of the time the data is simply stored and passed on by a protocol
specification and is therefore not involved in deciding its control flow. The symmetries that result can
therefore likely be exploited to simplify the problem and make the computation of the solution more
efficient.
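The contrast between the explicit representation used in the prototype and a fully implicit one can be sketched in a few lines. The n-bit "buffered word" protocol below is hypothetical, and a plain Python predicate stands in for a BDD; the point is only that the explicit state set doubles with every data bit, while the implicit description does not grow at all.

```python
from itertools import product

def explicit_states(n_bits):
    """Explicit representation: one state per concrete value of the
    buffered n-bit data word, as in the prototype described above."""
    return [bits for bits in product((0, 1), repeat=n_bits)]

def symbolic_states(n_bits):
    """Implicit representation: the same state set as a characteristic
    predicate (a stand-in for a BDD). Its description is constant-size
    no matter how wide the data word is."""
    return lambda bits: len(bits) == n_bits and all(b in (0, 1) for b in bits)

# Eight data bits already mean 256 explicit states...
assert len(explicit_states(8)) == 256
# ...while the predicate answers membership without enumerating them.
member = symbolic_states(8)
assert member((0,) * 8) and not member((0,) * 7)
```

Because the data never influences the control flow, all 2^n data-carrying states behave identically, which is precisely the symmetry a BDD-based encoding of the state space and transition relation can exploit.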
Note that the converter that we obtain is nondeterministic and could take paths that are slower than
one would expect. This is evident in particular for the states labeled 0 and 1, which can react
to the arrival of the second piece of data by doing nothing, or by transitioning directly to the states *0
and *1, respectively, while delivering the first part of the data. This is because our procedure derives the
full flexibility of the implementation, and the specification depicted in Figure 23.12 does not mandate
that the data be transferred as soon as possible. A faster implementation can be obtained by selecting
the appropriate paths whenever a choice is available, as shown in Figure 23.14. In this case, the converter
starts the transfer in the same clock cycle in which the last bit from the sender protocol is received. Other
choices are also possible. In general, a fully deterministic converter can be obtained by optimizing certain
parameters, such as the number of states or the latency of the computation. More sophisticated techniques
might also try to enforce properties that were not already included in the global specification.
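One simple way to realize this path selection is to resolve each nondeterministic choice in favor of a successor lying on a shortest path to the state where the data has been delivered. The automaton below is a hypothetical miniature with inputs abstracted away, not the converter of Figure 23.14, and all state and action names are invented.

```python
from collections import deque

# Toy nondeterministic converter: in states '0' and '1' it may either
# wait (self-loop) or immediately start the transfer toward '*0'/'*1'.
ndet = {
    'idle': {('idle', 'no-op'), ('0', 'latch-0'), ('1', 'latch-1')},
    '0':    {('0', 'wait'), ('*0', 'start-transfer')},
    '1':    {('1', 'wait'), ('*1', 'start-transfer')},
    '*0':   {('done', 'deliver')},
    '*1':   {('done', 'deliver')},
    'done': {('done', 'no-op')},
}

def distances_to(goal, rel):
    """Backward BFS: minimum number of steps from each state to `goal`."""
    dist = {goal: 0}
    frontier = deque([goal])
    while frontier:
        s = frontier.popleft()
        for p, moves in rel.items():
            if p not in dist and any(q == s for q, _ in moves):
                dist[p] = dist[s] + 1
                frontier.append(p)
    return dist

def determinize_fastest(rel, goal='done'):
    """Resolve each nondeterministic choice by taking a successor on a
    shortest path to `goal`, i.e., optimizing for latency."""
    dist = distances_to(goal, rel)
    return {s: min(moves, key=lambda m: dist.get(m[0], float('inf')))
            for s, moves in rel.items()}

det = determinize_fastest(ndet)
assert det['0'] == ('*0', 'start-transfer')   # never dawdles in state 0
assert det['1'] == ('*1', 'start-transfer')
```

Swapping the cost function (number of states, energy, and so on) in place of the BFS distance yields the other deterministic variants mentioned above.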
23.5 Conclusions
Emerging new design methodologies promote reuse of intellectual property as one of the basic tech-
niques to handle complexity in the design process. In a methodology based on reuse, the components
are predesigned and precharacterized, and are assembled in the system to perform the desired function.
Interface Specification and Converter Synthesis 23-19
System verification thus reduces to the verification of the interaction of the components used in the
system. In this chapter, we have reviewed and explored techniques that are useful to define the interface that
components expose to their environment. These interfaces include not only the basic typing information,
typical of today's programming and hardware description languages, but also the sequencing and behavioral
information that is necessary to verify correct synchronization. The interface specifications of the
components are then used to automatically construct adapters if the components do not already satisfy
each other's requirements. This technique was first presented in the context of automata theory. Later, we
presented similar, but stronger, results in the context of language theory and algebraic specifications.
A simple example was used to illustrate a possible implementation of a converter synthesis algorithm.
Acknowledgments
Several people collaborated on the work described in this chapter, including Jerry Burch,
Alberto Sangiovanni-Vincentelli, Luca de Alfaro, Thomas Henzinger, and James Rowson. The author
would like to acknowledge their contributions.
References
[1] Henry Chang, Larry Cooke, Merrill Hunt, Grant Martin, Andrew J. McNelly, and Lee Todd.
Surviving the SOC Revolution: A Guide to Platform-Based Design. Kluwer Academic Publishers,
Norwell, MA, 1999.
[2] Alberto Ferrari and Alberto L. Sangiovanni-Vincentelli. System design: traditional concepts and
new paradigms. In Proceedings of the International Conference on Computer Design, ICCD 1999,
October 1999, pp. 2–12.
[3] Alberto L. Sangiovanni-Vincentelli. Defining platform-based design. EEdesign, February 2002.
[4] James A. Rowson and Alberto L. Sangiovanni-Vincentelli. Interface-based design. In Proceedings
of the 34th Design Automation Conference, DAC 1997, Anaheim, CA, June 9–13, 1997, pp. 178–183.
[5] Marco Sgroi, Michael Sheets, Andrew Mihal, Kurt Keutzer, Sharad Malik, Jan Rabaey,
and Alberto Sangiovanni-Vincentelli. Addressing system-on-a-chip interconnect woes through
communication-based design. In Proceedings of the 38th Design Automation Conference, DAC
2001, Las Vegas, NV, June 2001, pp. 667–672.
[6] S. Chaki, S.K. Rajamani, and J. Rehof. Types as models: model checking message-passing programs.
In Proceedings of the 29th ACM Symposium on Principles of Programming Languages, 2002.
[7] Luca de Alfaro and Thomas A. Henzinger. Interface automata. In Proceedings of the Ninth Annual
Symposium on Foundations of Software Engineering, ACM Press, Vienna, Austria, 2001, pp. 109–120.
[8] Luca de Alfaro and Thomas A. Henzinger. Interface theories for component-based design.
In Thomas A. Henzinger and Christoph M. Kirsch, Eds., Embedded Software, Vol. 2211 of Lecture
Notes in Computer Science. Springer-Verlag, Heidelberg, 2001, pp. 148–165.
[9] David L. Dill. Trace Theory for Automatic Hierarchical Verification of Speed-Independent Circuits.
ACM Distinguished Dissertations. MIT Press, Cambridge, MA, 1989.
[10] Roberto Passerone, James A. Rowson, and Alberto L. Sangiovanni-Vincentelli. Automatic synthesis
of interfaces between incompatible protocols. In Proceedings of the 35th Design Automation
Conference, San Francisco, CA, June 1998.
[11] Kanna Shimizu, David L. Dill, and Alan J. Hu. Monitor-based formal specification of PCI.
In Proceedings of the Third International Conference on Formal Methods in Computer-Aided Design,
Austin, TX, November 2000.
[12] Roberto Passerone. Semantic Foundations for Heterogeneous Systems. Ph.D. thesis, Department of
EECS, University of California, Berkeley, CA, May 2004.
[13] Roberto Passerone, Luca de Alfaro, Thomas A. Henzinger, and Alberto L. Sangiovanni-Vincentelli.
Convertibility verification and converter synthesis: two faces of the same coin. In Proceedings of
the IEEE/ACM International Conference on Computer-Aided Design (ICCAD'02), November 2002.
[14] G. Borriello. A New Interface Specification Methodology and its Applications to Transducer Synthesis.
Ph.D. thesis, University of California at Berkeley, Berkeley, CA, 1988.
[15] G. Borriello and R.H. Katz. Synthesis and optimization of interface transducer logic. In Proceedings
of the International Conference on Computer Aided Design, November 1987.
[16] J.S. Sun and R.W. Brodersen. Design of system interface modules. In Proceedings of the International
Conference on Computer Aided Design, 1992, pp. 478–481.
[17] J. Akella and K. McMillan. Synthesizing converters between finite state protocols. In Proceedings
of the International Conference on Computer Design, Cambridge, MA, October 14–15, 1991,
pp. 410–413.
[18] S. Narayan and D.D. Gajski. Interfacing incompatible protocols using interface process generation.
In Proceedings of the 32nd Design Automation Conference, San Francisco, CA, June 12–16, 1995,
pp. 468–473.
[19] Robert Siegmund and Dietmar Müller. A novel synthesis technique for communication controller
hardware from declarative data communication protocol specifications. In Proceedings of the 39th
Conference on Design Automation, New Orleans, LA, 2002, pp. 602–607.
[20] Arindam Chakrabarti, Luca de Alfaro, Thomas A. Henzinger, and Freddy Y.C. Mang. Synchronous
and bidirectional component interfaces. In Proceedings of the 14th International Conference on
Computer-Aided Verification (CAV), Vol. 2404 of Lecture Notes in Computer Science. Springer-
Verlag, Heidelberg, 2002, pp. 414–427.
[21] Arindam Chakrabarti, Luca de Alfaro, Thomas A. Henzinger, and Marielle Stoelinga. Resource
interfaces. In Proceedings of the Third International Conference on Embedded Software (EMSOFT),
Vol. 2855 of Lecture Notes in Computer Science. Springer-Verlag, Heidelberg, 2003.
[22] Luca de Alfaro, Thomas A. Henzinger, and Marielle Stoelinga. Timed interfaces. In Proceedings of
the Second International Workshop on Embedded Software (EMSOFT), Vol. 2491 of Lecture Notes
in Computer Science. Springer-Verlag, Heidelberg, 2002, pp. 108–122.
[23] Kim G. Larsen and Liu Xinxin. Equation solving using modal transition systems. In Proceedings
of the Fifth Annual IEEE Symposium on Logic in Computer Science (LICS '90), June 4–7, 1990,
pp. 108–117.
[24] Nina Yevtushenko, Tiziano Villa, Robert K. Brayton, Alex Petrenko, and Alberto L. Sangiovanni-
Vincentelli. Sequential synthesis by language equation solving. Memorandum No. UCB/ERL
M03/9, Electronic Research Laboratory, University of California at Berkeley, Berkeley, CA, 2003.
[25] Adnan Aziz, Felice Balarin, Robert K. Brayton, Maria D. Di Benedetto, Alex Saldanha, and
Alberto L. Sangiovanni-Vincentelli. Supervisory control of finite state machines. In Pierre
Wolper, Ed., Proceedings of Computer Aided Verification: Seventh International Conference,
CAV '95, Vol. 939 of Lecture Notes in Computer Science, Liège, Belgium, July 1995. Springer,
Heidelberg, 1995.
[26] Jerry R. Burch, David L. Dill, Elizabeth S. Wolf, and Giovanni De Micheli. Modeling hierarchical
combinational circuits. In Proceedings of the IEEE/ACM International Conference on Computer-
Aided Design (ICCAD'93), November 1993, pp. 612–617.
[27] Jerry R. Burch, Roberto Passerone, and Alberto L. Sangiovanni-Vincentelli. Notes on agent
algebras. Technical Memorandum UCB/ERL M03/38, University of California, Berkeley, CA,
November 2003.
[28] Radu Negulescu. Process Spaces and the Formal Verification of Asynchronous Circuits. Ph.D. thesis,
University of Waterloo, Canada, 1998.
[29] Radu Negulescu. Process spaces. In C. Palamidessi, Ed., CONCUR, Vol. 1877 of Lecture Notes in
Computer Science. Springer-Verlag, Heidelberg, 2000.
[30] Jerry R. Burch. Trace Algebra for Automatic Verification of Real-Time Concurrent Systems.
Ph.D. thesis, School of Computer Science, Carnegie Mellon University, August 1992.
[31] Elizabeth S. Wolf. Hierarchical Models of Synchronous Circuits for Formal Verification and
Substitution. Ph.D. thesis, Department of Computer Science, Stanford University, October 1995.
24
Hardware/Software Interface Design for SoC
Wander O. Cesário
TIMA Laboratory
Flávio R. Wagner
UFRGS Instituto de Informática
A.A. Jerraya
TIMA Laboratory
24.1 Introduction ............................................................. 24-1
24.2 SoC Design .............................................................. 24-3
     System-Level Design Flow • SoC Design Automation: An Overview
24.3 HW/SW IP Integration .................................................... 24-5
     Introduction to IP Integration • Bus-Based and Core-Based Approaches • Integrating Software IP • Communication Synthesis • IP Derivation
24.4 Component-Based SoC Design ............................................. 24-8
     Design Methodology Principles • Virtual Architecture • Target MPSoC Architecture Model • HW/SW Wrapper Architecture • Design Tools • Defining IP-Component Interfaces
24.5 Component-Based Design of a VDSL Application .......................... 24-14
     Specification • DFU Abstract Architecture • MPSoC RTL Architecture • Results Evaluation
24.6 Conclusions ............................................................ 24-19
References .................................................................. 24-19
24.1 Introduction
Modern system-on-chip (SoC) design shows a clear trend toward integration of multiple processor cores.
The SoCsystemdriver sectionof theInternational Technology Roadmapfor Semiconductors [1]predicts
that the number of processor cores will increase fourfold per technology node in order to match the
processing demands of the corresponding applications. Typical multiprocessor SoC(MPSoC) applications
such as network processors, multimedia hubs, and base-band telecom circuits have particularly tight
time-to-market and performance constraints that require a very efcient design cycle.
Our conceptual model of the MPSoCplatformis composedof four kinds of components: software tasks,
processor and intellectual property (IP) cores, and a global on-chip interconnect IP (see Figure 24.1[a]).
Moreover, to complete the MPSoC platform we must also include hardware/software (HW/SW)
elements that adapt platform components to each other. MPSoC platforms are quite different from
single-master processor SoCs (SMSoCs). For instance, their implementation of system communication is
FIGURE 24.1 (a) MPSoC platform, (b) software stack, and (c) concurrent development environment.
more complicated since heterogeneous processors may be involved and complex communication protocols
and topologies may be used. The hardware adaptation layer must deal with some specific issues:
1. In SMSoC platforms, most peripherals (excluding DMA controllers) operate as slaves with respect
to the shared communication interconnect. MPSoC platforms may use many different types of processor
cores; in this case, sophisticated synchronization is needed to control shared communication
between several heterogeneous masters.
2. While SMSoC platforms use simple master/slave shared-bus interconnections, MPSoC platforms
often use several complex system buses or micronetworks as global interconnect. In MPSoC platforms,
we can separate computation and communication design by using communication coprocessors
and profiting from the multimaster architecture. Communication coprocessors/controllers
(masters) implement high-level communication protocols in hardware and execute them in parallel
with the computation executed on processor cores.
Application software is generally organized as a stack of layers that runs on each processor core (see
Figure 24.1[b]). The lowest layer contains drivers and low-level routines to control/configure the platform.
For the middle layer we can use any commercial embedded operating system (OS) and configure it
according to the application. The upper layer is an application-programming interface (API) that provides
some predefined routines to access the platform. All these layers correspond to the software adaptation
layer in Figure 24.1(a); coding application software can then be isolated from the design of the SoC platform
(software coding is not the topic of this chapter and will be omitted). One of the main contributions of
this work is to apply this layered approach to the dedicated software (often called firmware) as well.
Firmware is the software that controls the platform and, in some cases, executes some non-performance-critical
application functions. In this case, it is not realistic to use a generic OS as the middle layer,
for code size and performance reasons. A lightweight custom OS supporting an application-specific and
platform-specific API is required.
Software and hardware adaptation layers isolate platform components, enabling concurrent development,
as shown in Figure 24.1(c). With this scheme, the software design team uses APIs for both application
and dedicated software development. The hardware design team uses the abstract interfaces provided by
communication coprocessors/controllers. The SoC design team can concentrate on implementing HW/SW
abstraction layers for the selected communication interconnect IP. Designing these HW/SW abstraction
layers represents a major effort, and design tools are lacking. Established EDA tools are not well adapted to
this new MPSoC design scenario, and consequently many challenges are emerging; some major issues are:
1. A higher abstraction level is needed: the register-transfer level (RTL) is very time consuming for
modeling and verifying the interconnection between multiple processor cores.
2. Higher-level programming is needed: MPSoCs will include hundreds of thousands of lines of dedicated
software (firmware). This software cannot be programmed at the assembly level, as is done today.
3. Efficient HW/SW interfaces are required: microprocessor interfaces, register banks, shared
memories, software drivers, and OSs must be optimized for each application.
This chapter presents a component-based design automation approach for MPSoC platforms.
Section 24.2 introduces the basic concepts for MPSoC design and discusses some related platform- and
component-based approaches. Section 24.3 details IP-based methodologies for HW/SW IP integration.
Section 24.4 details our specification model and design flow. Section 24.5 presents the application of this
flow to the design of a VDSL circuit and the analysis of the results.
24.2 SoC Design
24.2.1 System-Level Design Flow
This section gives an overview of current SoC design methodologies using a template design flow (see
Figure 24.2). The basic theory behind this flow is the separation between communication and computation
refinement for platform- and component-based design [2,3]; it has five main design steps:
1. System specification: system designers and the end-customer must agree on an informal model
containing all of the application's functionality and requirements. Based on this model, system designers
build a more formal specification that can be validated by the end-customer.
2. Architecture exploration: system designers build an executable model of the specification and iterate
through a performance analysis loop to decide the HW/SW partitioning for the SoC architecture.
This executable specification uses an abstract platform composed of abstract models for HW/SW
components. For instance, an abstract software model can concentrate on I/O execution profiles,
most frequent use cases, or worst-case scheduling. Abstract hardware can be described using
FIGURE 24.2 System-level design flow for SoC.
transaction-level models or behavioral models. This step produces the golden architecture model,
that is, the customized SoC platform or a new architecture created by system designers after selecting
the processors, the global communication interconnect, and other IP components. Once the HW/SW
partitioning is decided, software and hardware development can proceed concurrently.
3. Software design: since the final hardware platform will not be available during software development,
some kind of hardware abstraction layer (HAL) or API must be provided to the software
design team.
4. Hardware design: hardware IP designers implement the functionality described by the abstract
hardware models at the RTL. Hardware IPs can use specific interfaces for a given platform or
standard interfaces as defined by the Virtual Socket Interface Alliance (VSIA) [4].
5. HW/SW IP integration: SoC designers create HW/SW interfaces to the global communication
interconnect. The golden architecture model must specify performance constraints to ensure
good HW/SW integration. SW/HW communication interfaces are designed to conform to these
constraints.
24.2.2 SoC Design Automation: An Overview
Many academic and industrial works propose tools for SoC design automation covering many, but not all,
of the design steps presented above. Most approaches can be classified into three groups: system-level synthesis,
platform-based design, and component-based design.
System-level synthesis methodologies are top-down approaches in which the SoC architecture and software
models are produced by synthesis algorithms from a system-level specification. COSY [5] proposes
a HW/SW communication refinement process that starts with an extended Kahn Process Network model
in design step (1), uses virtual channel connection (VCC) [6] for step (2), callback signals over a standard
real-time operating system (RTOS) for the API in step (3), and VSIA interfaces for steps (4) and (5).
SpecC [7] starts with an untimed functional specification model written in extended C in design step (1),
uses performance estimation on a structural architecture model for step (2), HW/SW interface synthesis
based on a timed bus-functional communication model for step (5), synthesized C code for step (3), and
behavioral synthesis for step (4).
Platform-based design is a meet-in-the-middle approach that starts with a functional system specification
and a predesigned SoC platform. Performance estimation models are used to try different mappings
between the set of the application's functional modules and the set of platform components. During these
iterations, designers can try different platform customizations and functional optimizations. VCC [6] can
produce a performance model using a functional description of the application and a structural description
of the SoC platform for design steps (1) and (2). CoWare N2C [8] is a good complement to VCC
for design steps (4) and (5). Still, the API for software components and many architecture details must be
implemented manually.
Section 24.3 discusses HW/SW IP integration in the context of current IP-based design approaches.
Most IP-based design approaches build SoC architectures from the bottom up, using predesigned components
with standard interfaces and a standard bus. For instance, IBM defined a standard bus called
CoreConnect [9], Sonics proposes a standard on-chip network called the Silicon Backplane Network [10],
and VSIA defined a standard component protocol called VCI. When needed, wrappers adapt incompatible
buses and component interfaces. Frequently, internally developed components are tied to in-house
(nonpublic) standards; in this case, adopting public standards requires a substantial effort to redesign interfaces
or wrappers for old components.
Section 24.4 introduces a higher-level IP-based design methodology for HW/SW interface design called
component-based design. This methodology defines a virtual architecture model composed of HW/SW
components and uses this model to automate design step (5) by providing automatic generation of
hardware interfaces (4), device drivers, OSs, and APIs (3). Even if this approach does not provide much
help in automating design steps (1) and (2), it provides a considerable reduction of design time for design
steps (3), (4), and (5) and facilitates component reuse. The key improvements over other state-of-the-art
platform- and component-based design approaches are:
1. Strong support for software design and integration: the generated API completely abstracts the
hardware platform and OS services. Software development can be concurrent with, and independent
of, platform customization.
2. Higher-level abstractions: the use of a virtual architecture model allows designers to deal with
HW/SW interfaces at a high abstraction level. Behavior and communication are separated in the
system specification; thus, they can be refined independently.
3. Flexible HW/SW communication: automatic HW/SW interface generation is based on the composition
of library elements. It can be used with a variety of IP interconnect components by adding
the necessary supporting library.
24.3 HW/SW IP Integration
There are two major approaches to the integration of HW/SW IP components into a given design. In the
first, component interfaces follow a given standard (such as a bus or core interface, for hardware
components, or a set of high-level communication primitives, for software components) and can thus be
directly connected to each other. In the second approach, components are heterogeneous in nature and
their integration requires the generation of HW/SW wrappers. In both cases, an RTOS must be used to
provide the services that are needed for the application software to fit into the SoC architecture. This
section describes different solutions to the integration of HW/SW IP components.
24.3.1 Introduction to IP Integration
The design of an embedded SoC starts with a high-level functional specification, which can be validated.
This specification must already follow a clear separation between computation and communication [11],
in order to allow their concurrent evolution and design. An abstract architecture is then used to evaluate
this functionality, based on a mapping that assigns functional blocks to architectural ones. This high-level
architectural model abstracts away all low-level implementation details. A performance evaluation
of the system is then performed, using estimates of the computation and communication costs.
Communication refinement is now possible, with the selection of particular communication mechanisms
and a more precise performance evaluation.
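Such an early performance evaluation can be sketched as a search over mappings, scoring each candidate with estimated computation and communication costs. Every number, block, and component name below is made up for illustration; real flows use calibrated estimation models rather than constants.

```python
from itertools import product

# Hypothetical cost model: computation cost depends on which component
# a functional block is mapped to; a communication cost is paid whenever
# two connected blocks land on different components.
comp_cost = {
    ('filter', 'cpu'): 90, ('filter', 'dsp'): 30,
    ('ctrl',   'cpu'): 10, ('ctrl',   'dsp'): 40,
}
edges = [('filter', 'ctrl')]   # connections in the functional model
comm_cost = 25                 # estimated cost per cross-component link

def cost(mapping):
    """Estimated cost of one functional-to-architectural mapping."""
    c = sum(comp_cost[(blk, tgt)] for blk, tgt in mapping.items())
    c += sum(comm_cost for a, b in edges if mapping[a] != mapping[b])
    return c

blocks, targets = ['filter', 'ctrl'], ['cpu', 'dsp']
best = min(
    (dict(zip(blocks, assign)) for assign in product(targets, repeat=len(blocks))),
    key=cost,
)
assert best == {'filter': 'dsp', 'ctrl': 'cpu'}   # cost 30 + 10 + 25 = 65
```

Exhaustive search is only viable for toy sizes; the same cost function would drive a heuristic (simulated annealing, ILP, and so on) in a realistic architecture-exploration loop.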
According to the platform-based design approach [2], the abstract architecture follows an architectural
template that is usually domain specific. This template includes both a hardware platform, consisting
of a given communication structure and given types of components (processors, memories, hardware
blocks), and a software platform, in the form of a high-level API. The target embedded SoC will be
designed as a derivative of this template, where the communication structure, the components, and
the software platform are all tailored to fit the particular application's needs.
The IP-based design approach follows the idea that the architectural template may be implemented
by assembling reusable HW/SW IP components, possibly even delivered by third-party companies.
The IP integration step comprises a set of tasks that are needed to assemble predesigned components
in order to fulfill system requirements. As shown in Figure 24.3, it takes as inputs the abstract architecture
and a set of HW/SW IP components that have been selected to implement the architectural blocks.
Its output is a microarchitecture where hardware components are described at the RTL with all the cycle- and
pin-accurate details that are needed for further automatic synthesis. Software components are described
in an appropriate programming language, such as C, and can be directly compiled to the target processors
of the architecture.
In an ideal situation, IP components would fit directly together (or to the communication structure)
and exactly match the desired SoC functionality. In a more general situation, the designer may need
to adapt each component's functionality (a step called IP derivation) and synthesize HW/SW wrappers
FIGURE 24.3 The HW/SW IP integration design step.
to interconnect them. For programmable components, although adaptation may be easily performed
by programming the desired functionality, the designer may still need to develop software wrappers
(usually device and bus drivers) to match the application software to the communication infrastructure.
The generation of HW/SW wrappers is usually known as interface or communication synthesis.
Besides this, the application software may also need to be retargeted to the processors and OS of the chosen
architecture.
In the following subsections, different approaches to IP integration are introduced and their impact on
the possible integration subtasks is analyzed.
24.3.2 Bus-Based and Core-Based Approaches
In the bus-based design approach [9,12,13], IP components communicate through one or more buses
(interconnected by bus bridges). Since the bus specification can be standardized, libraries of components
whose interfaces directly match this specification can be developed. Even if components follow
the bus standard, very simple bus interface adapters may still be needed [14]. For components that
do not directly match the specification, wrappers have to be built. Companies offer very rich component
libraries and specialized development and simulation environments for designing systems around
their buses.
A somewhat different approach is core-based design, as proposed by the VSIA VCI standard [4]
and by the OCP-IP organization [15]. In this case, IP components are compliant with a bus-independent,
standardized interface and can thus be directly connected to each other. Although the standard
may support a wide range of functionality, each component may have an interface containing only the
functions that are relevant to it. These components may also be interconnected through a bus, in which
case standard wrappers can adapt the component interface to the bus. Sonics [13] follows this approach,
proposing wrappers to adapt the bus-independent OCP socket to the MicroNetwork bus.
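The role of such a standard wrapper can be sketched as a small adapter between a bus-independent request/response interface and a bus-specific multi-phase protocol. Both interfaces below are invented for illustration and are far simpler than OCP or VCI; only the shape of the adaptation is meant to carry over.

```python
class CoreIP:
    """Component side: issues abstract, bus-independent read requests."""
    def read(self, addr):
        return ('REQ_READ', addr)

class SharedBus:
    """Bus side: a hypothetical bus expecting separate address and
    data phases (a stand-in for a real bus protocol)."""
    def __init__(self, memory):
        self.memory = memory
        self._addr = None
    def address_phase(self, addr):
        self._addr = addr
    def data_phase(self):
        return self.memory[self._addr]

class BusWrapper:
    """The wrapper: translates the core's one-shot request into the
    bus's two-phase signalling, so neither side changes."""
    def __init__(self, bus):
        self.bus = bus
    def submit(self, request):
        kind, addr = request
        assert kind == 'REQ_READ'      # only reads in this sketch
        self.bus.address_phase(addr)
        return self.bus.data_phase()

bus = SharedBus(memory={0x10: 0xAB})
wrapper = BusWrapper(bus)
assert wrapper.submit(CoreIP().read(0x10)) == 0xAB
```

Because the core never sees the bus protocol, retargeting it to a different interconnect means writing a new wrapper, not redesigning the component.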
For particular needs, the SoC may be built around a sophisticated and dedicated network-on-chip (NoC)
[16] that may deliver very high performance for connecting a large number of components. Even in this
case, a bus- or core-based approach may be adopted to connect the components to the network.
Bus-based and core-based design methodologies are integration approaches that depend on standardized
component or bus interfaces. They allow the integration of homogeneous IP components that
follow these standards and can be directly connected to each other, without requiring the development of
complex wrappers. The problem we face is that many de facto standards exist, coming from different
companies or organizations, thus preventing a real interchange of libraries of IP components developed
for the different substandards.
24.3.3 Integrating Software IP
Programmable components are important in a reusable architectural platform, since it is very cost-effective
to tailor a platform to different applications by simply adapting the low-level software and maybe only
conguring certain hardware parameters, such as memory sizes and peripherals.
As illustrated in Figure 24.3, the software view of an embedded system shows three different layers:
1. The bottom layer is composed of services directly provided by hardware components (processor
and peripherals) such as instruction sets, memory and peripheral accesses, and timers.
2. The top layer is the application software, which should remain completely independent from the
underlying hardware platform.
3. The middle layer is composed of three different sublayers, as seen from bottom to top:
(a) Hardware-dependent software (HdS), consisting for instance of device drivers, boot code, parts
of an RTOS (such as context switching code and configuration code to access the memory
management unit [MMU]), and even some domain-oriented algorithms that directly interact
with the hardware.
(b) Hardware-independent software, typically high-level RTOS services, such as task scheduling
and high-level communication primitives.
(c) The API, which defines a system platform that isolates the application software from the
hardware platform and from all basic software layers and enables their concurrent design.
The standardization of this API, which can be seen as a collection of services usually offered by an
OS, is essential for software reuse above and below it. At the application software level, libraries of
reusable software IP components can implement a large number of functions that are necessary for
developing systems for given application domains. If, however, one tries to develop a system by integrating
application software components that do not directly match a given API, software retargeting to the new
platform will be necessary. This can be a very tedious and error-prone manual process, which is a candidate
for an automatic software synthesis technique.
Nevertheless, reuse can also be obtained below the API. Software components implementing the
hardware-independent parts of the RTOS can be more easily reused, especially if the interface between
this layer and the HdS layer is standardized. Although the development of reusable HdS may be harder to
accomplish, because of the diversity of hardware platforms, it can at least be achieved for platforms aimed
at specific application domains.
There are many academic and industrial alternatives providing RTOS services. The problem with most
approaches, however, is that they do not consider specific requirements for SoC, such as minimizing
memory usage and power consumption. Recent research efforts propose the development of application-
specific RTOSs containing only the minimal set of functions needed for a given application [17,18] or
including dynamic power management techniques [19]. IP integration methodologies should thus con-
sider the generation of application-specific RTOSs that are compliant with a standard API and optimized for
given system requirements.
In recent years, many standardization efforts aimed at hardware IP reuse have been developed. Similar
efforts for software IP reuse are now needed. VSIA [4] has recently created working groups to deal with
HdS and platform-based design.
24.3.4 Communication Synthesis
Solutions for the automatic synthesis of communication wrappers to connect hardware IP components
that have incompatible interfaces have been already proposed. In the PIG tool [20], component interfaces
are specified as protocols described as regular expressions, and a finite state machine (FSM) interface for
connecting two arbitrary protocols is automatically generated. The Polaris tool [21] generates adapters
based on state machines for converting component protocols into a standard internal protocol, together
with send and receive buffers and an arbiter.
These approaches, however, do not address the integration of software IP components. The TEReCS
tool [18] synthesizes communication software to connect software IP components, given a specification
of the communication architecture and a binding of IP components to processors. In the IPChinook
environment [22], abstract communication protocols are synthesized into low-level bus protocols according
to the target architecture. While the IPChinook environment also generates a scheduler for a given
partitioning of processes onto processors, the TEReCS approach is associated with the automatic synthesis of
a minimal OS, assembled from a general-purpose library of reusable objects that are configured according
to application demands and the underlying hardware.
Recent solutions uniformly handle HW/SW interfaces between IP components. In the COSY approach
[5], design is performed by an explicit separation between function and architecture. Functions are
then mapped to architectural components. Interactions between functions are modeled by high-level
transactions and then mapped to HW/SW communication schemes. A library provides a fixed set of
wrapper IPs, containing HW/SW implementations for given communication schemes.
24.3.5 IP Derivation
Hardware IP components may come in several forms [23]. They may be hard, when all gates and
interconnects are placed and routed; soft, with only an RTL representation; or firm, with an RTL descrip-
tion together with some physical floorplanning or placement. The integration of hard IP components
cannot be performed by adapting their internal behavior and structure. While they have the advantage of
more predictable performance, they are less flexible and therefore less reusable than adaptable
components.
Several approaches for enhancing reusability are based on adaptable components. Although one can
think of very simple component configurations (for instance, by selecting a bit width), a higher degree
of reusability can be achieved by components whose behavior can be more freely modied. Object
orientation is a natural vehicle for high-level modeling and adaptation of reusable components [24,25].
This approach, which can be better classified as IP derivation, is adequate not only for firm and soft
hardware IP components, but also for software IP [26]. Although component reusability is enhanced by
this approach, the system integrator faces a greater design effort, and it becomes more difficult to predict
IP performance.
Intellectual property derivation and communication synthesis are different approaches to solving the same
problem of integration between heterogeneous IP components, which do not follow standards (or the
same substandards). IP derivation is a solution usually based on object-oriented concepts coming from
the software community. It can be applied to the integration of application software components and of
soft and firm hardware components, but it cannot be used for hard IP components. Communication syn-
thesis, on the other hand, follows the path of the hardware community in automatic logic and high-level
synthesis. It is the only solution to the integration of heterogeneous hard IP components, although it can
also be used for integrating software IP and soft and firm hardware IP. While IP derivation is essentially a
user-guided manual process, communication synthesis is an automatic process, with no user intervention.
24.4 Component-Based SoC Design
This section introduces the component-based design methodology, a high-level IP-based methodology
aimed at the integration of heterogeneous HW/SW IP components. It follows an automatic communica-
tion synthesis approach, generating both hardware and software wrappers. It also generates a minimal and dedicated
OS for programmable components. It uses a high-level API, which isolates the application software from
the implementation of a HW/SW solution for the system platform, such that software retargeting is not
necessary.
This approach enables the automatic integration of heterogeneous IP components (which do not follow
a given bus or core standard) and hard IP components (whose internal behavior or structure is not known).
However, the approach is also very well suited to the integration of homogeneous and soft IP components.
The methodology has been conceived to fit any communication structure, such as an NoC [16] or a bus.
The component-based methodology is based on a clear definition of three abstraction levels that are also
adopted by other current approaches: system (pure functional), macroarchitecture, and microarchitecture
(RTL). These levels constitute clear interfaces between design steps, promoting reuse of both components
and tools for design tasks at each of these levels.
24.4.1 Design Methodology Principles
The design flow starts with a virtual architecture model that corresponds to the golden architecture in
Figure 24.2 and allows automatic generation of wrappers, device drivers, OSs, and APIs. The goal is to
produce a synthesizable RTL model of the MPSoC platform that is composed of processor cores, IP cores,
the communication interconnect IP, and HW/SW wrappers. The latter are automatically generated from
the interfaces of virtual components (as indicated by the arrows in Figure 24.4). Software written for the
virtual architecture specification runs without modification on the implementation because the same APIs
are provided by the generated custom OSs.
The input abstract architecture (see Figure 24.4[a]) is composed of virtual modules (VM), correspond-
ing to processing and memory IPs, connected by any communication structure, also encapsulated within
a VM. This abstract architecture model clearly separates computation from communication, allowing
independent and concurrent implementation paths for components and for communication. VMs that
FIGURE 24.4 MPSoC design flow: (a) virtual architecture and (b) target MPSoC platform.
correspond to processors may be hierarchically decomposed into submodules containing software tasks
assigned to this processor. VMs communicate through virtual ports, which are sets of hierarchical internal
and external ports through which services are requested and provided. The separation between internal and
external ports makes possible the connection of modules described at different abstraction levels.
24.4.2 Virtual Architecture
The virtual architecture represents a system as an abstract netlist of virtual components (see
Figure 24.4[a]). It is described in VADeL, a SystemC [27] extension that includes a platform-independent
API offering high-level communication primitives. This API abstracts the underlying hardware platform,
thus enabling the free development of reusable components. In the abstract architecture model, the
interfaces of software tasks are the same for SW/SW and SW/HW connections, even if the software tasks
are executed by different processors. Different HW/SW realizations of this API are possible. Architec-
tural design space exploration can thus be achieved without influencing the functional description of the
application.
Virtual components use wrappers to adapt accesses from the internal component (a set of software
tasks or a hardware function) to the external channels. The wrapper is modeled as a set of virtual ports
that contain internal and external ports that can differ in terms of: (1) communication protocol,
(2) abstraction level, and (3) specification language. This model is not directly synthesizable or executable
because the wrappers' behavior is not described. These wrappers can be generated automatically, in order
to produce a detailed architecture that can be both synthesized and simulated.
The required SystemC extensions implemented in VADeL are:
1. Virtual module: consists of a module and its wrapper.
2. Virtual port: groups some internal and external ports that have a conversion relationship. The
wrapper is the set of virtual ports for a given VM.
3. Virtual channel: groups several channels having a logical relationship (e.g., multiple channels
belonging to the same communication protocol).
4. Parameters: used to customize hardware interfaces (e.g., buffer size and physical addresses of ports),
OSs, and drivers.
In VADeL, there are also predefined ports with special semantics called SAPs (Service Access Ports). They
can be used to access some services that are implemented by hardware or software wrapper components.
For instance, the timer SAP can be used to request an interrupt from a hardware timer after a given delay.
24.4.3 Target MPSoC Architecture Model
We use a generic MPSoC architecture where processors and other IP cores are connected to a global
communication interconnect IP via wrappers (see Figure 24.4[b]). In fact, processors are separated from
the physical communication IP by wrappers that act as communication coprocessors or bridges, freeing
processors from communication management and enabling parallel execution of computation tasks and
communication protocols. Software tasks also need to be isolated from hardware through an OS that
plays the role of a software wrapper. When defining this model, our goal was to have a generic model where
both computation and communication may be customized to fit the specific needs of the application.
For computation, we may change the number and kind of components and, for communication, we can
select a specific communication IP and protocols. This architecture model is suitable for a wide domain of
applications; more details can be found in Reference 28.
24.4.4 HW/SW Wrapper Architecture
Wrappers are automatically generated as point-to-point adapters between each VM and the communica-
tion structure, as shown in Figure 24.4(b) [28]. This approach allows the connection of components to
standard buses as well as point-to-point connections between cores.
FIGURE 24.5 HW/SW wrapper architecture: (a) software wrapper; (b) hardware wrapper.
Wrappers may have HW/SW parts. The internal architecture of a wrapper on the hardware side is shown
in Figure 24.5(b). It consists of a processor adapter, one or more channel adapters, and an internal bus. The
number of channel adapters depends on the number of channels that are connected to the corresponding
VM. This architecture allows the easy generation of multipoint, multiprotocol wrappers. The wrapper
dissociates communication from computation, since it can be considered as a communication coprocessor
that operates concurrently with other processing functions.
On the software side [17], as shown in Figure 24.5(a), wrappers provide the implementation of the
high-level communication primitives (available through the API) used in the system specification and
drivers to control the hardware. If required, the wrapper will also provide sophisticated OS services such
as task scheduling and interrupt management minimally tailored for the particular application.
The synthesis of wrappers is based on libraries of basic modules from which hardware wrappers and
dedicated OSs are assembled. These libraries may be easily extended with modules that are needed to build
wrappers for processors, memories, and other components that follow various bus and core standards.
24.4.5 Design Tools
Figure 24.6 shows an overall view of our design environment, which is called ROSES. The input model may
be imported from a specification analysis tool (e.g., Reference 6) or manually coded using our extended
SystemC library. All design tools use a unified design model that contains an abstract HW/SW netlist
annotated with parameters (Colif [29]). Hardware wrapper generation [28] transforms the input model
into a synthesizable architecture. The software wrapper generator [17] produces a custom OS for each
processor on the target platform. For validation, we use the cosimulation wrapper generator [30] to
produce simulation models. Details about these tools can be found in the references; only their principles
will be discussed here.
Hardware wrapper generation assembles library components using the architecture template presented
before (Figure 24.5[b]) to produce the RTL architecture. This library contains generalized descriptions
of hardware components in a macrolanguage (m4-like); it has two parts: the processor library and the
protocol library. The former contains local template architectures for processors with four types of elements:
processor cores, local buses, local IP components (e.g., local memory, address decoder, coprocessors, etc.),
and processor adapters. The latter consists of a list of channel adapters. Each channel adapter has simula-
tion, estimation, and synthesis models that are parameterized (by the channel parameters, e.g., direction,
storage size, and data type) as the elements in the processor library.
The software wrapper generator produces OSs streamlined and preconfigured for the software
module(s) that run(s) on each target processor. It uses a library organized in three parts: APIs,
communication/system services, and device drivers. Each part contains elements that will be used in
a given software layer in the generated OS. The generated OS provides services: communication services
(e.g., FIFO [first in, first out] communication), I/O services (e.g., AMBA bus drivers), memory services
(e.g., cache or virtual memory usage), etc. Services have dependencies between them; for instance, com-
munication services depend on I/O services. Elements of the OS library also have dependency
FIGURE 24.6 Design automation tools for MPSoC.
information. This mechanism is used to keep the size of the generated OS at a minimum; the elements
that provide unnecessary services are not included.
There are two types of service code: reusable (or existing) code and expandable code. As an example
of existing code, AMBA bus-master service code can exist in the OS library in the form of C language.
As an example of expandable code, OS kernel functions can exist in the OS library in the form of macro-
code (m4-like). Several preemptive schedulers are available in the OS library, such as a round-robin
scheduler, a priority-based scheduler, etc. In the case of the round-robin scheduler, time slicing (i.e.,
assigning different CPU loads to tasks) is supported. To keep the OS kernel very small and flexible, (1) the task
scheduler can be selected from the requirements of the application code and (2) a minimal amount (less
than 10% of the kernel code size) of processor-specific assembly code is used (for context switching and
interrupt service routines).
The cosimulation wrapper generator [30] produces an executable model composed of a SystemC simu-
lator that acts as a master for other simulators. A variety of simulators can participate in this cosimulation:
SystemC, VHDL, Verilog, and Instruction-set simulators. Cosimulation wrappers have the same structure
as that of hardware wrappers (see Figure 24.5[b]), with simulation adapters in the place of processor
adapters and simulation models in the place of channel adapters. In the cosimulation wrapper library,
there are simulation adapters for the different simulators supported and channel adapters that implement
all supported protocols in different languages.
In terms of functionality, the cosimulation wrapper transforms channel access(es) via internal port(s)
to channel access(es) via external port(s) using the following functional chain: channel interface, channel
resolution, data conversion, and module communication behavior. Internal ports use channel functions
(e.g., FIFO available and FIFO write) to exchange data. The channel interface provides the implementation of these
channel functions. Channel resolution maps the N-to-M correspondence between internal and external ports.
Data conversion is required since different abstraction levels can use different data types to represent the
same data. Module communication behavior is required to exchange data via external port(s), that is, to
call port functions of external ports.
24.4.6 Defining IP-Component Interfaces
Hardware and software component interfaces must be composed using basic elements of the hardware
wrapper and software wrapper generators' libraries, respectively. Table 24.1 lists some API functions
available for different kinds of software task interfaces and some services provided by channel adapters
available for use in hardware component interfaces.
Software tasks must communicate through API functions provided by the software wrapper generator
library. For instance, the shared memory (SHM) API provides read/write functions for intertask
communication. The guarded shared memory (GSHM) API adds semaphore services to the SHM API
by providing lock/unlock functions.
Hardware IP components must communicate through communication primitives provided by the
channel adapters of the hardware wrapper generator library. For instance, FIFO channel adapters (sender
and receiver) implement a buffered two-phase handshake protocol (put/get) and provide full/empty func-
tions for accessing the state of the buffer. ASFIFO channel adapters instead use a single-phase handshake
protocol and can generate an interrupt for signaling the full and empty state of the buffer.
A recurrent problem in library-based approaches is library size explosion. In ROSES, this problem is
minimized by the use of layered library structures where a service is factorized so that its implementation
uses elements of different layers. This scheme increases reuse of library elements since the elements of the
upper layers must use the services provided by the elements in the immediate lower layer.
Designers are able to extend ROSES libraries since they are implemented in an open format. This is
an important feature since it enables the support of different standards while reusing most of the basic
elements in the libraries.
Table 24.2 shows some of the existing HW/SW components in the current ROSES IP library and gives
the type of communication they use in their interfaces.
TABLE 24.1 HW/SW Communication APIs

     Basic component interfaces    API functions
SW   Register                      Put/get
     Signal                        Sleep/wakeup
     FIFO                          Put/get
     SHM                           Read/write
     GSHM                          Lock/unlock/read/write
HW   Register                      Put/get
     FIFO                          Put/get/full/empty
     ASFIFO                        Put/get/IT(full/empty)
     Buffer                        BPut/BGet
     Event                         Send/IT(receiver)
     AHB master/slave              Read/write
     Timer                         Set/wait
TABLE 24.2 Sample IP Library

     IP          Description                        Interfaces
SW   host-if     Host PC interface                  Register/signal
     Rand        Random number generator            Signal/FIFO
     mult-tx     Multipoint FIFO data transmission  FIFO
     reg-config  Register configuration             Register/FIFO/SHM
     shm-sync    SHM synchronization                SHM/signal
     stream      FIFO data streaming                GSHM/FIFO/signal
HW   ARM7        Processor core                     ARM7 pins
     TX_Framer   Data package framing               17 registers, 1 FIFO
Stream
 1  void stream::stream_beh()
 2  {
 3    long int *P;
 4    ...
 5    for(;;)
 6    {...
 7      P = (long int*)P1.Lock();
 8      P2.Sleep();
 9      for (int i = 0; i < 8; i++)
10      {
11        long int val = P3.Get();
12        P4.Put(*(P+i+8));
13        ...};...
14      P1.Unlock();
15    }
16    ...
17  }
FIGURE 24.7 (a) The stream software IP and (b) the TX_Framer hardware IP.
Figure 24.7(a) shows the stream software IP and part of its code to demonstrate the utilization of the
communication APIs. Its interface is composed of four ports: two for the FIFO API (P3 and P4), one
for the signal API (P2), and one for the GSHM API (P1). In line 7 of Figure 24.7(a), the stream IP uses
P1 to lock the access to the SHM that contains the data that will be streamed. P2 is used to suspend the
task that fills up the SHM (line 8). Then, some header information is read from the input FIFO using P3
(line 11) and streamed to the output FIFO using P4 (line 12). When streaming is finished, P1 is used to
unlock the access to the SHM (line 14).
Figure 24.7(b) shows the TX_Framer hardware IP, which is part of a VDSL modem and responsible
for packaging data into ATM-network-compatible frames. Its interface is composed of 17 configuration
registers (P1–P17) and one single-handshake input FIFO (P18–P20). The registers are used to configure the
IP functionality and have bit sizes varying from 2 to 11, while the FIFO is used to store data packets that will
be inserted into specific places in the output ATM frames. These ports are driven directly by the compatible
outputs from the register and ASFIFO channel adapters that are generated by the hardware wrapper generator.
24.5 Component-Based Design of a VDSL Application
24.5.1 Specification
The design presented in this section illustrates the IP integration capabilities of ROSES. We redesigned
part of a VDSL modem that was prototyped by Reference 31 using discrete components (the shaded part
in Figure 24.8[a]). The block diagram for the modem subset used in the rest of this chapter is shown in
Figure 24.8(b). It corresponds to a deframing/framing unit (DFU), composed of two ARM7 processors
and the TX_Framer. The TX_Framer is part of the VDSL Protocol Processor. In this experiment, it is
used as a hard IP component described at the RTL. The partition of processors/tasks was suggested by the
design team of the VDSL-modem prototype.
Processors exchange data using three asynchronous FIFO buffers. The TX_Framer IP has some config-
uration registers and inputs a data stream through a synchronous FIFO buffer. Tasks use a variety of control
and data-transmission protocols to communicate. For instance, a task can block/unblock the execution
of other tasks by sending them an OS signal. For data transmission, tasks use: a FIFO memory buffer,
two shared memories (with or without semaphores), and direct register access. Despite representing only
FIGURE 24.8 (a) VDSL modem prototype and (b) DFU block diagram.
a subset of the VDSL modem, the design of the DFU remains quite challenging. In fact, it uses two
processors executing parallel tasks. The control over the three modules of the specification is fully distributed.
All three modules act as masters when interacting with their environment. Additionally, the application
includes multipoint communication channels requiring sophisticated OS services.
24.5.2 DFU Abstract Architecture
Figure 24.9 shows the abstract architecture model that captures the DFU specification with point-to-point
communications between the three main IP cores. VM1 and VM2 are two virtual processors, and VM3
corresponds to the TX_Framer function. VM1 and VM2 include several submodules corresponding to
software tasks T1 through T9 assigned to these processors. This abstract model can be mapped onto
different concrete microarchitectures depending on the selected IP components and on desired performance,
area, and power constraints. For instance, the three point-to-point connections (VC1, VC2, and VC3)
between VM1 and VM2 can be mapped onto a bus or onto an SHM.
FIGURE 24.9 DFU abstract architecture specification.
TABLE 24.3 HW/SW IP Utilization

     IP          Description                        Use
SW   host-if     Host PC interface                  T1
     Rand        Random number generator            T2, T3
     mult-tx     Multipoint FIFO data transmission  T4, T8
     reg-config  Register configuration             T5
     shm-sync    SHM synchronization                T6, T9
     Stream      FIFO data streaming                T7
HW   ARM7        Processor core                     VM1, VM2
     TX_Framer   Data package framing               VM3
24.5.3 MPSoC RTL Architecture
For the implementation of the DFU virtual architecture, two hardware IP cores have been selected: an
ARM7 processor and the TX_Framer. The application software has been built by reusing several available
software IP components for implementing tasks T1 to T9. Table 24.3 lists the selected IP components and
indicates their correspondence to the VMs and submodules in the DFU virtual architecture. The interfaces
of the selected software IP components in Table 24.3 (see Table 24.2) match the communication type of
the software tasks of the virtual architecture in Figure 24.8(b).
Figure 24.10 shows the RTL microarchitecture obtained after HW/SW wrapper generation. It is
important to notice that, from an abstract target architecture containing an abstract ARM7 processor, ROSES
automatically generates a concrete ARM7 local architecture containing additional IP components, which
implement local memory, local bus, and address decoder.
Each software wrapper (custom OS) is customized to the set of software IPs corresponding to the
tasks executed by the processor core. For example, software IPs running on VM2 access the custom OS
using communication primitives available through the API: register is used to write/read to/from the
configuration/status registers inside the TX_Framer block, while SHM and GSHM are used to manage
shared-memory communication. Each OS contains a round-robin scheduler (Sched) and resource
management services (Sync, IT). The driver layer contains low-level code to access the channel adapters within
FIGURE 24.10 Generated MPSoC Architecture.
TABLE 24.4 Results for OS Generation

OS results   Number of     Number of lines   Code size   Data size
             lines in C    in assembly       (bytes)     (bytes)
VM1          968           281               3829        500
VM2          1872          281               6684        1020

Context switch (cycles)                     36
Latency for interrupt treatment (cycles)    59 (OS) + 28 (ARM7)
System call latency (cycles)                50
Resume of task execution (cycles)           26
the hardware wrappers (e.g., Pipe LReg for the HNDSHK channel adapter), and some low-level kernel
routines.
24.5.4 Results
The manual design of a full VDSL modem requires several person-years; the presented DFU was estimated
as a more than five person-years effort. Using the ROSES IP integration capabilities, the overall
experiment took one person only 4 months, including all validation and verification time (but
not counting the effort to develop library components and to debug design tools). This corresponds to
a 15-fold reduction in design effort (a more detailed presentation can be found in Reference 32).
Application code and the generated OS are compiled and linked together to be executed on each ARM7
processor. The hardware wrapper can be synthesized using RTL synthesis. As can be seen in Table 24.4,
most OS code is generated in C; only a small part of it is in assembly and includes some low-level
routines (e.g., context switching and processor boot) that are specific to each processor. If we compare
the numbers presented in Table 24.4 with commercial embedded OSs, the results are still very good. The
minimum size for such OSs is around 4 KB, but with this size, few of them could provide the required
functionality. Table 24.5 shows the numbers obtained after RTL synthesis of the hardware wrappers
using a CMOS (complementary metal oxide semiconductor) 0.35 µm technology. These are good results
because the wrappers account for less than 5% of the ARM7 core's area and have a critical path that
TABLE 24.5 Results for Hardware Wrapper Generation

HW interfaces   Number of gates   Critical path delay (nsec)   Maximum frequency (MHz)
VM1             3284              5.95                         168
VM2             3795              6.16                         162

Latency for read operation (clock cycles)    6
Latency for write operation (clock cycles)   2
Number of code lines (RTL VHDL)              2168
corresponds to less than 15% of the clock cycle for the 25 MHz ARM7 processors used in this case
study.
24.5.5 Evaluation
Results show that the component-based approach can generate HW/SW interfaces and OSs that are
as efficient as the manually coded/configured ones. The HW/SW frontier in wrapper implementation
can easily be shifted by changing some library components. This choice is transparent to the final
user, since everything that implements the interconnect API is generated automatically (the API does not
change, only its implementation does). Furthermore, correctness and coherence can be verified inside
tools and libraries against the API semantics without having to impose fixed boundaries on the HW/SW
frontier (in contrast to standardized component interfaces or buses).
The utilization of layered library components provides considerable flexibility; the design environment
can be easily adapted to accommodate different languages to describe system behavior, different task
scheduling and resource management policies, different global communication interconnect topologies
and protocols, a diversity of processor cores and IP cores, and different memory architectures. In most
cases, inserting a new design element in this environment only requires adding the appropriate library
components. Layered library components are at the root of the methodology; the principle followed
is that each component contains a unique functionality and respects well-defined interfaces that enable
easy composition. This layered structure prevents library size explosion, since composition is used to
implement complex functionality and to increase component reuse.
As explained in this chapter and illustrated by the design case study, ROSES uses a component-based
methodology that presents a unique combination of features:
It implements a general approach for the automatic integration of heterogeneous and hard
IP components, although it easily accommodates the integration of homogeneous and soft IP
components.
It offers an architecture-independent API, integrated into SystemC, containing high-level
communication primitives and enhancing the free development of reusable components. Application
software accessing this API does not need to be retargeted to each system implementation.
It adopts a library-based approach to wrapper generation. As long as components communicate
through a known protocol, communication synthesis can be done automatically, without any
additional design effort. In a formal-based approach, instead, the designer must describe the component
interface by some appropriate formalism, such as an FSM [21] or a regular expression [20].
It uniformly addresses the generation of HW/SW parts of communication wrappers for programmable
components. While some approaches consider only hardware [20,21] or software [18,22]
wrappers, others also consider HW/SW parts but are restricted to predefined wrapper libraries
for given communication schemes [5]. The library-based approach of ROSES, in turn, allows the
synthesis of software interfaces for various communication schemes.
It can be used with any architectural template and communication structure, such as a bus, an NoC,
or point-to-point connections between components. It is also configurable to synthesize wrappers
for any bus or core standard.
24.6 Conclusions
Reuse of IP components is a major requirement for the design of complex embedded SoCs. However, reuse
is a complex process that involves many steps, requires support from specialized tools and methodologies,
and influences current design practices. The integration of IP components into a particular design is perhaps
the most complex step of the reuse process. Many design approaches, such as bus- and core-based design
and platform-based design, are aimed at an easier IP integration. Nevertheless, many problems are still
open, in particular the automatic synthesis of HW/SW wrappers between heterogeneous and hard IP
components.
The chapter has shown that the component-based design methodology provides a complete, generic,
and efficient solution to the HW/SW interface design problem. Starting from a high-level functional
specification and an abstract architecture, design tools can automatically generate the HW/SW wrappers
that are necessary to integrate heterogeneous IP components that have been selected to implement the
application. Designers do not need to design any low-level interfacing details manually. The chapter
has also shown how HW/SW component interfaces can be decomposed and easily adapted to different
communication structures and bus and core standards.
References
[1] ITRS. Available at http://public.itrs.net/.
[2] K. Keutzer, A.R. Newton, J.M. Rabaey, and A. Sangiovanni-Vincentelli. System-Level Design:
Orthogonalization of Concerns and Platform-based Design. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 19: 1523–1543, 2000.
[3] M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, and A. Sangiovanni-Vincentelli.
Addressing the System-on-Chip Interconnect Woes through Communication-Based Design.
In Proceedings of the 38th Design Automation Conference, Las Vegas, NV, June 2001.
[4] VSIA. http://www.vsi.org.
[5] J.-Y. Brunel, W.M. Kruijtzer, H.J.H.N. Kenter, F. Pétrot, L. Pasquier, E.A. de Kock, and W.J.M. Smits.
COSY Communication IPs. In Proceedings of the 37th Design Automation Conference, Los Angeles,
CA, June 2000.
[6] Cadence Design Systems, Inc. Virtual Component Co-design. http://www.cadence.com/products/
vcc.html.
[7] D. Gajski, J. Zhu, R. Dömer, A. Gerstlauer, and S. Zhao. SpecC Specification Language and
Methodology. Kluwer Academic Publishers, Dordrecht, 2000.
[8] CoWare. http://www.coware.com.
[9] IBM CoreConnect Bus Architecture. http://www.chips.ibm.com/bluelogic.
[10] D. Wingard. MicroNetwork-Based Integration for SOCs. In Proceedings of the 38th Design
Automation Conference, Las Vegas, NV, June 2001.
[11] J. Rowson and A. Sangiovanni-Vincentelli. Interface-Based Design. In Proceedings of the 34th
Design Automation Conference, 1997.
[12] ARM AMBA. http://www.arm.com.
[13] Sonics SiliconBackplane MicroNetwork. http://www.sonicsinc.com.
[14] R.A. Bergamaschi and W.R. Lee. Designing Systems-on-Chip Using Cores. In Proceedings of the
37th Design Automation Conference, 2000.
[15] Open Core Protocol. http://www.ocpip.org.
[16] G. de Micheli and L. Benini. Networks-on-Chip: A New Paradigm for Systems-on-Chip Design.
In Proceedings of the Design, Automation and Test in Europe Conference, 2002.
[17] L. Gauthier, S. Yoo, and A.A. Jerraya. Automatic Generation and Targeting of Application-Specific
Operating Systems and Embedded Systems Software. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, 20(11): 1293–1301, 2001.
[18] C. Böke. Combining Two Customization Approaches: Extending the Customization Tool TEReCS
for Software Synthesis of Real-Time Execution Platforms. In Proceedings of the Workshop on
Architectures of Embedded Systems (AES2000), Karlsruhe, Germany, January 2000.
[19] L. Benini, A. Bogliolo, and G. De Micheli. Dynamic Power Management of Electronic Systems.
In Proceedings of International Conference on Computer Aided Design, 1998.
[20] R. Passerone, J.A. Rowson, and A. Sangiovanni-Vincentelli. Automatic Synthesis of Interfaces
between Incompatible Protocols. In Proceedings of the 35th Design Automation Conference, 1998.
[21] J. Smith and G. De Micheli. Automated Composition of Hardware Components. In Proceedings of
the 35th Design Automation Conference, 1998.
[22] P. Chou et al. IPChinook: An Integrated IP-Based Design Framework for Distributed Embedded
Systems. In Proceedings of the 36th Design Automation Conference, 1999.
[23] M. Birnbaum and H. Sachs. How VSIA Answers the SoC Dilemma. IEEE Computer, 32: 42–50,
June 1999.
[24] C. Barna and W. Rosenstiel. Object-Oriented Reuse Methodology for VHDL. In Proceedings of the
Design, Automation and Test in Europe Conference, 1999.
[25] P. Schaumont et al., Hardware Reuse at the Behavioral Level, In Proceedings of the 36th Design
Automation Conference, 1999.
[26] F.J. Rammig, Web-based System Design with Components off The Shelf (COTS). In Proceedings
of the Forum on Design Languages, Tuebingen, 2000.
[27] SystemC, http://www.systemc.org.
[28] D. Lyonnard, S. Yoo, A. Baghdadi, and A.A. Jerraya, Automatic Generation of Application-Specific
Architectures for Heterogeneous Multiprocessor System-on-Chip. In Proceedings of the 38th Design
Automation Conference, Las Vegas, NV, June 2001.
[29] W.O. Cesário, G. Nicolescu, L. Gauthier, D. Lyonnard, and A.A. Jerraya. Colif: A Design
Representation for Application-Specific Multiprocessor SOCs. IEEE Design & Test of Computers,
18: 8–20, 2001.
[30] S. Yoo, G. Nicolescu, D. Lyonnard, A. Baghdadi, and A.A. Jerraya. A Generic Wrapper Architecture
for Multi-Processor SoC Cosimulation and Design. In Proceedings of the International Symposium
on HW/SW Codesign (CODES), 2001.
[31] M. Diaz-Nava and G.S. Okvist. The Zipper Prototype: A Complete and Flexible VDSL Multi-
Carrier Solution. ST Journal Special Issue xDSL, 2(1): 1/321/3, September 2001.
[32] W. Cesário, A. Baghdadi, L. Gauthier, D. Lyonnard, G. Nicolescu, Y. Paviot, S. Yoo, M. Diaz-Nava,
and A.A. Jerraya. Component-Based Design Approach for Multicore SoCs. In Proceedings of the
39th Design Automation Conference, New Orleans, June 2002.
25
Design and Programming of Embedded Multiprocessors: An Interface-Centric Approach
Pieter van der Wolf,
Erwin de Kock,
Tomas Henriksson,
Wido Kruijtzer, and
Gerben Essink
Philips Research
25.1 Introduction
25.2 Related Work
25.3 TTL Interface Requirements
25.4 TTL Interface
Inter-Task Communication • TTL Multi-Tasking Interface • TTL APIs
25.5 Multiprocessor Mapping
Source Code Transformation • Automated Transformation
25.6 TTL on an Embedded Multi-DSP
The Multi-DSP Architecture • TTL Implementation • Implementation Results • Implementation Conclusions
25.7 TTL in a Smart Imaging Core
The Smart Imaging Core • TTL Shells
25.8 Conclusions
Acknowledgments
References
25.1 Introduction
Modern consumer devices need to offer a broad range of functions at low cost and with low energy
consumption. The core of such devices is often a multiprocessor System-on-Chip (MPSoC) that implements
the functions as an integrated hardware/software solution. The integration technology used for
Originally published as: P. van der Wolf, E. de Kock, T. Henriksson, W. Kruijtzer, and G. Essink. Proceedings of
the CODES + ISSS 2004 Conference, Stockholm, September 8–10. ACM Press. Reprinted with permission.
building such MPSoCs from a set of hardware and software modules is typically based on low-level
interfaces for the integration of the modules. For example, the usual way of working is to use bus
interfaces for the integration of hardware devices, with ad-hoc mechanisms based on memory-mapped
registers and interrupts to synchronize hardware and software modules. Further, support for reuse is
typically poor, and a method for exploring trade-offs is often missing. As a consequence, MPSoC
integration is a labor-intensive and error-prone task, and opportunities for reuse of hardware and software
modules are limited.
Integration technology for MPSoCs should be based on an abstract interface for the integration of
hardware and software modules. Such an abstract interface should help to close the gap between the
application models used for specification and the optimized implementation of the application on a
multi-processor architecture. The interface must enable mapping technology that supports systematic
refinement of application models into optimized implementations. Such an interface and mapping
technology will help to structure MPSoC integration, thereby enhancing both the productivity and the
quality of MPSoC design.
We present design technology for MPSoC integration with an emphasis on three contributions:
1. We present TTL, a task-level interface that can be used both for developing parallel application
models and as a platform interface for integrating hardware and software tasks on a platform
infrastructure. The TTL interface makes services for inter-task communication and multi-tasking
available to tasks.
2. We show how mapping technology can be based on TTL to support the structured design and
programming of embedded multi-processor systems.
3. We show that the TTL interface can be implemented efficiently on different architectures. We present
both a software and a hardware implementation of the interface.
After discussing related work in Section 25.2, we present the requirements for the TTL interface in
Section 25.3. The TTL interface is presented in Section 25.4. Section 25.5 discusses the mapping technology,
exemplified by several code examples. We illustrate the design technology in Sections 25.6 and 25.7 with
two industrial design cases: a multi-DSP solution and a smart-imaging multi-processor. We present
conclusions in Section 25.8.
25.2 Related Work
Interface-based design has been proposed as a way to separate communication from behavior so that
communication refinement can be applied [1]. Starting from abstract token-passing semantics,
communication mechanisms are incrementally refined down to the level of physical interconnects. In References 2
and 3, a library-based approach is proposed for generating hardware and software wrappers for the
integration of heterogeneous sets of components. The wrappers provide the glue to integrate components
having different (low-level) interfaces. No concrete interface is proposed. In Reference 4, transaction-level
models (TLMs) on the device or component level are discussed.
In contrast, we present an abstract task-level interface, named TTL, which can be implemented as a
platform interface. This interface is the target for the mapping of tasks. Previously, several task-level
interfaces and their implementations have been developed at Philips [5–7]. TTL brings these interfaces
together in a single framework, to unify them as a set of interoperable interface types.
The data transfer and storage exploration (DTSE) method [8] of IMEC focuses on source code
transformation to optimize memory accesses and memory footprint. To our knowledge, the method does not
address the mapping of concurrent applications onto multiprocessor platforms. The Task Concurrency
Management [9] method focuses on run-time scheduling of tasks on multiprocessor platforms to
optimize energy consumption under real-time constraints. The interaction between these tasks is based on
low-level primitives such as mutexes and semaphores. As a result, the tasks are less reusable than TTL
tasks, and the design and transformation of tasks is more difficult and time-consuming.
The Open SystemC Initiative [10] provides a modeling environment to enable system-level design and
IP exchange. Currently, the environment does not standardize the description of tasks at the high level
of abstraction that we aim at. However, TTL can be made available as a class library for SystemC in the
future.
25.3 TTL Interface Requirements
We present a design method for implementing media processing applications as MPSoCs. A key ingredient
of our design method is the Task Transaction Level (TTL) interface. On the one hand, application
developers can use TTL to build executable specifications. On the other hand, TTL provides a platform
interface for implementing applications as communicating hardware and software tasks on a
platform infrastructure. The TTL interface enables mapping technology that automates the refinement
of application models into optimized implementations. Using the TTL interface to go from specification
to implementation allows the mapping process to be an iterative process, where during each step selected
parts of the application model are refined. Figure 25.1 illustrates the basic idea, with the TTL interface
drawn as dashed lines.
For the TTL interface to provide a proper foundation for our design method, it must satisfy a number
of requirements. First, it must offer well-defined semantics for modeling media processing applications.
It must allow parallelism and communication to be made explicit to enable mapping to multi-processor
architectures.
Further, the TTL interface must be an abstract interface. This makes the interface easy to use for
application development because the developer does not have to consider low-level details. An abstract
interface also helps to make tasks reusable, as it hides underlying implementation details. For example, if
a task uses an abstract interface for synchronization with other tasks, it can be unaware and independent
of the implementation of the synchronization with, for example, semaphores or some interrupt-based
scheme.
The platform infrastructure makes services available to tasks via the TTL platform interface. Specifically,
these are services for inter-task communication, multi-tasking, and (re)configuration. Rather than offering
a low-level interface and implementing, for example, synchronization as part of all the tasks, we factor
out such generic services from the tasks to implement them as part of the platform infrastructure. This
implementation is done once for a platform, optimized for the targeted application domain and the
underlying multiprocessor architecture.
An abstract platform interface provides freedom for implementing the platform infrastructure. It must
allow a broad range of platform implementations, including different multiprocessor architectures. For
example, both shared memory and message-passing architectures should be supported. Further, the
[Figure 25.1 artwork not reproduced: tasks connect through the TTL interface (dashed lines) both in the parallel application model and, after mapping, on the platform infrastructure.]
FIGURE 25.1 TTL interface for building parallel application models and implementing them on a platform
infrastructure.
[Figure 25.2 artwork not reproduced: Tasks 1 and 2 run on a CPU over a software shell exposing the TTL API; Task 3 is an ASP integrated via a hardware shell exposing the TTL hardware interface; both shells sit on the interconnect.]
FIGURE 25.2 TTL interface as software API and as hardware interface in example architecture.
abstraction allows critical parts of a platform implementation to be optimized transparently and enables
evolution of a platform implementation as technology evolves. For example, smooth transition from
bus-based interconnects towards the use of network-on-chip technology should be supported.
The TTL interface must allow efficient implementations of the platform infrastructure and the tasks
integrated on top of it. To enable integration of hardware and software tasks, the interface must be available
both as an API and as a hardware interface. An example of how the TTL interface could manifest itself in
a simple multiprocessor architecture is shown in Figure 25.2.
In the left part of Figure 25.2 the TTL interface is implemented as an API of a software shell executing
on a CPU. Software tasks executing on the CPU can access the platform services via the API. In the right
part of Figure 25.2 a task is implemented as an application-specic processor (ASP). The TTL interface for
integrating the ASP is available as a hardware interface. A hardware shell implements the platform services
on top of a lower interconnect. Such interconnect could, for example, have an interface like AXI [11],
OCP [12], or DTL [13].
25.4 TTL Interface
In this section we present the TTL interface. Specifically, we discuss the TTL interface for inter-task
communication and multi-tasking services. We do not discuss reconfiguration. In this chapter all task
graphs are static.
25.4.1 Inter-Task Communication
We define the following terminology and associated logical model for the communication between tasks.
The logical model provides the basis for the definition of the TTL inter-task communication interface.
It identifies the relevant entities and their relationships (see Figure 25.3).
A task is an entity that performs computations and that may communicate with other tasks. Multiple
tasks can execute concurrently to achieve parallelism. The medium through which the data communication
takes place is called a channel. A task is connected to a channel via a port. A channel is used to transfer
values from one task to another. A variable is a logical storage location that can hold a value. A private
variable is a variable that is accessible by one task only. A token is a variable that is used to hold a value
that is communicated from one task to another. A token can be either full or empty. Full tokens are tokens
that contain a value. Empty tokens do not contain a valid value, but merely provide space for a task to put
a value in. We also refer to full and empty tokens as data and room, respectively.
Tasks communicate with other tasks by calling TTL interface functions on their ports. Hence, a task has
to identify a port when calling an interface function. We focus on streaming communication: communicating
tasks exchange sequences of values via channels. A set of communicating tasks is organized as a task
graph.
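The entities of this logical model can be sketched as plain C data structures. This is our own illustrative encoding, not part of TTL (the type and field names are hypothetical); it only makes the task/port/channel/token relationships concrete:

```c
#include <stddef.h>

/* A token: a storage slot in a channel that is either empty (room)
 * or full (holds a communicated value). */
typedef struct {
    int  full;     /* nonzero when the slot holds a valid value */
    long value;    /* token payload; one data type per channel  */
} Token;

/* A channel: the medium through which values travel from one task
 * to another, realized as a bounded set of tokens. */
typedef struct {
    Token *tokens;
    size_t capacity;   /* number of tokens in the channel */
} Channel;

/* A port: a task's connection point to exactly one channel. Tasks
 * name a port in every TTL interface call. */
typedef struct {
    Channel *channel;
    int      is_producer;  /* producers fill tokens, consumers empty them */
} Port;
```

A multi-cast channel, as described in Section 25.4.1.1, would extend this with one producing port and several consuming ports per channel.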
[Figure 25.3 artwork not reproduced: a task with a private variable holding a value connects via a port to a channel containing full and empty tokens.]
FIGURE 25.3 Logical model for inter-task communication.
TABLE 25.1 TTL Interface Types
Acronym Full name
CB Combined blocking
RB Relative blocking
RN Relative non-blocking
DBI Direct blocking in-order
DNI Direct non-blocking in-order
DBO Direct blocking out-of-order
DNO Direct non-blocking out-of-order
25.4.1.1 Interface Types
Considering the varying needs for modeling media processing applications and the great variety in
potential platform implementations, it is not likely that a single narrow interface can satisfy all requirements.
For example, applications may process tokens of different granularities, where streams of tokens may or
may not be processed strictly in order. Platform implementations may have different costs associated with
synchronization between tasks, data transfers, and the use of memory. Certain architectures efficiently
implement message-passing communication, whereas shared memory architectures offer a single address
space for memory-based communication between tasks.
In our view, designers are best served if they are offered a palette of communication styles from which
they can use the most appropriate one for the problem at hand. The TTL interface offers support for
different communication styles by providing a set of different interface types. Each interface type is easy to
use and implement. All interface types are based on the same logical model, which enables interoperability
across interface types. A task designer must select an interface type for each port. Different interface types
can be used in a single model, even in a single task. This allows models to be refined iteratively, where
during each step selected parts of a model are refined.
In defining the interface types, we have to choose which properties to support and which properties
to combine in a particular interface type. Some properties hold for all interface types. Specifically, all
channels are uni-directional and support reliable and ordered communication. TTL supports arbitrary
communication data types, but each individual channel can communicate tokens of a single type only.
Multi-cast is supported, that is, a channel has one producing task but can have multiple consuming tasks.
The TTL interface types are listed in Table 25.1.
25.4.1.2 Interface Type CB
The interface type CB provides two functions for communication between tasks:
write (port, vector, size)
read (port, vector, size)
The write function is used by a producer to write a vector of size values into the channel connected
to port. The read function is used by a consumer to read a vector of size values from the channel
connected to port. The write and read functions are also available as scalar functions that operate
on a single value at a time. The write and read functions are blocking functions, that is, they do not
return until the complete vector has been written or read, respectively. This interface type is based on our
earlier work on YAPI [5].
This interface type is the most abstract TTL interface type. Since it hides low-level details from the
tasks, it is easy to use and supports reuse of tasks. The write and read functions perform both the
synchronization and the data transfer associated with communication. That is, they check for availability
of room/data, copy data to/from the channel, and signal the availability of data/room. The length of the
communicated vectors may exceed the number of tokens in the channel. The platform implementation
may transfer such vectors in smaller chunks, transparent to the communicating tasks [14]. This interface
type is named CB as it combines (C) synchronization and data transfer in a single function with
blocking (B) semantics.
This interface type can be implemented efficiently on message-passing architectures or on shared
memory architectures where the processors have local buffers that can hold the values that are read or
written. However, on shared memory architectures where the processors do not have such local buffers, this
interface type may yield overhead in copying data between private variables, situated in shared memory,
and the channel buffer in shared memory.
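As an illustration, the CB semantics can be sketched in a few lines of C. This is a hypothetical single-threaded model of one channel of int tokens, not the TTL implementation: blocking is reduced to an assertion because there is no scheduler to suspend on, and vectors longer than the channel are not chunked here, but the combined synchronize-and-copy behavior of write and read is the same.

```c
#include <assert.h>
#include <stddef.h>

#define CAPACITY 4   /* number of tokens in the channel */

/* A CB-style channel: a bounded ring buffer of int tokens. */
typedef struct {
    int    buf[CAPACITY];
    size_t head, count;  /* oldest full token; number of full tokens */
} CbChannel;

/* write(port, vector, size): synchronize and copy in one call.
 * In a real platform this blocks until room is available. */
static void cb_write(CbChannel *ch, const int *vec, size_t size) {
    for (size_t i = 0; i < size; i++) {
        assert(ch->count < CAPACITY);             /* "block" until room */
        ch->buf[(ch->head + ch->count) % CAPACITY] = vec[i];
        ch->count++;                              /* signal data        */
    }
}

/* read(port, vector, size): blocks until the whole vector is read. */
static void cb_read(CbChannel *ch, int *vec, size_t size) {
    for (size_t i = 0; i < size; i++) {
        assert(ch->count > 0);                    /* "block" until data */
        vec[i] = ch->buf[ch->head];
        ch->head = (ch->head + 1) % CAPACITY;
        ch->count--;                              /* signal room        */
    }
}
```

After cb_write on the producer's port, cb_read on the consumer's port delivers the same values in order; neither task ever sees buffer indices or token state, which is what makes CB the most abstract and most reusable interface type.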
25.4.1.3 Interface Types RB and RN
To provide more flexibility for making trade-offs upon task implementation, the other TTL interface types
offer separate functions for synchronization and data transfer. The availability of room or data can be
checked explicitly by means of an acquire function and can be signaled by means of a release function.
The acquire function can be blocking or non-blocking. A non-blocking acquire function does not wait
for data or room to be available, but returns immediately to report success or failure. The functions for
the producer are:
reAcquireRoom (port, count)
tryReAcquireRoom (port, count)
store (port, offset, vector, size)
releaseData (port, count)
reAcquireRoom is the blocking acquire function and tryReAcquireRoom is the non-blocking
acquire function. The acquire and release functions synchronize for vectors of count tokens at a time.
The acquire functions are named reacquire since they also acquire tokens that have previously been
acquired and not yet released. That is, they do not change the state of the channel. This helps to reduce
the state saving effort for tasks as the acquire function can simply be issued again upon a next task
invocation. This behavior is similar to GetSpace in Reference 6. Data accesses can be performed on
acquired room with the store function, which copies a vector of size values to the acquired empty
tokens. The store function can perform out-of-order accesses on the acquired empty tokens using a
relative reference offset. An offset of 0 refers to the oldest acquired and not yet released token. The
store function is also available as a scalar function. The releaseData function releases the count
oldest acquired tokens as full tokens on port.
The functions for the consumer are:
reAcquireData (port, count)
tryReAcquireData (port, count)
load (port, offset, vector, size)
releaseRoom (port, count)
These interface types are named RB and RN with the R of relative, B of blocking, and N of non-blocking.
Offering separate functions for synchronization and data transfer allows data transfers to be performed
on a different granularity and rate than the related synchronizations. This may, for example, be used
to reduce the cost of synchronization by performing synchronization at a coarse grain outside a loop,
while performing computations and data transfers at a finer grain inside the loop. This interface type can
be used to avoid the overhead of memory copies on shared memory architectures at a lower cost than
with CB, as coarse-grain synchronization can be combined with small local buffers, for example, registers,
for fine-grain data transfers. Additionally, for some applications the support for out-of-order accesses
helps to reduce the cost of private variables that are needed in a task. Further, with this interface type,
tasks can selectively load only part of the data from the channel, thereby allowing the cost of data transfers
to be reduced. The drawback, compared to CB, is that these interface types are less abstract.
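The producer-side RB functions can be sketched the same way, again as a hypothetical single-threaded C model of one int channel (scalar store variant, blocking modeled by an assertion): reacquire leaves the channel state untouched, store addresses acquired room by a relative offset (0 = oldest acquired, unreleased token), and releaseData is the only call that changes the channel state.

```c
#include <assert.h>
#include <stddef.h>

#define CAP 8   /* number of tokens in the channel */

typedef struct {
    int    buf[CAP];
    size_t head, full;   /* oldest full token; number of full tokens */
} RbChannel;

/* Blocking acquire of 'count' empty tokens. Idempotent: reacquiring
 * does not change the channel state, which reduces the state a task
 * must save between invocations. */
static void reAcquireRoom(RbChannel *ch, size_t count) {
    assert(CAP - ch->full >= count);   /* "block" until room exists */
}

/* Scalar store into acquired room; offset 0 is the oldest acquired,
 * not yet released empty token. Out-of-order stores are allowed. */
static void store(RbChannel *ch, size_t offset, int value) {
    ch->buf[(ch->head + ch->full + offset) % CAP] = value;
}

/* Release the 'count' oldest acquired empty tokens as full tokens. */
static void releaseData(RbChannel *ch, size_t count) {
    assert(ch->full + count <= CAP);
    ch->full += count;
}
```

One reAcquireRoom for, say, four tokens followed by four store calls and a single releaseData shows the intended pattern: coarse-grain synchronization wrapped around fine-grain, possibly out-of-order, data transfers.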
25.4.1.4 Interface Type DBI and DNI
The RB and RN interface types hide the memory addresses of the tokens from the tasks. This supports
reuse of tasks. However, it may also incur inefficiencies upon data transfers, like function call overhead,
accesses to the channel administration, and address calculations. To avoid such inefficiencies, TTL offers
interface types that support direct data accesses. In these interface types the acquire functions return a
reference to the acquired token in the channel. This reference can subsequently be used by the task to
directly access the data/room in the channel without using a TTL interface function. The functions for
the producer are:
acquireRoom (port, &token)
tryAcquireRoom (port, &token)
token->field = value
releaseData (port)
The functions for the consumer are:
acquireData (port, &token)
tryAcquireData (port, &token)
value = token->field
releaseRoom (port)
The acquire and release functions acquire/release a single token at a time. Supporting vector operations for
these interface types would result in a complex interface. For example, it would expose the wrap-around
in the channel buffer or would require a vector of references to be returned. Since tasks must still be
able to acquire more than one token, these acquire functions acquire the first unacquired token and
change the state of the channel, unlike the reacquire functions of RB and RN. The release functions release
the oldest acquired token on port. The interface types are named DBI and DNI, with D for direct,
B for blocking, N for non-blocking, and I for in-order, as tokens are released in the same order as they are
acquired. These interface types can be implemented efficiently on shared memory architectures [7] and
are suited for software tasks that process coarse-grain tokens.
25.4.1.5 Interface Type DBO and DNO
In some cases tasks do not finish the processing of data in the same order as the data was acquired.
In particular when large tokens are used, it should be possible to release a token as soon as a task is finished
with it. For this purpose TTL offers the DBO and DNO interface types (O for out-of-order). The only
difference with the DBI and DNI interface types is in the release functions:
releaseData (port, &token)
releaseRoom (port, &token)
The token reference allows the task to specify which token should be released. The out-of-order release
supports efcient use of memory at the cost of a more complex implementation of the channel.
25.4.2 TTL Multi-Tasking Interface
To support different forms of multi-tasking, TTL offers different ways for tasks to interact with the
scheduler. To this end, TTL supports three task types.
The task type process is for tasks that have their own (virtual) thread of execution and that do not
explicitly interact with the scheduler. This task type is suited for tasks that have their private processing
resource or that rely on the platform infrastructure to perform task switching and state saving implicitly.
For example, this task type is well suited for software tasks executing on an OS.
The task type co-routine is for cooperative tasks that interact explicitly with the scheduler at points
in their execution where task switching is acceptable. For this purpose TTL offers a suspend function.
This task type may be used to reduce the task-switching overhead by allowing the task to suspend itself at
points where only little state needs to be saved.
The task type actor is for fire-exit tasks that perform a finite amount of computations and then return
to the scheduler, similar to a function call. Unless explicitly saved, state is lost upon return. This task type
may be used for a set of tasks that have to be scheduled statically.
25.4.3 TTL APIs
The TTL interface is available both as a C++ and a C API. The use of C++ gives cleaner descriptions of
task interfaces, due to C++ support for templates and function overloading. We use C to link to software
compilers for embedded processors and hardware synthesizers since most of them do not support C++
as input language. For both the C++ API and the C API we offer a generic run-time environment, which
can be used for functional modeling and verification of TTL application models.
25.5 Multiprocessor Mapping
In this section we present a systematic approach to map applications efficiently onto multiprocessors.
The key advantage of TTL is that it provides a smooth transition from application development to
application implementation. In our approach, we rewrite the source code of applications to improve
efficiency. We focus on source code transformations for multiprocessor architectures taking into account
costs of memory usage, synchronization cycles, data transfer cycles, and address generation cycles.
We do not consider algorithmic transformations because these transformations are application-specific.
Typically, application developers perform these transformations. We also do not consider code
transformations for single target processors because these transformations are processor-specific. We assume that
processor-specific compilers and synthesizers support these transformations, although in today's practice
programmers also write processor-specic C.
In the remainder of this section we present methods and tools to transform source code. First we
present source code transformations to illustrate the advantages of using TTL. Next we present tools that
we developed to automate these transformations.
25.5.1 Source Code Transformation
We use a simple example to illustrate the use of TTL. The example consists of an inverse quantization
(IQ) task that produces data for an inverse zigzag (IZZ) task; see Figure 25.4. We focus on the interaction
between these two tasks.
The TTL interface supports different inter-task communication interface types that provide a trade-off
between abstraction and efficiency. We illustrate this by means of code fragments. To save space we indicate
scopes by means of indentation rather than curly braces.
25.5.1.1 Optimization for Single Interface Types
The most abstract and easy-to-use interface type is CB, which combines synchronization and data transfer
in write and read functions. Figure 25.5 shows a fragment of the IQ task that reads input (Line 08),
IQ IZZ
FIGURE 25.4 IQ and IZZ example.
01 void IQ::main()
02 while (true)
03 for(int j=0; j<vi; j++)
04 for(int k=0; k<hi; k++)
05 for(int l=0; l<64; l++)
06 VYApixel Cin;
07 VYApixel Cout;
08 read(CinP, Cin);
09 Cout = QTable[ti][l]*Cin;
10 write(CoutP, Cout);
FIGURE 25.5 IQ using interface type CB.
01 void IZZ::main()
02 while (true)
03 VYApixel Cin[64];
04 VYApixel Cout[64];
05 read(CinP, Cin, 64);
06 for (int i=0; i<64; i++)
07 Cout [zigzag[i]] = Cin[i];
08 write(CoutP, Cout, 64);
FIGURE 25.6 IZZ using interface type CB.
01 void IQ::main()
02 while (true)
03 for(int j=0; j<vi; j++)
04 for(int k=0; k<hi; k++)
05 VYApixel Cout[64];
06 for(int l=0; l<64; l++)
07 VYApixel Cin;
08 read(CinP, Cin);
09 Cout[l] = QTable[ti][l]*Cin;
10 write(CoutP, Cout, 64);
FIGURE 25.7 IQ using interface type CB and vector write.
performs the inverse quantization (Line 09) and writes output using a scalar write operation (Line 10).
The write function terminates when the value of variable Cout has been transferred to the channel. This is
repeated for all 64 values of a block (Line 05) and for all blocks in a minimum coding unit (Lines 03 and 04).
Figure 25.6 shows a fragment of the IZZ task using a vector read function (Line 05). The read function
terminates when 64 values from the channel have been transferred to the variable Cin. Subsequently, these
values are reordered (Lines 06 and 07) and written to the output (Line 08). The channel from the IQ task
to the IZZ task implements the write and read functions that handle both the synchronization and the
data transfer. Note that the length of the communicated vectors is not bounded by the number of tokens
in the channel, which makes tasks independent of their environment.
A potential performance problem of the IQ task in Figure 25.5 is that for each pixel, the output
synchronizes with the input of the IZZ task. In Reference 15 we demonstrated that this is costly in terms
of cycles per pixel if the write function is implemented in software. We can solve this problem by calling
the write function outside the inner loop as shown in Figure 25.7 in Line 10. To this end, we need to
store a block of pixels locally in the IQ task (Line 05). Similar source code transformations to reduce the
synchronization rate are possible for the other TTL interface types.
25.5.1.2 Optimization across Interface Types
The disadvantage of the IQ task of Figure 25.7 is the additional local memory requirement. Interface type
RB splits synchronization and data transfer in separate functions such that the synchronization rate can
be decreased without additional local memory requirements.
01 void IQ::main()
02 while (true)
03 for(int j=0; j<vi; j++)
04 for(int k=0; k<hi; k++)
05 reAcquireRoom(CoutP, 64);
06 for(int l=0; l<64; l++)
07 VYApixel Cin;
08 read(CinP, Cin);
09 store(CoutP, l, QTable[ti][l]*Cin);
10 releaseData(CoutP, 64);
FIGURE 25.8 IQ using interface type RB.
01 void IZZ::main()
02 while (true)
03 VYApixel Cout[64];
04 reAcquireData(CinP, 64);
05 for(int i=0; i<64; i++)
06 VYApixel Cin;
07 load(CinP, i, Cin);
08 Cout[zigzag[i]] = Cin;
09 write(CoutP, Cout, 64);
10 releaseRoom(CinP, 64);
FIGURE 25.9 IZZ using interface type RB.
Figure 25.8 shows how to decrease the synchronization rate from pixel rate to block rate at the output
of the IQ task. Note that here we assume that the channel can store at least 64 pixels, otherwise the
call of the function reAcquireRoom at Line 05 will never terminate. This assumption on the
environment is not needed with interface type CB. Hence, the IQ task of Figure 25.8 puts more constraints on
its use.
Figure 25.9 shows the IZZ task with separate synchronization and data transfer. The IQ task and the IZZ
task do not need to store blocks locally to interact with each other. They share the tokens in the channel.
If the IQ task and the IZZ task need to execute concurrently, then the channel must be able to contain two
blocks, that is, 128 pixels.
The load function (Figure 25.9, Line 07) and the store function (Figure 25.8, Line 09) use relative
addressing. The advantage of this is that the address generation for the FIFO can be implemented in the
load and store functions. Hence, address generation is hidden from the tasks.
Interface type DBI uses direct addressing rather than relative addressing. Direct addressing has advan-
tages if the tokens of a channel and the variables of a task are stored in the same memory. In that case the
tokens and the variables should be mapped onto the same memory locations to avoid in-place copying
in the memory during the transfer of data from and to the tokens. Such copying occurs for instance
in Figure 25.9 at Line 07 where a value from the channel is copied into variable Cin. Furthermore, the
cost of calling the load and store functions can be avoided. The disadvantage of direct addressing is that
the addresses of the tokens are exposed to tasks. To avoid that tasks must take care of wrap-around in
the FIFO, only scalar functions are available. Hence, typically it is more efficient to choose larger tokens if
the synchronization rate has to be low. Figure 25.10 shows the IQ task using direct addressing on its output.
We declare a pointer Cout in Line 04 that is given a value in Line 05. After the room has been acquired,
Cout points to a block of 64 pixels. The channel data type is also block of 64 pixels. The pointer Cout is
used to set the value of the pixels in Line 09 avoiding a call of a store function. Similarly, Figure 25.11
shows the IZZ task using direct addressing on its input avoiding both a call to a load function and a copy
operation from the channel to a variable. Note that the granularity of synchronization between the IQ
output and the IZZ input must be identical, because only scalar functions are available. For this reason,
the IQ task and the IZZ task have become less re-usable.
01 void IQ::main()
02 for(int j=0; j<vi; j++)
03 for(int k=0; k<hi; k++)
04 VYApixel *Cout;
05 acquireRoom(CoutP, Cout);
06 for(int l=0; l<64; l++)
07 VYApixel Cin;
08 read(CinP, Cin);
09 Cout[l] = QTable[ti][l]*Cin;
10 releaseData(CoutP);
FIGURE 25.10 IQ using interface type DBI.
01 void IZZ::main()
02 while (true)
03 VYApixel *Cin;
04 VYApixel Cout[64];
05 acquireData(CinP, Cin);
06 for(int i=0; i<64; i++)
07 Cout[zigzag[i]] = Cin[i];
08 write(CoutP, Cout, 64);
09 releaseRoom(CinP);
FIGURE 25.11 IZZ using interface type DBI.
01 void IZZ::main()
02 if (!tryReAcquireData(CinP, 64)) return;
03 if (!tryReAcquireRoom(CoutP, 64)) return;
04 for (unsigned int i=0; i<64; i++)
05 VYApixel Cin;
06 load(CinP, i, Cin);
07 store(CoutP, zigzag[i], Cin);
08 releaseRoom(CoutP, 64);
09 releaseData(CinP, 64);
FIGURE 25.12 IZZ using interface type RN.
25.5.1.3 Non-Blocking Interface Types
So far, we only discussed interface types that provide blocking synchronization functions. These interfaces
are easy to use because programmers do not have to program what should happen when access to tokens
is denied. Sometimes blocking synchronization is not efficient, for instance, if the state of a task is large
such that it is costly to save it. In that case it may be more efficient to let the programmer decide what
should happen. For this reason, non-blocking synchronization functions are needed. Figure 25.12 shows
how the IZZ task can be modeled as an actor. When the actor is fired, it first checks for available data on its
input (Line 02) and then for available room on its output (Line 03). If the data is available but the room
is not available, then the actor can return without saving its state. In the next firing, it can redo the checks
since the tryReAcquire functions do not modify the state of the channels. If both the data and the room
are available, it is guaranteed that the actor can complete its execution.
25.5.1.4 Channel and Task Merging and Splitting
Channel and task merging and splitting are important for load balancing. In Reference 15 we applied task
merging to reduce the data transfer load, since the cost of data transfer from the IQ task to the IZZ task
is large compared to the amount of computation that the IZZ task performs. Figure 25.13 shows how the
IQ task and the IZZ task can be merged.
The merging of the two tasks is based on the observation that the loop structure of the IZZ task fits in
the loop structure of the IQ task. If one wants to merge two arbitrary tasks, this is not always the case.
01 void IQ_IZZ::main()
02 while (true)
03 for(int j=0; j<vi; j++)
04 for(int k=0; k<hi; k++)
05 VYApixel Cin[64];
06 VYApixel Cout[64];
07 read(CinP, Cin, 64);
08 for(int l=0; l<64; l++)
09 Cout[zigzag[l]]=QTable[ti][l]*Cin[l];
10 write(CoutP, Cout, 64);
FIGURE 25.13 Merged IQ and IZZ task.
01 void IQ_IZZ::main()
02 while (true)
03 VYApixel mcu[vi][hi][64];
04 IQ(mcu);
05 for(int j=0; j<vi; j++)
06 for(int k=0; k<hi; k++)
07 IZZ(mcu[j][k]);
08
09 void IQ_IZZ::IQ(mcu)
10 for(int j=0; j<vi; j++)
11 for(int k=0; k<hi; k++)
12 for(int l=0; l<64; l++)
13 VYApixel Cin;
14 read(CinP, Cin);
15 mcu[j][k][l] = QTable[ti][l]*Cin;
16
17 void IQ_IZZ::IZZ(block)
18 VYApixel Cout[64];
19 for(int i=0; i<64; i++)
20 Cout[zigzag[i]] = block[i];
21 write(CoutP, Cout, 64);
FIGURE 25.14 Statically scheduled IQ and IZZ tasks.
A more generic approach to statically schedule the firings of tasks is exemplified in Figure 25.14. The new
task IQ_IZZ executes an infinite loop from which it calls the IQ and IZZ tasks by means of function calls.
The communication between the IQ function and the IZZ function does not have to be synchronized
explicitly because the calling order of the functions guarantees the availability of data and room. For this
reason, we replace the channel by a variable mcu (minimum coding unit) that is declared in Line 03. The
blocks in the mcu are passed by reference to the IQ function and the IZZ function.
25.5.2 Automated Transformation
We aim to automate the above-mentioned source code transformations to support the proposed method
by tools. It is not the goal to automate the design decision making process, because experiences in high-level
synthesis and compilation tools show that it is hard to automate this while maintaining transparency for
users. Our goal is to automate the rewriting of the source code according to the design decisions of users.
This approach has two advantages. First, design decisions are explicitly formulated rather than implicitly
coded in the source code. Second, the source code can be rewritten automatically such that modifications
and bug fixes in the original specification can be inserted automatically in architecture-specific versions
of the code. In this way a large set of design descriptions can be kept consistent.
25.5.2.1 Parser Generation
The first step in automatic source code transformation is to be able to parse programs and to build data
types that support source code transformation. For this purpose, we use an in-house tool called C3PO
in combination with the publicly available parser generator tool ANTLR [16]. C3PO takes a grammar as
input and synthesizes data types for the non-terminals in the rules of the grammar as well as input for
ANTLR. We use C3PO and ANTLR to generate a C++ parser and a heterogeneous abstract syntax tree
(AST). We use the same tools to generate visitors for the AST that transform the code. After transformation,
we generate new C++ or C code from the AST. The transformations that we target are typically inter-file
transformations. For this reason, we process all source files simultaneously as opposed to the usual
single-file compilation for single processors.
25.5.2.2 Iterative Transformation
Source code transformation typically is an iterative process in which many versions of the same program
are generated. Automatic source code transformation has the advantage that the generated source code is
consistently formatted and that the transformations can be repeated if necessary. This makes it possible to
keep all versions of a program consistent automatically. For version management we have adopted CVS.
Each iteration uses three versions of the source code. The rst version is the result of the previous iteration
or the original code if it is the rst iteration. The second version is manually augmented source code that is
the input for the automatic transformation. The augmentation can contain for instance design constraints
and design decisions. The third version is the code that is automatically generated. If the original code
changes, for instance, due to bug xes or specication changes, then the changes can be automatically
inserted in the second version of the code by the version management tool. The modied second version
of the code is then given as input to the transformation tools in order to produce the third version of the
code that is the starting point for the next iteration.
25.5.2.3 Automatic Interface Type Refinement
We illustrate automatic interface refinement using the example of IQ and IZZ. The original source code of
the tasks is given in Figure 25.5 and Figure 25.6. The resulting code is given in Figure 25.8 and Figure 25.9.
The complete code is distributed over six files: a source file and a header file for the definition of each
of the two tasks, and a source file and a header file for the definition of the task graph that instantiates
and connects the two tasks. All these files require changes if the communication between the two tasks
changes. This has been automated in the following way. We augment the source code of the tasks with
synchronization constraints. In Figure 25.5 between Line 04 and 05 we add the line ENTRY(P) and at
the end of the text we add the line LEAVE(P), both in the scope of the loop in Line 04. This annotation
means that we want to synchronize the output of the IQ task on blocks of 64 pixels. Similarly we add
synchronization constraints ENTRY(C) and LEAVE(C) to the IZZ task in Figure 25.6 between Line 04
and 05 and at the end of the text, respectively, both in the scope of the loop of Line 02. Assuming that
the channel between the two tasks is called iqizzbuf, we provide the transformation tool with the design
information shown in Figure 25.15.
This information means that we want the iqizzbuf channel to have 64 tokens (Line 01). Furthermore,
the channel should be implemented in data type Channel, it should handle tokens of type VYApixel, and
it should connect to interface type RB both for output and for input (Line 02). Line 03 and 04 denote the
synchronization constraints: the amount of consumption should not exceed the amount of production
01 iqizzbuf[64]
02 Channel<VYApixel> USING RbIn, RbOut
03 64*IZZ::C <= 64*IQ::P
04 64*IQ::P <= 64*IZZ::C+64
05 STORAGE IQ
06 Cout-> ../iqizzbuf TRANSFORMATION T1
07 STORAGE IZZ
08 Cin-> ../iqizzbuf TRANSFORMATION T2
09 SYNCHRONIZATION
10 IQ, IZZ -> iqizzbuf
FIGURE 25.15 Design constraints and decisions.
but the difference between the amount of production and consumption may not exceed the buffer capacity
of the channel. Line 06 and 08 denote that the variables Cout and Cin of the IQ task and the IZZ task,
respectively, should be mapped on the iqizzbuf channel using Transformation T1 and T2 that are available
in a library. This introduces the calls to load and store functions. The result of the call to the load function
in the IZZ task is stored in a new variable, also called Cin. Line 10 denotes that the IQ task and the IZZ
task should be synchronized using the iqizzbuf channel. This introduces the calls to acquire and release
functions at the positions indicated by the ENTRY and LEAVE annotations in the augmented source code.
The resulting source code is given in Figure 25.8 and Figure 25.9.
25.5.2.4 Processor and Channel Binding
The last phase of source code transformation is the link to existing compilers and synthesizers in order
to map the individual tasks to hardware and software. To this end, programmers specify a binding of
tasks to processor types and processor instances. From that information the necessary input, that is,
C files and makefiles, for compilation or synthesis to the target processor is generated. Furthermore,
the programmer specifies specific implementations of channels. For instance, the same interface type
can be implemented differently for intra-processor communication and for inter-processor
communication because of efficiency reasons. Each implementation has its own set of names for its interface
functions since function overloading is not available in C. The generated C code contains the data types
and function calls that correspond to the implementations of the channels that the programmer has
chosen.
25.5.2.5 Other Transformations
There are other transformations that are beyond the scope of this paper. We briefly mention them here.
We support structure transformation to change the hierarchy of task graphs. We support instance trans-
formations such that multiple instances of the same task or task graph can be transformed individually.
Finally, we plan to support channel and task merging and splitting [15] by connecting to the Compaan
tool suite [17].
25.6 TTL on an Embedded Multi-DSP
In this section we present the implementation of TTL on a multi-DSP. The objectives are to show (1) how
TTL can be implemented and that a TTL implementation is cheap, (2) trade-offs between the
implementation cost and the abstraction level of the TTL interfaces, and (3) how TTL supports the exploration
of trade-offs between, for example, memory use and execution cycles. The TTL implementation is done
without special hardware support. We first present the multi-DSP architecture. Then we describe how
the implementation of five TTL interface types has been done and we present quantitative results. Finally,
the results for an implementation of an MP3 decoder application are presented.
25.6.1 The Multi-DSP Architecture
The embedded multi-DSP is a template that allows an arbitrary number of DSPs [18]. Each DSP has
its own memory, which in limited ways can be accessed by (some of) the other DSPs. A DSP with
memory and peripherals is called a DSP subsystem (DSS); see Figure 25.16. The DSPs do not have
a shared address space. Communication between the DSSs is done through memory mapped uni-
directional point-to-point links. Thus, two DSPs may refer to a single memory location with different
addresses. Data may be routed from one point-to-point link to another and so on until it reaches its
destination.
In our instance, the DSP Epics7B from Philips Semiconductors was used. The DSP, which is mainly
used for audio applications, has a dual Harvard architecture with a 24-bit wide data path and 12-bit
coefficients.
[Figure: a 4 × 4 array of DSS blocks, connected to external interfaces on one side and, via a microprocessor interface, to a microprocessor on the other.]
FIGURE 25.16 Multi-DSP architecture. Here an instance with 16 DSP subsystems is shown.
25.6.2 TTL Implementation
There are two criteria to decide which TTL interface type to use for a certain application on a certain archi-
tecture. First the interface type must match the application characteristics. Second, the implementation
of the interface type on the target architecture must be efficient.
For audio applications, DBO and DNO are not needed because audio applications do not have large
amounts of data that are produced or consumed out-of-order. Therefore, the other five interface types have
been implemented on the multi-DSP architecture in order to determine the cost of the implementations.
Most of the TTL functions have been implemented in optimized assembly code. It is justifiable to spend
the effort because the TTL functions are implemented only once and used by many applications.
A TTL channel implementation consists of two parts, the channel buffer and the channel administration.
In the multi-DSP architecture, no special-purpose memories exist, so the channel buffer is a circular
buffer in a RAM. This is where the tokens are stored. The channel administration is a structure that holds
information about the state of the channel. In the multi-DSP architecture, the channel buffer has to be
located in the memory of the DSS where the consumer is executed. This is due to the uni-directional
point-to-point links in the architecture.
25.6.2.1 Channel Administration
The channel administration keeps track of how many tokens there are in the channel and how many
of those are full and empty respectively. It also provides a way to get the next full and the next empty
token from the channel. When the channel buffer is implemented as a circular buffer in a RAM, the
channel administration can be implemented in two different ways with two variables to keep track of
the state of the channel. The rst alternative is to use two pointers, one to the start of the empty tokens
and one to the start of the full tokens. The second alternative is to have one pointer and one counter,
for example, a pointer to the start of the full tokens and a counter telling how many full tokens there
are in the channel. This requires atomic increment and decrement operations, which are not supported
on the multi-DSP architecture. Therefore the channel administration is implemented with two pointers.
The producer updates the pointer to the empty tokens (write_pointer) and the consumer updates the
pointer to the full tokens (read_pointer) and thereby no atomic operations are needed [7]. Another
method to avoid the need for atomic updates is to use two counters and two pointers. That method is
explained in Section 25.7.
When the two pointers point to the same memory location, it is not clear if the channel is full or
empty unless wrap-around counters are used. Wrap-around counters imply expensive acquire functions.
To avoid that problem we have implemented a channel administration that does not allow the pointers to
point to the same memory location unless the channel is empty. We thereby have a memory overhead of
the size of one token in the channel buffer. In the indirect interfaces the token size is always one word.
Both the producer and the consumer need to access the channel administration. In the multi-DSP
there are no shared memories, therefore the channel administration has to be duplicated and present in
Producer side: READ_POINTER, WRITE_POINTER, BASE_POINTER, CH_SIZE, BASE_RA, REMOTE_POINTER
Consumer side: WRITE_POINTER, READ_POINTER, BASE_POINTER, CH_SIZE, BASE_RA, REMOTE_POINTER
FIGURE 25.17 Double channel administration for the indirect interface types.
01 Boolean tryReAcquireData(port p, uint n) {
02 uint available_data;
03 available_data = (p->write_pointer - p->read_pointer)
modulo p->ch_size;
04 if (available_data >= n)
05 return true;
06 else
07 return false; }
FIGURE 25.18 Pseudo code for tryReAcquireData (RN).
the two DSSs involved in the communication. The two copies are called the local and remote channel
administration. See Figure 25.17. Since the producer and the consumer refer to the channel buffer with
different addresses, this must be taken into consideration when updating the remote channel adminis-
tration. We keep a pointer to the base address in the local address space (base_pointer) and a pointer
to the base address in the remote address space (base_ra). These two pointers are used to calculate the
pointer value to be stored in the remote channel administration. The channel administrations as well as
the channel buffer must be located in memory areas that are accessible via the point-to-point links.
As an example of the implementation of the TTL functions, pseudo code for the tryReAcquireData
function in RN is shown in Figure 25.18.
25.6.3 Implementation Results
The acquire functions for the RN interface type use 9 instructions. The release functions use 15 instruc-
tions. The vector load and store functions use a loop unrolling of 2 and achieve 2.5 instructions per data
word with an overhead of 24 instructions to set up the data transfer. The scalar load and store functions
are inlined in the task code and each use 10 instructions.
The acquire functions for the direct interface type DNI use between 19 and 33 instructions, dependent
on the state of the channel. The release functions use between 29 and 38 instructions. No data transfer
functions are used. The cost of the data transfers is comparable to the cost of accessing private data
structures in the task.
For the blocking interface types, it is not as easy to determine the cost in terms of instructions for the
individual acquire functions, because they may include task switches. However, an acquire function in
RB that does not trigger a task switch uses 18 instructions. The release functions and the data transfer
functions for RB have the same cost as those for RN. The same applies to the release functions of DBI with
respect to DNI.
In CB, synchronization and data transfer are combined into a single function. The cost of the
implementation is approximately the sum of the costs of the three corresponding functions in RB.
25.6.3.1 Evaluation Application
An MP3 decoder application has been used for the evaluation of the TTL implementations on the multi-
DSP. The MP3 decoder application was available as a sequential C program. The application was converted
TABLE 25.2 Simulation Results for the Whole Application

TTL IF type    #Cycles       Part in TTL (%)    #Memory words
CB             45,579,603    2.9                12493
RB             45,551,243    2.8                12494
RN             45,505,950    2.2                12365
DBI            45,152,454    1.1                9162
DNI            45,108,086    0.5                9041
into a TTL task and additional TTL tasks were added for mimicking the rest of a complete application
and for handling the interaction with the simulation environment.
The application has been implemented with all five interface types. The RN and DNI implementations
use TTL actors and the other types use TTL processes. The application has also been implemented with
four different granularities of the communication for the RN interface type. In the implementations with
the direct interface types, DNI and DBI, the channel between the input task and the MP3 task uses RN
and RB interface types respectively. This is due to the fact that the amount of communicated data on that
channel is data dependent.
25.6.3.2 Simulation Results
Table 25.2 shows the results of the various interface types with frame-based communication. All channel
buffers have been sized so that they can hold one frame. The memory is the total data memory for the
whole application. The number of cycles is the number used by the whole application to decode a test file.
The blocking implementations use somewhat more memory and have some cycle overhead compared
to the non-blocking implementations, when comparing RB to RN and DBI to DNI. This is due to the fact
that the multi-tasking costs both memory for storing register contents and cycles to save and restore the
register contents. The DNI and DBI interface types use considerably less memory and fewer cycles than the
other interface types. This is because the data in the channels is accessed directly, without copying it to
and from private variables. The CB version has performance similar to that of the RB version.
For the DNI implementation, about 0.5% of the cycles are spent in the TTL functions and 99.5%
of the cycles are spent in the tasks. This is of course dependent on the application as well as on the
implementation of the TTL functions.
Figure 25.19 shows the trade-offs that can be made by changing the granularity of the communication.
Here the granularity has been varied by a factor of 36 on the channel between the MP3 task and the output
task. In the MP3 decoder, this is made possible by using a sub-frame decoding method, which allows the
MP3 decoder to output blocks smaller than a frame.
Memory is reduced in the channel buffer, in the MP3 task, and in the output task. The channel
buffer sizes have been adjusted to match the granularity of the communication. The cycle overhead
for the small-granularity communication has two causes: smaller granularity implies more frequent
synchronization calls, and smaller buffers imply more frequent task switching.
The implementation of CB allows channel buffers to be smaller than the vector sizes used by the tasks.
One of the advantages with CB is that the channel buffer size can be reduced to achieve a memory-cycle
trade-off without rewriting the tasks themselves. Results for this are shown in Figure 25.20.
25.6.4 Implementation Conclusions
It has been shown that TTL can be implemented efficiently on a multi-DSP architecture. It has also been
shown that changing the granularity of the communication of the tasks has great impact on the memory-
cycle trade-off. The direct interfaces in TTL provide benefits in both memory usage and cycle overhead.
As expected, the most abstract interface type, CB, is also the most expensive to use. This demonstrates
the value of automating transformations between the various implementation alternatives.
2006 by Taylor & Francis Group, LLC
25-18 Embedded Systems Handbook
[Figure: MEM (#words, from about 5,000 to 13,000) versus #cycles (x 10^7, from about 4.4 to 4.8) for
full-frame, 1/2-frame, 1/4-frame, and 1/36-frame communication granularities.]
FIGURE 25.19 Simulation results for RN, when changing the communication granularity.
[Figure: MEM (#words, from about 10,000 to 13,000) versus #cycles (x 10^7, from about 4.54 to 4.63) for
full-frame down to 1/32-frame channel buffer sizes.]
FIGURE 25.20 Simulation results for CB, when changing the channel buffer size.
25.7 TTL in a Smart Imaging Core
The objective of this section is to show that the implementation of TTL in hardware, software, and
mixed hardware/software is possible with reasonable costs. The implementation allows the buffer size
and the buffer location to be changed and the channel administration to be relocated. This section first
discusses the smart imaging core, followed by a detailed description of the TTL implementation including
performance results.
25.7.1 The Smart Imaging Core
Smart imaging applications combine image and video capturing with the processing and/or interpretation
of the scene contents. An example is a camera that is able to segment a video sequence into objects, track
[Figure: the smart imaging core: an ARM 9xx CPU with I/D cache and a software shell, a video input unit
with a CCIR/camera front-end, a smart imaging coprocessor and a motion estimator coprocessor each
behind a hardware shell with TTL and DTL data interfaces, peripherals (timers, watchdog, interrupt
controller), embedded RAM and (boot) ROM, a memory controller for external Flash and SDRAM, and
off-chip communication, all connected by a communication interconnect.]
FIGURE 25.21 Architecture of the smart imaging core.
some of them, and raise an alarm if some of these objects show an unusual behaviour. The smart imaging
core described here can be embedded in a camera and is suited for automotive and mobile communication
applications. Example applications are pedestrian detection [19], low-speed obstacle detection [20], and
face tracking.
Each of the smart imaging applications uses low-level pixel processing, typically on image segments, for
an abstraction of the scene contents (feature extraction). Furthermore, motion segmentation is used to
help in tracking objects in the scene. The applications are structured such that the more control-oriented
parts are combined in a task that fits well on a CPU. All the low-level pixel processing is combined
together in a pixel processing task, which is mapped onto a smart imaging coprocessor. Likewise, the main
processing part of the motion segmentation is described as an independent task, which is mapped onto a
motion estimator coprocessor. The architecture of the smart imaging core is depicted in Figure 25.21.
More details of the architecture can be found in Reference 21. The architecture globally consists of
an ARM CPU, a video input unit, and two coprocessors: the motion estimator (ME) and the smart
imaging (SI) coprocessor. The tasks on the coprocessors and the ARM communicate with each other
using the TTL interface. By adopting the TTL interface for the coprocessors, we expect that the integration
of these blocks into future systems will be significantly simplified.
25.7.2 TTL Shells
This subsection presents the TTL shells used for the smart imaging core. These are a full hardware shell
for the SI coprocessor and software shells for the ARM and the motion estimator (VLIW) coprocessor.
25.7.2.1 TTL Shell for the SI Coprocessor
The TTL interface type used for the SI is the RB interface type using indirect data access. As already
explained in the Multi-DSP section, a TTL channel implementation consists of two parts, the channel
buffer and the channel administration. In the SI core the channel buffers are always located in main
(on-chip) memory. The channel administration can be placed both in the shell and in main memory.
[Figure: producer-side and consumer-side copies of the channel administration. Each copy holds
BASE_POINTER (bytes), CH_SIZE, TOKEN_SIZE, N_WRITTEN, and N_READ (tokens), plus a
token-aligned WRITE_POINTER (producer side) or READ_POINTER (consumer side) and a
REMOTE_POINTER referring to the other copy.]
FIGURE 25.22 Channel administration.
[Figure: coprocessor-to-shell signal interface. Handshake and control signals, with widths in bits:
request (1), acknowledge (1), port_type (1), prim_req_type (2), is_non_blocking (1), is_granted (1),
port_id (np), offset (32), size(n) (ns); data signals: wr_req (1), wr_data (32), wr_ack (1), rd_req (1),
rd_data (32), rd_ack (1).]
FIGURE 25.23 TTL signal interface.
We also use two copies of the channel administration: one at the producer side and another at the consumer
side. Figure 25.22 depicts the channel administration structure.
To make sure that the channel status is handled correctly by both the producer and consumer,
without the need for atomic access to the variables of the channel administration, n_written,
n_read, read_pointer, and write_pointer are used. Only the producer modifies n_written and
write_pointer. Similarly, only the consumer modifies n_read and read_pointer. The expression
n_written - n_read is used to calculate the amount of data available in the channel, while ch_size -
(n_written - n_read) is used to derive the amount of free room. The hardware implementation includes
correct handling of wraparounds. With this approach both the consumer and producer have a conservative
view on the channel status. The use of two token counters, n_read and n_written, instead of two pointers
as in the Multi-DSP case is due to the variable token_size that can be handled with this implementation.
Using the counters, the implementation of the acquire functions is more efficient because no multiplication
with token_size is needed. The variable remote_pointer is used to reference the remote channel
administration. The variable base_pointer, together with the offset parameter provided through the
TTL load and store calls, is used to calculate the physical address for accessing the channel buffer. The
buffer behaves as a FIFO and is implemented as a fixed-size (ch_size) circular buffer. This results in
the equation address = base_pointer + ((read_pointer + offset) mod ch_size).
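The counter bookkeeping and address calculation above can be sketched as follows; the class and field names are illustrative assumptions, not the actual TTL implementation.

```python
# Sketch of one copy of the channel administration (illustrative names,
# not the actual TTL data structures). Only the producer increments
# n_written; only the consumer increments n_read, so no atomic access
# to shared variables is needed.

class ChannelAdmin:
    def __init__(self, base_pointer, ch_size):
        self.base_pointer = base_pointer  # start address of the buffer
        self.ch_size = ch_size            # fixed circular-buffer size
        self.n_written = 0                # tokens written (producer-owned)
        self.n_read = 0                   # tokens read (consumer-owned)

    def available(self):
        # amount of data the consumer may safely read
        return self.n_written - self.n_read

    def room(self):
        # free space the producer may safely fill
        return self.ch_size - (self.n_written - self.n_read)

    def address(self, pointer, offset):
        # physical address of a buffer access; modulo handles wraparound
        return self.base_pointer + (pointer + offset) % self.ch_size
```

Because each side only ever reads the counter owned by the other side, both obtain a conservative view of the channel status, as described above.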
sum_{i=1}^{g} n_i, where n_i is the number of patterns corresponding to configuration i, 1 <= i <= g. The BIST design
procedure described next is tailored to embed a given set of deterministic test cubes in the sequence of
sum_{i=1}^{g} n_i patterns.
Modular Testing and Built-In Self-Test 27-15
[Figure: (a) the proposed logic BIST architecture: an LFSR drives m scan chains of l bits each through a
reconfigurable interconnection network (RIN) built from multiplexers; a pattern counter, a configuration
counter, stored control bits (C_0 ... C_{d-1}, D_0 ... D_{g-1}), and a decoder control the RIN, and the
scan chain outputs feed a MISR. (b) The RIN for m = 2 and g = 4: two multiplexers select, per
configuration, which LFSR outputs drive scan chains 1 and 2.]
FIGURE 27.9 (a) Proposed logic BIST architecture (b) RIN for m = 2 and g = 4. (From L. Li and K. Chakrabarty.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289-1305, 2004. With permission.)
During test application, pseudorandom patterns that do not match any deterministic test cube are also
applied to the CUT. These pseudorandom patterns can potentially detect nonmodeled faults. However,
these patterns increase the testing time. A parameter called MaxSkipPatterns, which is defined as the largest
number of pseudorandom patterns that are allowed between the matching of two deterministic cubes, is
used in the design procedure to limit the testing time. We first need to determine, for each configuration,
the number of patterns as well as the interconnections between the LFSR outputs and the scan chains. We
use the simulation procedure described next to solve this problem.
t : xxx10 xx01x 1xx1x 0xxx1
t1 : 0xxx1
t2 : 1xx1x
t3 : xx01x
t4 : xxx10
FIGURE 27.10 An illustration of converting a test cube to multiple scan chain format (m = 4, l = 5). (From L. Li
and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289-1305,
2004. With permission.)
We start with an LFSR of length L, a predetermined seed, and a known characteristic polynomial. Let
T_D = {c_1, c_2, ..., c_n} be the set of deterministic test cubes that must be applied to the CUT. The set T_D can
either target all the single stuck-at faults in the circuit, or only the hard faults that cannot be detected by a
small number of pseudorandom patterns. As illustrated in Figure 27.10, each deterministic test cube c in
the test set is converted into the multiple scan chain format as a set of m l-bit vectors {t_1, t_2, ..., t_m}, where
m is the number of scan chains and l is the length of each scan chain. The bits in a test cube are ordered
such that the least significant bit is first shifted into the scan chain. We use Conn_j^(i) to denote the set of
LFSR taps that are connected to scan chain j in configuration i, where i = 1, 2, ..., g, j = 1, 2, ..., m.
The steps of the simulation procedure are as follows:

1. Set i = 1.
2. Set Conn_j^(i) = {1, 2, ..., L} for j = 1, 2, ..., m; that is, initially each scan chain can be connected
to any tap of the LFSR.
3. Driving the LFSR for the next l clock cycles, we obtain the output of the LFSR as a set of L l-bit
vectors {O_k | k = 1, 2, ..., L}, where vector O_k is the output stream of the kth flip-flop of the LFSR
for the l clock cycles.
4. Find a test cube c* in T_D that is compatible with the outputs of the LFSR under the current
connection configuration Conn_j^(i); that is, for all j = 1, ..., m, there exists k in Conn_j^(i) such that
t_j is compatible with O_k, where c* has already been reformatted for m scan chains as a set of vectors
{t_1, t_2, ..., t_m}. (A vector u_1, u_2, ..., u_r and a vector v_1, v_2, ..., v_r are mutually compatible if for
any i, 1 <= i <= r, one of the following holds: [i] u_i = v_i if they are both care bits; [ii] u_i is a
don't-care bit; [iii] v_i is a don't-care bit.)
5. If no test cube is found in Step 4, go to Step 6 directly. Otherwise, remove the test cube c* from T_D,
and narrow each Conn_j^(i) down to the taps k in Conn_j^(i) for which t_j is compatible with O_k.
6. If in the previous MaxSkipPatterns + 1 iterations at least one test cube was found in Step 4, go
to Step 3. Otherwise, the simulation for the current configuration is concluded. The patterns that
are applied to the circuit under this configuration are those that were obtained in Step 3.
7. Match the remaining cubes in T_D to the test patterns for the current configuration; that is, if any
test vector in T_D is compatible with any pattern for the current configuration, remove it from T_D.
8. If no pseudorandom pattern for the current configuration is compatible with a test cube, the
procedure fails and exits. Otherwise, increase i by 1, and go to Step 2 to begin the iteration for the
next configuration, until T_D is empty.
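Steps 2 to 5 above can be sketched as follows; the 4-bit LFSR, its feedback tap placement, and the string encoding of cube vectors are illustrative assumptions rather than the authors' implementation.

```python
# Minimal sketch of Steps 2-5 of the simulation procedure.

def lfsr_streams(seed, l):
    """Clock a 4-bit LFSR for l cycles; return the output stream O_k of
    each flip-flop k (feedback placement for x^4 + x + 1 is an assumption)."""
    state = list(seed)
    streams = [[] for _ in range(4)]
    for _ in range(l):
        for k in range(4):
            streams[k].append(state[k])
        state = [state[0] ^ state[3]] + state[:3]  # assumed tap placement
    return streams

def compatible(t, o):
    """A cube vector t (string with 'x' don't cares) is compatible with
    a bit stream o if every care bit agrees."""
    return all(tb == 'x' or tb == str(ob) for tb, ob in zip(t, o))

def try_embed(cube, streams, conn):
    """Steps 4-5: if every scan-chain vector t_j of the cube matches some
    tap in Conn_j, narrow each Conn_j to the matching taps."""
    narrowed = []
    for t, taps in zip(cube, conn):
        ok = [k for k in taps if compatible(t, streams[k])]
        if not ok:
            return False        # cube cannot be embedded in this pattern
        narrowed.append(ok)
    conn[:] = narrowed          # Step 5: narrow down {Conn_j}
    return True
```

In a full procedure, try_embed would be called once per pattern, and MaxSkipPatterns + 1 consecutive failures would close the current configuration, as in Step 6.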
Figure 27.11 shows a flowchart corresponding to the above procedure, where the variable skip_patterns is
used to record the number of consecutive patterns that are not compatible with any deterministic test cube,
and all_randoms is used to indicate whether all the patterns for the current configuration are pseudorandom
patterns.
An example of the simulation procedure is illustrated in Figure 27.12. A 4-bit autonomous LFSR with
characteristic polynomial x^4 + x + 1 is used to generate the pseudorandom patterns. There are four scan
chains and the length of each scan chain is 4 bits. The parameter MaxSkipPatterns is set to 1. The output of
the LFSR is divided into patterns p_i, i = 1, 2, .... Each pattern consists of four 4-bit vectors. The procedure
[Flowchart: start; set i = 1 and skip_patterns = 0; set Conn_j^(i) = {1, 2, ..., L} for j = 1, 2, ..., m and
all_randoms = true; obtain the outputs of the LFSR for the next l clock cycles, {O_k | k = 1, 2, ..., L}; if
there exists a c* in T_D that is compatible with {O_k} under the current Conn_j^(i), set all_randoms =
false and skip_patterns = 0, remove c* from T_D, and narrow down {Conn_j^(i)}; otherwise increment
skip_patterns; when skip_patterns = MaxSkipPatterns + 1, match the remaining cubes in T_D to the test
patterns for the current configuration; end if T_D is empty; fail if all_randoms = true; otherwise set
i = i + 1 and begin the next configuration.]
FIGURE 27.11 Flowchart illustrating the simulation procedure. (From L. Li and K. Chakrabarty. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289-1305, 2004. With permission.)
that determines the connections is shown as Step (Init) to Step (f). Step (Init) is the initialization step in
which all the connections Conn_j^(1), j = 1, 2, 3, 4 are set to {1, 2, 3, 4}. In Step (a), the first pattern p_1 is
matched with the test cube c_1, and the connections are shown for each scan chain: scan chain 1 can be
connected to x_1 or x_4, both scan chain 2 and scan chain 3 can only be connected to x_2, and scan chain 4
can be connected to x_1, x_2, or x_4. In Step (c), none of the cubes is compatible with p_3. When neither
p_5 nor p_6 matches any cubes in Step (e), the iterations for the current configuration are terminated. The
patterns that are applied to the CUT in this configuration are p_1, p_2, ..., p_6. We then compare the
remaining cube c_4 with the six patterns and find that it is compatible with p_2. So c_4 is also covered by
the test patterns for the current configuration. Thus the connections for this configuration are: scan chain 1
is connected to x_4, both scan chain 2 and scan chain 3 are connected to x_2, and scan chain 4 is connected
to x_1. Since p_5 and p_6 are not compatible with any deterministic cubes, the number of patterns for this
configuration is set to 4. If there are test cubes remaining to be matched, the iteration for the next
configuration starts from p_5.
27.3.1 Declustering the Care Bits
The simulation procedure to determine the number of patterns and the connections for each configuration
can sometimes fail to embed the test cubes in the LFSR sequence. This can happen if MaxSkipPatterns
[Figure: a 4-bit LFSR with flip-flops x_1-x_4 and characteristic polynomial x^4 + x + 1; its output stream
(seeded with 0001) is divided into patterns s_1, s_2, ..., s_6 of four 4-bit vectors each; four test cubes
t_1-t_4 given as 4-bit vectors with don't cares (00xx, 1xx0, 10xx, x0xx, 0xxx, xx1x, 01xx, 11xx, xx11,
x10x, x1x0, 10xx, xx11, 1xxx, 01xx, x001); and the determination of connections, starting from the
initialization Conn_11 = Conn_12 = Conn_13 = Conn_14 = {1, 2, 3, 4} and proceeding through steps
(a) s_1 : t_1, (b) s_2 : t_3, (c) s_3 : none, (d) s_4 : t_2, (e) s_5, s_6 : none, (f) s_2 : t_4, narrowing the
final connections to (4), (2), (2), (1).]
FIGURE 27.12 An illustration of the simulation procedure. (From L. Li and K. Chakrabarty. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, 23, 1289-1305, 2004. With permission.)
is too small, or the test cubes are hard to match with the outputs of the LFSR. During our experiments,
we found that it was very difficult to embed the test cubes for the s38417 benchmark circuit. On closer
inspection, we found that the care bits in some of the test cubes for s38417 are highly clustered, even
though the percentage of care bits in T_D is small. When these test cubes are converted into a multiple scan
chain format, most of the vectors contain very few care bits but a few vectors contain a large number of
care bits. These vectors with many care bits are hard to embed in the output sequence of the LFSR.
In order to embed test cubes with highly clustered care bits, we propose two declustering strategies. The
first is to reorganize the scan chains such that the care bits can be scattered across many scan chains, and
each scan chain contains only a few care bits. The other strategy is based on the use of additional logic to
interleave the data that are shifted into the different scan chains. The first strategy requires reorganization
of the scan chains, but it does not incur extra hardware overhead. Care needs to be taken in scan chain
redesign to avoid timing closure problems. The interleaving method does not modify the scan chains, but
it requires additional hardware and control mechanisms.
The method of reorganization of scan chains is illustrated in Figure 27.13. As shown in the figure,
before the reorganization, all the care bits of the given test cube are grouped in the second vector, which is
[Figure: a 30-bit test cube x x x x x x | x x x x x x | x x x x x x | 1 0 0 1 1 0 | x x x x x x is reformatted
onto five scan chains (0-4) of six scan cells each (cells 1-30); before reorganization all six care bits
1 0 0 1 1 0 end up in a single scan chain, so one vector carries them all; after reorganization the vertical
vectors are rotated and the care bits are scattered over the vectors.]
FIGURE 27.13 An illustration of the reorganization of scan chains. (From L. Li and K. Chakrabarty. IEEE Transactions
on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289-1305, 2004. With permission.)
hard to match with the output of the LFSR. After the reorganization, the care bits are scattered across all the
vectors, and the largest number of care bits in a vector is only two. This greatly increases the probability
that this vector can be matched to an output pattern of the LFSR. Note that the concept of reorganization
of scan chains is also used in [9]. However, the reorganization used in [9] changes the scan chain structure
and makes it unsuitable for response capture; a separate solution is needed in [9] to circumvent this
problem. In our approach, the basic structure of the scan chains is maintained and the usual scan test
procedure of pattern shift-in, response capture, and shift-out can be used.
The scan cells in the CUT can be indexed as c_{i,j}, i = 0, 1, ..., m - 1, j = 0, 1, ..., l - 1, where m is the
number of scan chains and l is the length of a scan chain. Note that we start the indices from 0 to facilitate
the description of the scan chain reorganization procedure. The ith scan chain consists of the l scan cells
c_{i,j}, j = 0, 1, ..., l - 1. We use c'_{i,j} to denote the reorganized scan cells, in which the ith scan chain
consists of the l scan cells c'_{i,j}, j = 0, 1, ..., l - 1. For each j = 0, 1, ..., l - 1, the m cells c_{0,j}, c_{1,j},
..., c_{m-1,j} constitute a vertical vector. The reorganized scan cell structure is obtained by rotating each
such vertical vector upwards by d positions, where d = j mod m; that is, c'_{i,j} = c_{k,j}, where k is given
by k = (i + d) mod m.
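The rotation c'_{i,j} = c_{(i+d) mod m, j} with d = j mod m can be sketched as follows; the cell labeling is illustrative.

```python
# Sketch of the scan-cell reorganization: each vertical vector of m cells
# at depth j is rotated upward by d = j mod m positions, so that
# c'_{i,j} = c_{(i + j mod m) mod m, j}.

def reorganize(cells):
    """cells[i][j] is the label of scan cell j in chain i; returns the
    reorganized structure."""
    m, l = len(cells), len(cells[0])
    return [[cells[(i + j % m) % m][j] for j in range(l)] for i in range(m)]
```

Since each rotation permutes a vertical vector, every scan cell appears exactly once in the reorganized structure, which is why the usual shift-in, capture, and shift-out procedure still applies.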
An alternative method for declustering, based on the interleaving of the inputs to the scan chains, is
shown in Figure 27.14. We insert an extra stage of multiplexers between the outputs of the RIN and the
inputs of the scan chains. From the perspective of the RIN, the logic that follows it, that is, the combination
of the multiplexers for interleaving and the scan chains, is simply a reorganized scan chain with an
appropriate arrangement of the connections between the two stages of multiplexers. For a CUT with m
scan chains, m multiplexers are used for reconfiguration, and m multiplexers are inserted for interleaving.
Each of the multiplexers used for interleaving has m inputs, which are selected in ascending order during
the shifting in of a test pattern; that is, the first input is selected for the first scan clock cycle, the second
input is selected for the second scan clock cycle, and so on. After the mth input is selected, the procedure
is repeated with the first input. We use A_i to denote the output of the ith multiplexer for reconfiguration
and B_{i,j} to denote the jth input of the ith multiplexer for interleaving, where i, j = 1, 2, ..., m. The
interleaving is carried out by connecting the inputs of the multiplexers for interleaving with the outputs
[Figure: m multiplexers for reconfiguration followed by an extra stage of m multiplexers for interleaving,
driving scan chains 1 through 5; only the connections related to the first reconfiguration multiplexer
are shown.]
FIGURE 27.14 An illustration of interleaving of the inputs of scan chains. (From L. Li and K. Chakrabarty.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289-1305, 2004. With
permission.)
of the multiplexers for reconfiguration such that

    B_{i,j} = A_{i-j+1}      if i >= j
    B_{i,j} = A_{i-j+1+m}    if i < j
In order to control the multiplexers for interleaving, an architecture similar to the control logic for the
reconfigurable interconnection network can be used. However, for the interleaving, we need neither the
storage nor the pattern counter. A bit counter counting up to m - 1 (where m is the number of scan
chains) is used to replace the configuration counter. The bit counter is reset to 0 at the start of the shifting
in of each pattern, and it returns to 0 after counting to m - 1.
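The connection rule B_{i,j} = A_{i-j+1} (i >= j) or A_{i-j+1+m} (i < j) can be sketched as follows, using 1-based indices as in the text; the function name is illustrative.

```python
# Sketch of the interleaving connection rule between the outputs A_1..A_m
# of the reconfiguration multiplexers and input j of interleaving
# multiplexer i (1-based indices).

def interleave_source(i, j, m):
    """Index of the reconfiguration-multiplexer output that drives B_{i,j}."""
    return i - j + 1 if i >= j else i - j + 1 + m
```

As the bit counter steps j from 1 to m during shift-in, each interleaving multiplexer therefore cycles through all m reconfiguration outputs exactly once.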
Consider the test cube shown in Figure 27.13. After adding the second stage of multiplexers and connecting
the inputs of the multiplexers for interleaving with the outputs of the multiplexers for reconfiguration,
as shown in Figure 27.14 (only the connections related to the first RIN multiplexer are shown for clarity),
the output of the first multiplexer for reconfiguration should match with x x x x 1 x, the same string as that
in the scan cell reorganization method. Note that the above reorganization and interleaving procedures yield
the same set of test cubes.
Detailed simulation results for benchmark circuits are presented in [83]. Here we discuss the influence
of the initial seed on the effectiveness of test set embedding. Experiments were carried out with 20
randomly selected initial seeds for the test set from [9] targeting all faults, with scan cell reorganization
and 32 scan chains. The statistics on the number of configurations are listed in Table 27.1(A). We also carried
out the same experiments for the test set from [9] targeting random-pattern-resistant faults and list the
results in Table 27.1(B). The results show that the number of configurations depends on the initial seed.
However, the dependency is not very significant, due in part to the reconfigurability of the interconnection
network.
In order to evaluate the effectiveness of the proposed approach for large circuits, we applied the method
to test sets for two production circuits from IBM, namely CKT1 and CKT2. CKT1 is a logic core consisting
of 51,082 gates and its test set provides 99.80% fault coverage. CKT2 is a logic core consisting of 94,340
gates and its test set provides 99.76% fault coverage. The number of scan chains is fixed to 64 and 128
for each of these two circuits. We modified the simulation procedure such that the configuration of the
interconnection network can be changed during the shifting in of a test cube, and we set the parameter
TABLE 27.1 Statistics on the Number of the Configurations with Random Seeds for Test Sets from [9] Targeting
(A) All Faults and (B) Random-Pattern-Resistant Faults, with Scan Chain Reorganization (Assuming 32 Scan
Chains for Each Circuit)

(A) All faults
Circuit   Minimum   Maximum   Mean     Standard deviation
s5378     6         9         7.2      0.83
s9234     29        33        30.5     1.10
s13207    10        14        11.85    1.18
s15850    16        20        17.5     1.24
s38417    180       192       185.9    3.23
s38584    9         12        9.8      0.89

(B) Random-pattern-resistant faults
Circuit   Minimum   Maximum   Mean     Standard deviation
s5378     3         5         3.55     0.60
s9234     32        36        33.7     1.38
s13207    5         8         5.95     0.89
s15850    19        25        21.8     1.51
s38417    118       129       121.45   3.07
s38584    9         12        10.8     0.83

Source: From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
23, 1289-1305, 2004. With permission.
TABLE 27.2 Results for Test Cubes for Circuits from IBM

Circuit (test cubes / scan cells)   Scan    Scan chain     No. of    Testing time     Hardware overhead   Storage   Encoding     CPU
                                    chains  length (bits)  configs   (clock cycles)   (GEs, %)            (bits)    efficiency   time
CKT1 (17,176 / 12,256)              64      192            1,792     1,351,104        68,145.5 (8.52%)    21,504    46.79        1 h 37 min
CKT1 (17,176 / 12,256)              128     96             1,079     566,496          75,579.5 (9.45%)    12,948    77.71        1 h 26 min
CKT2 (43,079 / 22,216)              64      348            3,221     4,051,764        124,062.5 (7.26%)   38,652    26.03        6 h 35 min
CKT2 (43,079 / 22,216)              128     174            1,828     2,338,005        128,009.5 (7.49%)   21,936    45.87        6 h 06 min

Source: From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
23, 1289-1305, 2004. With permission.
TABLE 27.3 The Number of Reconfigurations Per Pattern for Test Sets from IBM

                                                               No. of reconfigurations per pattern
Circuit   No. of scan chains   Test cube length per scan chain (bits)   Minimum   Maximum   Mean   Standard deviation
CKT1      64                   192                                      0         3         0.11   0.0210
CKT1      128                  96                                       0         3         0.07   0.0354
CKT2      64                   348                                      0         3         0.11   0.0204
CKT2      128                  174                                      0         15        0.06   0.3018

Source: From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 23, 1289-1305, 2004. With permission.
MaxSkipPatterns to 0. Accordingly, in the proposed BIST architecture shown in Figure 27.9(a), the stored
control bits give the number of bits per configuration instead of the number of patterns per configuration,
and the pattern counter is replaced by a bit counter that counts the number of bits that have been
shifted into the scan chains. Table 27.2 lists the results for these two industrial circuits. The hardware
overhead is less than 10%, and very high encoding efficiency (up to 77.71) is achieved for both circuits.
As mentioned above, we allow the configuration of the interconnection network to be changed during
the shifting in of a test cube. Table 27.3, Figure 27.15, and Figure 27.16 present the statistics on the
number of reconfigurations per test cube. The number of intrapattern reconfigurations is small for both
circuits.
[Figure: percentage of patterns (up to 95%) versus the number of reconfigurations per pattern (0 to 3),
plotted for 64 and 128 scan chains.]
FIGURE 27.15 The number of patterns versus the number of reconfigurations needed for CKT1. (From L. Li and
K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289-1305, 2004.
With permission.)
[Figure: percentage of patterns versus the number of reconfigurations r per pattern, for 64 and 128 scan
chains; (a) 0 <= r <= 1 covers nearly all patterns, (b) 2 <= r <= 15 accounts for at most about 0.03% of
patterns.]
FIGURE 27.16 The number of patterns versus the number of reconfigurations r needed for CKT2. (a) 0 <= r <= 1,
(b) r >= 2. (From L. Li and K. Chakrabarty. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems, 23, 1289-1305, 2004. With permission.)
27.4 Conclusions
Rapid advances in test development techniques are needed to reduce the test cost of million-gate SOC
devices. This survey has described a number of state-of-the-art techniques for reducing test time and
test data volume, thereby decreasing test cost. Modular test techniques for digital, mixed-signal, and
hierarchical SOCs must develop further to keep pace with design complexity and integration density.
The test data bandwidth needs for analog cores are significantly different from those for digital cores;
therefore, unified top-level testing of mixed-signal SOCs remains a major challenge. Most SOCs today
include embedded cores that operate in multiple clock domains. Since the forthcoming P1500 standard
does not address wrapper design for at-speed testing of such cores, research is needed to develop wrapper
design techniques for multifrequency cores. There is also a pressing need for test planning methods that
can efficiently schedule tests for these multifrequency cores. The work reported in [41] is a promising
first step in this direction. In addition, compression techniques for embedded cores also need to be
developed and refined. Of particular interest are techniques that can combine TAM optimization and
test scheduling with test data compression. Some preliminary studies on this problem have been reported
recently [84,85].
We have also reviewed a new approach for deterministic BIST based on the use of a RIN. The RIN
is placed between the outputs of pseudorandom pattern generator, for example, an LFSR, and the scan
inputs of the CUT. It consists only of multiplexer switches and it is designed using a synthesis procedure
that takes as inputs the pseudorandom sequence from the LFSR and the deterministic test cubes for the
CUT. As a nonintrusive BIST solution, the proposed approach does not require any circuit redesign and
it has minimal impact on circuit performance.
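The mechanics behind such a network can be sketched in a few lines of code: each scan chain is driven through a multiplexer that selects one LFSR output, and the selection is changed (a reconfiguration) only when no single output can supply all the care bits of a test cube. The sketch below is purely illustrative; the LFSR, the cube encoding, and the function names are invented here and are not the synthesis procedure of [83].

```python
# Illustrative sketch (not the procedure of [83]): can a deterministic test cube
# be embedded in an LFSR sequence through a multiplexer that routes one LFSR
# output to a scan chain, without a mid-cube reconfiguration?

def lfsr_states(seed, taps, width, length):
    """Return `length` successive states of a Fibonacci LFSR as bit tuples."""
    state, states = seed, []
    for _ in range(length):
        states.append(tuple((state >> i) & 1 for i in range(width)))
        fb = 0
        for t in taps:                       # feedback = XOR of the tap bits
            fb ^= (state >> t) & 1
        state = ((state << 1) & ((1 << width) - 1)) | fb
    return states

def matching_output(states, cube):
    """Find an LFSR output whose bit stream matches every care bit of `cube`
    (a list with 0, 1, or None for don't-care). Returns None when no single
    output works, i.e., a reconfiguration within the cube would be needed."""
    for out in range(len(states[0])):
        if all(c is None or states[cyc][out] == c for cyc, c in enumerate(cube)):
            return out
    return None
```

Fully specified cubes are rarely embeddable through a fixed connection, which is exactly why the multiplexers must be reconfigured between (and occasionally within) patterns, as Figure 27.16 quantifies.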
Acknowledgments
This survey is based on joint work and papers published with several students and colleagues. In particular,
the author acknowledges Anshuman Chandra, Vikram Iyengar, Lei Li, Erik Jan Marinissen, Sule Ozev, and
Anuja Sehgal.
References
[1] M.L. Bushnell and V.D. Agrawal. Essentials of Electronic Testing. Kluwer Academic Publishers, Norwell, MA, 2000.
[2] Semiconductor Industry Association. International Technology Roadmap for Semiconductors, 2001 Edition. http://public.itrs.net/Files/2001ITRS/Home.htm
[3] A. Khoche and J. Rivoir. I/O bandwidth bottleneck for test: is it real? Test Resource Partitioning Workshop, 2002.
[4] G. Hetherington, T. Fryars, N. Tamarapalli, M. Kassab, A. Hassan, and J. Rajski. Logic BIST for large industrial designs: real issues and case studies. In Proceedings of the International Test Conference, pp. 358–367, 1999.
[5] Y. Zorian, E.J. Marinissen, and S. Dey. Testing embedded-core-based system chips. IEEE Computer, 32, 52–60, 1999.
[6] O. Farnsworth. IBM Corp., personal communication, April 2003.
[7] K. Chakrabarty. Test scheduling for core-based systems using mixed-integer linear programming. IEEE Transactions on Computer-Aided Design, 19, 1163–1174, 2000.
[8] M. Sugihara, H. Date, and H. Yasuura. A novel test methodology for core-based system LSIs and a testing time minimization problem. In Proceedings of the International Test Conference, pp. 465–472, 1998.
[9] S. Hellebrand, H.-G. Liang, and H.-J. Wunderlich. A mixed-mode BIST scheme based on reseeding of folding counters. In Proceedings of the International Test Conference, pp. 778–784, 2000.
[10] C.V. Krishna, A. Jas, and N.A. Touba. Test vector encoding using partial LFSR reseeding. In Proceedings of the International Test Conference, pp. 885–893, 2001.
[11] H.-G. Liang, S. Hellebrand, and H.-J. Wunderlich. Two-dimensional test data compression for scan-based deterministic BIST. In Proceedings of the International Test Conference, pp. 894–902, 2001.
[12] J. Rajski, J. Tyszer, and N. Zacharia. Test data decompression for multiple scan designs with boundary scan. IEEE Transactions on Computers, 47, 1188–1200, 1998.
[13] N.A. Touba and E.J. McCluskey. Altering a pseudo-random bit sequence for scan-based BIST. In Proceedings of the International Test Conference, pp. 167–175, 1996.
[14] S. Wang. Low hardware overhead scan based 3-weight weighted random BIST. In Proceedings of the International Test Conference, pp. 868–877, 2001.
[15] H.-J. Wunderlich and G. Kiefer. Bit-flipping BIST. In Proceedings of the International Conference on Computer-Aided Design, pp. 337–343, 1996.
[16] A.A. Al-Yamani and E.J. McCluskey. Built-in reseeding for serial BIST. In Proceedings of the VLSI Test Symposium, pp. 63–68, 2003.
[17] A.A. Al-Yamani and E.J. McCluskey. BIST reseeding with very few seeds. In Proceedings of the VLSI Test Symposium, pp. 69–74, 2003.
[18] S. Chiusano, P. Prinetto, and H.-J. Wunderlich. Non-intrusive BIST for systems-on-a-chip. In Proceedings of the International Test Conference, pp. 644–651, 2000.
[19] S. Hellebrand, S. Tarnick, J. Rajski, and B. Courtois. Generation of vector patterns through reseeding of multiple-polynomial linear feedback shift registers. In Proceedings of the International Test Conference, pp. 120–129, 1992.
[20] M.F. AlShaibi and C.R. Kime. Fixed-biased pseudorandom built-in self-test for random pattern resistant circuits. In Proceedings of the International Test Conference, pp. 929–938, 1994.
[21] M.F. AlShaibi and C.R. Kime. MFBIST: a BIST method for random pattern resistant circuits. In Proceedings of the International Test Conference, pp. 176–185, 1996.
[22] S. Pateras and J. Rajski. Cube-contained random patterns and their application to the complete testing of synthesized multi-level circuits. In Proceedings of the International Test Conference, pp. 473–482, 1991.
[23] N.A. Touba and E.J. McCluskey. Synthesis of mapping logic for generating transformed pseudo-random patterns for BIST. In Proceedings of the International Test Conference, pp. 674–682, 1995.
[24] N.A. Touba and E.J. McCluskey. Transformed pseudo-random patterns for BIST. In Proceedings of the VLSI Test Symposium, pp. 410–416, 1995.
[25] M. Bershteyn. Calculation of multiple sets of weights for weighted random testing. In Proceedings of the International Test Conference, pp. 1031–1040, 1993.
[26] F. Brglez, G. Gloster, and G. Kedem. Built-in self-test with weighted random pattern hardware. In Proceedings of the International Conference on Computer Design, pp. 161–166, 1990.
[27] F. Muradali, V.K. Agarwal, and B. Nadeau-Dostie. A new procedure for weighted random built-in self-test. In Proceedings of the International Test Conference, pp. 660–669, 1990.
[28] I. Pomeranz and S.M. Reddy. 3-weight pseudo-random test generation based on a deterministic test set for combinational and sequential circuits. IEEE Transactions on Computer-Aided Design, 12, 1050–1058, 1993.
[29] A. Jas, C.V. Krishna, and N.A. Touba. Hybrid BIST based on weighted pseudo-random testing: a new test resource partitioning scheme. In Proceedings of the VLSI Test Symposium, pp. 2–8, 2001.
[30] K. Chakrabarty. Optimal test access architectures for system-on-a-chip. ACM Transactions on Design Automation of Electronic Systems, 6, 26–49, 2001.
[31] V. Iyengar, K. Chakrabarty, and E.J. Marinissen. Test wrapper and test access mechanism co-optimization for system-on-chip. Journal of Electronic Testing: Theory and Applications, 18, 213–230, 2002.
[32] E.J. Marinissen, S.K. Goel, and M. Lousberg. Wrapper design for embedded core test. In Proceedings of the International Test Conference, pp. 911–920, 2000.
[33] V. Iyengar and K. Chakrabarty. System-on-a-chip test scheduling with precedence relationships, preemption, and power constraints. IEEE Transactions on Computer-Aided Design of ICs and Systems, 21, 1088–1094, 2002.
[34] E.J. Marinissen and H. Vranken. On the role of DfT in IC-ATE matching. In International Workshop on TRP, 2001.
[35] E. Volkerink et al. Test economics for multi-site test with modern cost reduction techniques. In Proceedings of the VLSI Test Symposium, pp. 411–416, 2002.
[36] M. Abramovici, M.A. Breuer, and A.D. Friedman. Digital Systems Testing and Testable Design. Computer Science Press, New York, 1990.
[37] P. Varma and S. Bhatia. A structured test re-use methodology for core-based system chips. In Proceedings of the International Test Conference, pp. 294–302, 1998.
[38] E.J. Marinissen et al. A structured and scalable mechanism for test access to embedded reusable cores. In Proceedings of the International Test Conference, pp. 284–293, 1998.
[39] T.J. Chakraborty, S. Bhawmik, and C.-H. Chiang. Test access methodology for system-on-chip testing. In Proceedings of the International Workshop on Testing Embedded Core-Based System-Chips, pp. 1.1-1–1.1-7, 2000.
[40] Q. Xu and N. Nicolici. On reducing wrapper boundary register cells in modular SOC testing. In Proceedings of the International Test Conference, pp. 622–631, 2003.
[41] Q. Xu and N. Nicolici. Wrapper design for testing IP cores with multiple clock domains. In Proceedings of the Design, Automation and Test in Europe (DATE) Conference, pp. 416–421, 2004.
[42] V. Immaneni and S. Raman. Direct access test scheme design of block and core cells for embedded ASICs. In Proceedings of the International Test Conference, pp. 488–492, 1990.
[43] P. Harrod. Testing re-usable IP: a case study. In Proceedings of the International Test Conference, pp. 493–498, 1999.
[44] I. Ghosh, S. Dey, and N.K. Jha. A fast and low cost testing technique for core-based system-on-chip. In Proceedings of the Design Automation Conference, pp. 542–547, 1998.
[45] K. Chakrabarty. A synthesis-for-transparency approach for hierarchical and system-on-a-chip test. IEEE Transactions on VLSI Systems, 11, 167–179, 2003.
[46] M. Nourani and C. Papachristou. An ILP formulation to optimize test access mechanism in system-on-chip testing. In Proceedings of the International Test Conference, pp. 902–910, 2000.
[47] L. Whetsel. An IEEE 1149.1 based test access architecture for ICs with embedded cores. In Proceedings of the International Test Conference, pp. 69–78, 1997.
[48] N.A. Touba and B. Pouya. Using partial isolation rings to test core-based designs. IEEE Design and Test of Computers, 14, 52–59, 1997.
[49] J. Aerts and E.J. Marinissen. Scan chain design for test time reduction in core-based ICs. In Proceedings of the International Test Conference, pp. 448–457, 1998.
[50] V. Iyengar, K. Chakrabarty, and E.J. Marinissen. Test access mechanism optimization, test scheduling and tester data volume reduction for system-on-chip. IEEE Transactions on Computers, 52, 1619–1632, 2003.
[51] V. Iyengar, K. Chakrabarty, and E.J. Marinissen. Recent advances in TAM optimization, test scheduling, and test resource management for modular testing of core-based SOCs. In Proceedings of the IEEE Asian Test Symposium, pp. 320–325, 2002.
[52] Z.S. Ebadi and A. Ivanov. Design of an optimal test access architecture using a genetic algorithm. In Proceedings of the Asian Test Symposium, pp. 205–210, 2001.
[53] V. Iyengar and K. Chakrabarty. Test bus sizing for system-on-a-chip. IEEE Transactions on Computers, 51, 449–459, 2002.
[54] Y. Huang et al. Resource allocation and test scheduling for concurrent test of core-based SOC design. In Proceedings of the Asian Test Symposium, pp. 265–270, 2001.
[55] Y. Huang et al. On concurrent test of core-based SOC design. Journal of Electronic Testing: Theory and Applications, 18, 401–414, 2002.
[56] V. Iyengar, K. Chakrabarty, and E.J. Marinissen. Efficient test access mechanism optimization for system-on-chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 22, 635–643, 2003.
[57] E.J. Marinissen and S.K. Goel. Analysis of test bandwidth utilization in test bus and TestRail architectures in SOCs. Digest of Papers of DDECS, pp. 52–60, 2002.
[58] P.T. Gonciari, B. Al-Hashimi, and N. Nicolici. Addressing useless test data in core-based system-on-a-chip test. IEEE Transactions on Computer-Aided Design of ICs and Systems, 22, 1568–1590, 2003.
[59] S.K. Goel and E.J. Marinissen. Effective and efficient test architecture design for SOCs. In Proceedings of the International Test Conference, pp. 529–538, 2002.
[60] W. Jiang and B. Vinnakota. Defect-oriented test scheduling. In Proceedings of the VLSI Test Symposium, pp. 433–438, 1999.
[61] E. Larsson, J. Pouget, and Z. Peng. Defect-aware SOC test scheduling. In Proceedings of the VLSI Test Symposium, pp. 359–364, 2004.
[62] F. Beenker, B. Bennetts, and L. Thijssen. Testability Concepts for Digital ICs: The Macro Test Approach. Frontiers in Electronic Testing, Vol. 3. Kluwer Academic Publishers, Boston, MA, 1995.
[63] E.J. Marinissen et al. On IEEE P1500's standard for embedded core test. Journal of Electronic Testing: Theory and Applications, 18, 365–383, 2002.
[64] Y. Zorian. A distributed BIST control scheme for complex VLSI devices. In Proceedings of the VLSI Test Symposium, pp. 6–11, 1993.
[65] R.M. Chou, K.K. Saluja, and V.D. Agrawal. Scheduling tests for VLSI systems under power constraints. IEEE Transactions on VLSI Systems, 5, 175–184, 1997.
[66] V. Muresan, X. Wang, and M. Vladutiu. A comparison of classical scheduling approaches in power-constrained block-test scheduling. In Proceedings of the International Test Conference, pp. 882–891, 2000.
[67] E. Larsson and Z. Peng. Test scheduling and scan-chain division under power constraint. In Proceedings of the Asian Test Symposium, pp. 259–264, 2001.
[68] E. Larsson and Z. Peng. An integrated system-on-chip test framework. In Proceedings of the DATE Conference, pp. 138–144, 2001.
[69] S. Koranne. On test scheduling for core-based SOCs. In Proceedings of the International Conference on VLSI Design, pp. 505–510, 2002.
[70] V. Iyengar, S.K. Goel, E.J. Marinissen, and K. Chakrabarty. Test resource optimization for multi-site testing of SOCs under ATE memory depth constraints. In Proceedings of the International Test Conference, pp. 1159–1168, 2002.
[71] S. Koranne and V. Iyengar. A novel representation of embedded core test schedules. In Proceedings of the International Test Conference, pp. 539–540, 2002.
[72] A. Sehgal, V. Iyengar, M.D. Krasniewski, and K. Chakrabarty. Test cost reduction for SOCs using virtual TAMs and Lagrange multipliers. In Proceedings of the IEEE/ACM Design Automation Conference, pp. 738–743, 2003.
[73] A. Sehgal, V. Iyengar, and K. Chakrabarty. SOC test planning using virtual test access architectures. IEEE Transactions on VLSI Systems, 12, 1263–1276, 2004.
[74] A. Sehgal and K. Chakrabarty. Efficient modular testing of SOCs using dual-speed TAM architectures. In Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE) Conference, pp. 422–427, 2004.
[75] Agilent Technologies. Winning in the SOC market, available online at: http://cp.literature.agilent.com/litweb/pdf/5988-7344EN.pdf
[76] Teradyne Technologies. Tiger: advanced digital with silicon germanium technology. http://www.teradyne.com/tiger/digital.html
[77] T. Yamamoto, S.-I. Gotoh, T. Takahashi, K. Irie, K. Ohshima, and N. Mimura. A mixed-signal 0.18-µm CMOS SoC for DVD systems with 432-MSample/s PRML read channel and 16-Mb embedded DRAM. IEEE Journal of Solid-State Circuits, 36, 1785–1794, 2001.
[78] H. Kundert, K. Chang, D. Jefferies, G. Lamant, E. Malavasi, and F. Sendig. Design of mixed-signal systems-on-a-chip. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 19, 1561–1571, 2000.
[79] E. Liu, C. Wong, Q. Shami, S. Mohapatra, R. Landy, P. Sheldon, and G. Woodward. Complete mixed-signal building blocks for single-chip GSM baseband processing. In Proceedings of the IEEE Custom Integrated Circuits Conference, pp. 11–14, 1998.
[80] A. Cron. IEEE P1149.4 almost a standard. In Proceedings of the International Test Conference, pp. 174–182, 1997.
[81] S.K. Sunter. Cost/benefit analysis of the P1149.4 mixed-signal test bus. In IEE Proceedings Circuits, Devices and Systems, 143, 393–398, 1996.
[82] A. Sehgal, S. Ozev, and K. Chakrabarty. TAM optimization for mixed-signal SOCs using test wrappers for analog cores. In Proceedings of the IEEE International Conference on CAD, pp. 95–99, 2003.
[83] L. Li and K. Chakrabarty. Test set embedding for deterministic BIST using a reconfigurable interconnection network. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 23, 1289–1305, 2004.
[84] V. Iyengar, A. Chandra, S. Schweizer, and K. Chakrabarty. A unified approach for SOC testing using test data compression and TAM optimization. In Proceedings of the IEEE/ACM Design, Automation and Test in Europe (DATE) Conference, pp. 1188–1189, 2003.
[85] P.T. Gonciari and B. Al-Hashimi. A compression-driven test access mechanism design approach. In Proceedings of the European Test Symposium, pp. 100–105, 2004.
28
Embedded Software-Based Self-Testing for SoC Design

Kwang-Ting (Tim) Cheng
University of California at Santa Barbara
28.1 Introduction
28.2 Embedded Processor Self-Testing
    Stuck-At Fault Testing
28.3 Test Program Synthesis Using VCCs
28.4 Delay Testing
28.5 Embedded Processor Self-Diagnosis
28.6 Self-Testing of Buses and Global Interconnects
28.7 Self-Testing of Other Nonprogrammable IP Cores
28.8 Instruction-Level DfT/Test Instructions
28.9 Self-Test of On-Chip ADC/DAC and Analog Components Using DSP-Based Approaches
28.10 Conclusions
Acknowledgments
References
The increasing heterogeneity and programmability associated with system-on-chip (SoC) architecture, together with ever-increasing operating frequencies and technology changes, are demanding fundamental changes in integrated circuit (IC) testing. At-speed testing of high-speed circuits with external testers is becoming increasingly difficult owing to the growing gap between design and tester performance, the growing cost of high-performance testers, and the increasing yield loss caused by inherent tester inaccuracy. Therefore, empowering the chip to test itself seems like a sensible solution. Hardware-based self-testing techniques (known as built-in self-test, or BIST) have limitations owing to performance, area, and design time overhead, as well as problems caused by the application of nonfunctional patterns (which may result in higher power consumption during testing, over-testing, yield loss, etc.).
The embedded software-based self-testing technique has recently become the focus of intense research. One guiding principle of this embedded self-test paradigm is to utilize on-chip programmable resources (such as embedded microprocessors and digital signal processors, DSPs) for on-chip test generation, test delivery, signal acquisition, response analysis, and even diagnosis. After the programmable components
have been self-tested, they can be reused for testing on-chip buses, interfaces, and other nonprogrammable components. Embedded test techniques based on this principle reduce the need for dedicated test hardware and enable easier application and more accurate analysis of at-speed test signals on-chip. In this chapter, we survey this emerging embedded software-based self-testing paradigm and outline its roadmap.
28.1 Introduction
System-on-chip has become a widely accepted architecture for highly complex systems on a single chip.
Short time-to-market and rich functionality requirements have driven design houses to adopt the SoC design flow. A SoC contains a large number of complex, heterogeneous components that can include digital, analog, mixed-signal, radio frequency (RF), micromechanical, and other systems on a single piece of silicon. As the lines gradually fade between traditional digital, analog, RF, and mixed-signal devices, as operational frequencies rapidly increase, and as feature sizes shrink, testing faces a whole new set of challenges.
Figure 28.1 shows the cost of silicon manufacturing versus the cost of testing given in the SIA and ITRS roadmaps [1,2]. The top curve shows the reduction in fabrication capital per transistor (Moore's law). The bottom curve shows the test capital per transistor (Moore's law for test). From the ITRS roadmap it is clear that unless fundamental changes to test are made, it may in the future cost more to test a chip than to manufacture it [2]. Figure 28.1 also shows the historical trend in test paradigms. On the one hand, the high cost of manually developed functional tests, and the difficulty of translating embedded component tests to the chip boundary where the automatic test equipment (ATE) interface exists, are making such tests infeasible even for very high-volume products. On the other hand, even when automatically developed structural tests (such as scan tests) are available, applying them with ATEs poses challenges because tester performance is increasing at a slower rate than device speed. This translates into increasing yield loss owing to external testing, since guard-banding to cover tester errors results in the loss of more and more good chips. In addition, high-speed and high-pin-count testers are very costly.
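The guard-banding argument can be quantified with a small calculation: if chip delays are normally distributed and tester timing inaccuracy forces a guard-band of width eps below the specified cycle time, every good chip whose delay falls inside the band is discarded. All numbers below are invented for illustration only.

```python
# Illustration (invented numbers): yield loss caused by tester guard-banding.
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """CDF of a normal distribution, via the error function."""
    return 0.5 * (1.0 + erf((x - mu) / (sigma * sqrt(2.0))))

def guardband_yield_loss(spec, mu, sigma, eps):
    """Fraction of truly good chips (delay <= spec) that are rejected because
    their delay falls inside the guard-band (spec - eps, spec]."""
    good = normal_cdf(spec, mu, sigma)          # chips that actually meet spec
    passed = normal_cdf(spec - eps, mu, sigma)  # chips that clear the guard-band
    return (good - passed) / good
```

With a 1.0 ns spec, a delay distribution of mean 0.9 ns and sigma 0.05 ns, widening the guard-band from 0.02 ns to 0.05 ns raises the loss of good chips from roughly 3% to roughly 14%, which is why tester inaccuracy translates directly into yield loss as device speeds outpace tester performance.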
Design-for-testability (DfT) and BIST have been regarded as possible solutions for changing the direction of the bottom curve in Figure 28.1. BIST solutions eliminate the need for high-speed testers and
[Figure 28.1: a log-scale plot of cost in cents per transistor from 1982 to 2012, based on 1997 SIA roadmap data and the 1999 ITRS roadmap. The upper curve tracks fab capital per transistor (Moore's law); the lower curve tracks test capital per transistor (Moore's law for test). A timeline of test paradigms is overlaid: functional testing (manual test generation), structural testing (scan, ATPG), built-in self-test (embedded hardware tester), and embedded software-based self-test (embedded software tester).]
FIGURE 28.1 Fab versus test capital.
show greater accuracy in their ability to apply and analyze at-speed test signals on-chip. Existing BIST techniques belong to the class of structural BIST. Structural BIST, such as scan-based BIST techniques [3–5], offers good test quality but requires the addition of dedicated test circuitry (such as full scan, linear-feedback shift registers [LFSRs] for pattern generation, multiple-input signature registers [MISRs] for data analysis, and test controllers). It therefore incurs nontrivial area, performance, and design time overhead. Moreover, structural BIST applies nonfunctional, high-switching random patterns and thus causes much higher power consumption than normal system operation. Also, to apply at-speed tests that detect timing-related faults, existing structural BIST needs to resolve various complex timing issues related to multiple clock domains, multiple frequencies, and test clock skews that are unique to the test mode.
A new embedded software-based self-testing paradigm [6–8] has the potential to alleviate the problems caused by using external testers as well as the structural BIST problems described earlier. In this testing strategy, it is assumed that programmable components of the SoC (such as processor, DSP, and FPGA components) are first self-tested by running an automatically synthesized test program that can achieve high fault coverage. Next, the programmable component is used as a pattern generator and response analyzer to test on-chip buses, interfaces between components, and other components, including digital, mixed-signal, and analog components. This self-test paradigm is sometimes referred to as functional self-testing.
The concept of embedded software-based self-testing is illustrated in Figure 28.2 using a bus-based
SoC. In this illustration, the IP cores in the SoC are connected to a standard bus via the virtual component
interface (VCI) [9]. The VCI acts as a standard communication interface between the IP core and the
on-chip bus. First, the microprocessor tests itself by executing a set of instructions. Next, the processor
can be used for testing the bus as well as other nonprogrammable IP cores in the SoC. In order to support
the self-testing methodology, the IP core is encased in a test wrapper. The test wrapper contains test
support logic needed to control shifting of the scan chain, buffers to store scan data and support at-speed
test, etc. In this example, the on-chip bus is a shared bus, and the arbiter controls access to the bus.
There are several advantages to the embedded software-based self-test approach. First, it allows reuse
of programmable resources on SoCs for test purposes. In other words, this strategy views testing as an
application of the programmable components in the SoC and thus minimizes the need for additional
dedicated test circuitry for self-test or DfT.
Second, in addition to eliminating the need for expensive high-speed testers, it can also reduce the yield
loss owing to tester inaccuracy. Self-testing offers the ability to apply and analyze at-speed test signals
on-chip with accuracy greater than that obtainable with a tester.
[Figure 28.2: a bus-based SoC with a CPU, a DSP, main and system memory, and scannable IP cores attached to a shared on-chip bus through VCI interfaces and bus-interface master/target wrappers; a bus arbiter controls access. Each IP-core wrapper contains bus-interface/VCI glue logic, a scan interface, a data buffer, and test support logic. An external tester loads the test program into main memory; the CPU applies tests and collects response signatures.]
FIGURE 28.2 Embedded software-based self-testing for SoC.
Third, while hardware-based self-test must be applied in the nonfunctional BIST mode, software-based self-test can be applied in the normal operational mode of the design; that is, the tests are applied by executing instruction sequences as in regular system operation. This eliminates the problems created by the application of nonfunctional patterns, which can result in excessive power consumption when hardware BIST is used.
Also, functional self-test can alleviate many of the over-testing and yield loss problems caused by the application of nonfunctional patterns during structural testing for delay faults and cross-talk faults (through at-speed scan or BIST). Experiments have shown that many structurally testable delay faults in microprocessors can never be sensitized in the functional mode of the circuit [7]. This is because no functionally applicable vector sequence can excite these delay faults and propagate the fault effects to destination outputs/flip-flops at-speed. Defects on these faults will not affect circuit performance, and testing them is not necessary. However, if the circuit is tested by applying nonfunctional patterns, these defects could be detected and the chip could be identified as faulty, resulting in yield loss.
Software-based fault localization tools are on the high-priority list according to the ITRS roadmap [2]. In addition to self-testing, functional information can also be used to guide diagnostic self-test program synthesis.
Testing of analog and mixed-signal circuits has been an expensive process because of limited access to the analog parts and the testers required to perform functional testing. The situation has worsened owing to the trend of integrating various digital, mixed-signal, and analog components into the SoC, with the result that testing the analog and mixed-signal parts has become the bottleneck of production testing. Most of these problems can be alleviated by self-testing on-chip ADC/DAC and analog components with DSP-based approaches that utilize on-chip programmable resources.
In the rest of the chapter, we present some representative methods on this subject. We start by discussing
processor self-test methods targeting stuck-at faults and delay faults. We also give a brief description of
a processor self-diagnosis method. Next, we continue with a discussion on methods for self-testing of
buses and global interconnects as well as other nonprogrammable IP cores on SoC. We also describe
instruction-level DfT methods based on insertion of test instructions to increase the fault coverage and
reduce the test application time and test program size. Finally, we summarize DSP-based self-test for
analog/mixed-signal components.
28.2 Embedded Processor Self-Testing
Embedded software-based self-test methods for processors [6–9] consist of two steps: the test preparation step and the self-testing step. The test preparation step involves generation of realizable tests for components of the processor. Realizable tests are those that can be delivered using instructions; therefore, to avoid producing undeliverable test patterns, the tests are generated under the constraints imposed by the processor instruction set. The tests can then be either stored or generated on-chip, depending on which method is more efficient for a particular case. A low-speed tester can be used to load the self-test signatures or the predetermined tests into the processor memory prior to the application of tests. Note that the inability to apply every conceivable input pattern to a microprocessor component does not necessarily translate into low fault coverage. If a fault can be detected only by test patterns outside the allowed input space, then by definition the fault is redundant in the normal operational mode of the processor. Thus, there is no need to test for this type of fault in production testing, even though we may still want to detect and locate it in the debugging and diagnosis phase.
The self-testing step, illustrated in Figure 28.3, involves the application of these tests using a software
tester. The software tester can also compress the responses into self-test signatures that can then be stored
in memory. The signatures can later be unloaded and analyzed by an external tester. Here, the assumption
is that the processor memory has already been tested with standard techniques such as memory BIST
before the application of the test, and so the memory is assumed to be fault-free.
[Figure 28.3: the CPU, connected over the processor bus to instruction and data memory loaded by an external tester, runs an on-chip test application program that applies test data for stimulus application, and a test response analysis program that compacts the test responses into a response signature.]
FIGURE 28.3 Embedded processor self-testing.
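A software tester's response compaction can mimic a hardware MISR: each response word is folded into a running signature that is finally compared against a precomputed golden value unloaded by the external tester. The sketch below is a minimal illustration; the word width, polynomial, and seed are arbitrary choices, not taken from the methods cited here.

```python
# Minimal sketch of software response compaction in an MISR-like style.
# Width, polynomial, and seed are arbitrary illustrative choices.

def misr_update(sig, response_word, width=32, poly=0x04C11DB7):
    """Fold one response word into the signature: shift the state left,
    apply the feedback polynomial when the top bit falls off, then XOR in
    the new response word (the 'multiple-input' part)."""
    mask = (1 << width) - 1
    msb = (sig >> (width - 1)) & 1
    sig = ((sig << 1) & mask) ^ (poly if msb else 0)
    return sig ^ (response_word & mask)

def compact(responses, seed=0xFFFFFFFF):
    """Compact a sequence of response words into a single signature."""
    sig = seed
    for r in responses:
        sig = misr_update(sig, r)
    return sig
```

Because the update is linear, a single-bit error injected into any response word leaves a nonzero state difference that cannot cancel out in later updates, so any one-bit response error is guaranteed to change the final signature; multi-bit errors can alias, as with any hardware MISR.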
In the following, we describe embedded software-based self-test methods for testing stuck-at faults [6,9] and path delay faults [7,8] in microprocessors.
28.2.1 Stuck-At Fault Testing
The method proposed by Chen and Dey [6] targets stuck-at faults in a processor core using a divide-and-
conquer approach. First, it determines the structural test needs for subcomponents in the processor (e.g.,
ALU, program counter) that are much less complex than the full processor, and hence more amenable to
random pattern testing. Next, the component tests are either stored or generated on-chip and then, at the
processor level, delivered to their target components using predetermined instruction sequences. To make
sure that the test patterns generated for a subcomponent under test can be delivered by instructions, the
test preparation step precedes the self-test step.
28.2.1.1 Test Preparation
To derive the realizable component tests (i.e., tests deliverable by instructions), the instruction-imposed constraints must first be derived for each component. These constraints can be divided into input and output constraints. The input constraints define the input space of the component allowed by instructions. They describe the correlation among the inputs to the component and can be expressed in the form of Boolean equations. The output constraints define the subset of component outputs observable by instructions. To obtain a close prediction of fault coverage in component-level fault simulation, errors propagating to component outputs that are unobservable at the processor level are regarded as unobserved.
Also, the constraints imposed by the processor instruction set can be divided into those that can be specified in a single time frame (spatial constraints) and those that span several time frames (temporal constraints). Temporal constraints are used to account for the loss of fault coverage owing to fault aliasing, in cases where the application of one test pattern involves multiple passes through a fault inside the component.
If component tests are generated by automatic test pattern generation (ATPG), the spatial constraints
can be specified during test generation with the aid of the ATPG tool. Alternatively, they can be specified
with virtual constraint circuits (VCCs) as proposed in [10] (details of this alternative will be described
in Section 28.3). Similarly, temporal constraints can be modeled with sequential VCCs. Unlike the case
of ATPG, if random tests are used for components, random patterns can be used only on independent
inputs. Component-level fault simulation is used for evaluating the preliminary fault coverage of these
tests. The final fault coverage can be evaluated with processor-level fault simulation once the entire
self-test program is constructed. Although component tests are generated only for the subset of components
that are easily accessible through instructions (e.g., ALU, program counter, etc.), other components such
as the instruction decoder are expected to be tested extensively during the application of the self-test
program.
28.2.1.2 Self-Test
After the realizable component tests have been derived, the next step is on-chip self-test using an embedded
software tester for the on-chip generation of component test patterns, the delivery of component tests,
and the analysis of their responses. Component tests can either be stored or be generated on-chip. If tests
are generated on-chip, the test needs of each component are characterized by a self-test signature, which
includes the seed, S, and the configuration, C, of a pseudo-random number generator as well as the
number of test patterns to be generated, N. The self-test signatures can be expanded on-chip into test sets
using a pseudo-random number-generation program. Multiple self-test signatures may be used for one
component if necessary. Thus, this self-test methodology will allow incorporation of any deterministic
BIST techniques that encode a deterministic test set as several pseudo-random test sets [11,12].
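To make the on-chip expansion concrete, the sketch below implements a self-test signature (S, C, N) as a software LFSR. The exact generator and the encoding of the configuration C used in [6] are not specified here, so the Fibonacci-LFSR form and the tap-list encoding are illustrative assumptions only:

```python
def expand_signature(seed, taps, count, width=32):
    """Expand a self-test signature (S=seed, C=taps, N=count) into
    `count` pseudo-random test patterns using a Fibonacci LFSR."""
    mask = (1 << width) - 1
    state = seed & mask
    patterns = []
    for _ in range(count):
        patterns.append(state)
        # XOR the tapped state bits to form the feedback bit
        feedback = 0
        for t in taps:
            feedback ^= (state >> t) & 1
        state = ((state << 1) | feedback) & mask
    return patterns
```

Because the expansion is deterministic, the same (S, C, N) triple always regenerates the same test set, which is what allows the signature to stand in for the stored patterns.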
Since the component tests are developed under the constraints imposed by the processor instruction
set, it will always be possible to find instructions for applying the component tests. On the output end,
special care must be taken when collecting component test response. Inasmuch as data outputs and status
outputs have different observability, they should be treated differently during response collection. In
general, although there are no instructions for storing the status outputs of a component directly to
memory, an image of the status outputs can be created in memory using conditional instructions. This
technique can be used to observe the status outputs of any component.
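Behaviorally, this conditional-instruction technique amounts to the following Python model of what the self-test program does with branch or conditional-move instructions; the memory layout is hypothetical:

```python
def image_status_outputs(status_bits, memory, base):
    """Create a memory image of status outputs that have no direct
    store instruction: for each flag, a conditional instruction writes
    1 if the flag is set and 0 otherwise (the ternary models, e.g.,
    'branch-if-clear over a store of #1')."""
    for i, bit in enumerate(status_bits):
        memory[base + i] = 1 if bit else 0
    return memory
```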
Using manually extracted constraints, the above scheme has been applied to a simple Parwan pro-
cessor [13]. The generated test program could achieve a high coverage for stuck-at faults in this simple
processor.
28.3 Test Program Synthesis Using VCCs
Tupuri et al. proposed an approach in Reference 10 for generating functional tests for processors by using
a gate-level sequential ATPG tool. It attempts to generate tests for all detectable stuck-at faults under the
functional constraints, and then applies these functional test vectors at the system's operational speed.
The key idea of this approach lies in the synthesized logic embodying the functional constraints, also
known as VCCs. After the functional constraints of an embedded module have been extracted, they are
described in hardware description language (HDL) and synthesized into logic gates. Then a commercial
ATPG is used to generate module-level vectors with such constraint circuitry imposed. These module-level
vectors are translated to processor-level functional vectors and fault simulated to verify the fault coverage.
Figure 28.4 illustrates this hierarchical test generation process using a gate-level test generator for sequential
circuits.
Chen et al. [9] performed module-level test generation for embedded processors using the concept
of VCCs, but with a different utilization such that the generated test vectors can be plugged directly into the
settable fields (e.g., operands, source, and destination registers) in test program templates. This utilization
simplifies the automated generation of test programs for embedded processors. Figure 28.5 shows the
overall test program synthesis process proposed in Reference 9, in which the final self-test program can
be synthesized automatically from (1) a simulatable HDL processor design at the RTL (register transfer
level), and (2) the instruction set architecture (ISA) specification of the embedded processor. The goal and
the process for each step are presented as follows:
Step 1. Partition the processor into a collection of combinational blocks, each a module-under-test (MUT);
the test program for each MUT will be synthesized separately.
Step 2. Systematically construct a comprehensive set of test program templates. Test program
templates can be classified into single-instruction templates and multi-instruction templates.
FIGURE 28.4 Use of VCCs for functional test generation. (From R.S. Tupuri and J.A. Abraham, in Proceedings of the
IEEE International Test Conference (ITC), September 1997. With permission.)
Single-instruction templates are built around one key instruction, whereas multi-instruction templates
include additional supporting instructions, for example, to trigger pipeline forwarding.
Exhausting all possibilities in generating test program templates would be impossible, but generating
a wide variety of templates is necessary in order to achieve high fault coverage.
Step 3. Rank templates based on a controllability-/observability-based testability metric through simu-
lation. Templates at the top of the list T_m have high controllability (meaning it is easy to set specific
values at the inputs of the MUT) and/or high observability (meaning it is easy to propagate the
values at the output of the MUT to data registers or to observation points, which can be mapped
onto and stored in the memory).
Step 4. Derive the input mapping functions for each template t from the program template's settable
fields (which include operands, source registers, and destination registers) to the inputs of the MUT.
Also derive the output mapping functions from the MUT's outputs to the system's observation
points.
The input mapping functions can be derived by simulating a number of instances of template t
to obtain traces, followed by regression analysis to construct the mapping function between settable
fields and inputs of the MUT.
The output mapping functions can be derived by injecting the unknown X value at the outputs
of the MUT for simulation, followed by observing the propagation of the X values to the specified
template's destinations.
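The regression of Step 4 can be sketched as follows for a single settable field and a single MUT input, assuming (purely for illustration) that the mapping is linear and that `simulate_template` stands in for instruction-level simulation of one template instance:

```python
import random

def derive_input_mapping(simulate_template, n_samples=64, field_range=256):
    """Approximate the regression step of Step 4: run template instances
    with random values in one settable field, record the value observed
    at the MUT input, and fit input = a*field + b by least squares."""
    fields = random.sample(range(field_range), n_samples)
    inputs = [simulate_template(f) for f in fields]
    n = len(fields)
    mean_f = sum(fields) / n
    mean_i = sum(inputs) / n
    cov = sum((f - mean_f) * (i - mean_i) for f, i in zip(fields, inputs))
    var = sum((f - mean_f) ** 2 for f in fields)
    a = cov / var
    b = mean_i - a * mean_f
    return a, b
```

In practice each bit of each MUT input would get its own (often Boolean rather than linear) mapping, but the sample-then-fit structure is the same.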
Step 5. Synthesize the mapping functions into VCCs. The utilization of VCCs not only enforces the
instruction-imposed constraints, but also facilitates the translation from module-level test patterns
to instruction-level test programs. First, implement the mapping functions between settable fields
in template t and inputs of MUT m as the input-side VCC, and insert it into MUT m. Similarly,
insert the output-side VCC that embodies the output mapping functions.
FIGURE 28.5 Overview of the scalable software-based self-test methodology. (From L. Chen, S. Ravi,
A. Raghunathan, and S. Dey, in Proceedings of the ACM/IEEE Design Automation Conference (DAC), June 2003.
With permission.)
FIGURE 28.6 Constrained test generation using VCCs. (From L. Chen, S. Ravi, A. Raghunathan, and S. Dey,
in Proceedings of the ACM/IEEE Design Automation Conference (DAC), June 2003. With permission.)
FIGURE 28.7 Example of test program synthesis. (From L. Chen, S. Ravi, A. Raghunathan, and S. Dey, in Proceedings
of the ACM/IEEE Design Automation Conference (DAC), June 2003. With permission.)
Step 6. Generate module-level tests for the composite circuit of the MUT between the input/output virtual
constraint components. During constrained test generation, the test generator sees the circuit
including MUT m and the two VCCs, as shown in Figure 28.6. Note that faults within the VCCs
will be eliminated from the fault list and so will not be considered for test generation. With this
composite model, the pattern generator can generate patterns with values directly specified at the
settable fields in instruction template t.
Step 7. Synthesize the target test program for the patterns generated in Step 6. Note that the generated
test patterns of Step 6 assign values to some of the settable fields of each instruction template t. The
other settable fields without a value assignment in Step 6 are filled with random values. The test
program is then synthesized by converting the values of each settable field into its corresponding
position in instruction template t. Figure 28.7 gives an example of the flow for synthesizing the
target program.
Step 8. Perform fault simulation on the synthesized test program segment to identify the subset of
stuck-at faults detected by the program segment.
Step 9. Update the set of undetected faults and rerank the remaining templates in template list T_m to
prepare for the next iteration of test program generation.
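Steps 3 and 6 to 9 together form a greedy coverage loop, which might be sketched as follows; pattern generation and fault simulation are abstracted into a precomputed `faults_detected_by` map, which is a simplification of the real flow:

```python
def synthesize_test_programs(templates, faults_detected_by, all_faults):
    """Greedy sketch of the per-MUT loop: repeatedly pick the template
    that covers the most still-undetected faults (the 'rerank' step),
    emit its program, and shrink the undetected-fault set."""
    undetected = set(all_faults)
    programs = []
    remaining = list(templates)
    while undetected and remaining:
        # Rerank: score each template by the faults it would newly detect
        best = max(remaining,
                   key=lambda t: len(faults_detected_by[t] & undetected))
        gain = faults_detected_by[best] & undetected
        if not gain:
            break  # no remaining template improves coverage
        programs.append(best)
        undetected -= gain
        remaining.remove(best)
    return programs, undetected
```

The loop terminates either when all faults are covered or when no remaining template detects any new fault, mirroring the "acceptable fault coverage?" exit test in Figure 28.5.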
In Reference 14, the above process is further extended to synthesize test programs for detecting cross-talk
faults. Unlike stuck-at faults, signal integrity problems such as cross-talk need to be tested by
applying a sequence of vectors at operational speed. The requirements for generating multiple specific
vectors, while observing instruction-imposed constraints at the same time, pose challenges in test program
synthesis. The semiautomated test program generation framework proposed in Reference 14 combines
multiple instruction-level constraints (multiple VCCs) with a structural ATPG algorithm to select the
instruction sequences and their corresponding operand values for detecting cross-talk faults. Preliminary
results were demonstrated for an industrial processor, Xtensa, from Tensilica Inc.
28.4 Delay Testing
Ensuring that designs meet performance specifications requires the application of delay tests. These
tests should be applied at-speed and contain two-vector patterns, applied to the combinational portion
of the circuit under test, to activate and propagate the fault effects to registers or other observation
points [33]. A software-based self-test method targeting delay faults in processor cores has been proposed
by Lai et al. [7,8]. As in the case of stuck-at faults, not all delay faults in the microprocessor can be tested
in the functional mode, simply because no instruction sequence can produce the desired test
sequence that sensitizes the path and captures the fault effect into the destination output/flip-flop at-speed.
A fault is said to be functionally testable if there exists a functional test for that fault. Otherwise, the fault
is functionally untestable.
To illustrate functionally untestable faults, consider part of a simple processor's datapath as shown in
Figure 28.8. It contains an 8-bit ALU, an accumulator (AC), and an instruction register (IR). The data
inputs of the ALU, A7-A0, and B7-B0, are connected to the internal data bus and the AC, respectively.
The control inputs of the ALU are S2-S0. The values in S2-S0 instruct the ALU to perform the desired
arithmetic/logic operation. The outputs of the ALU are connected to the inputs of AC and the inputs
of IR. It can be shown that for all possible instruction sequences, whenever a rising transition occurs
on signal S1 at the beginning of a clock cycle, AC and IR can never be enabled at the end of the same
cycle. Therefore, paths that start at S1 and end at the inputs of IR or AC are functionally untestable,
since delay effects on them can never be captured by IR or AC immediately after the vector pair has been
applied. The goal of the test preparation step is to identify functionally testable faults and synthesize tests
for them.
The flow of test program synthesis for the self-test of path delay faults in a microprocessor using its
instructions consists of four major steps:
1. Given the ISA and the micro-architecture of the processor core, the spatial and temporal constraints,
between and at the registers and control signals, are first extracted.
2. A path classication algorithm, extended from [15,16], implicitly enumerates and examines all
paths and path segments with the extracted constraints imposed. If a path cannot be sensitized
with the imposed extracted constraints, the path is functionally untestable and thus, is eliminated
FIGURE 28.8 Datapath example. (From W.-C. Lai, A. Krstic, and K.-T. Cheng, in Proceedings of the IEEE VLSI Test
Symposium (VTS), April 2000. With permission.)
from the fault universe. This helps reduce the computational effort of the subsequent test generation
process. The preliminary experimental results shown in Reference 7 indicate that a nontrivial percentage
of the paths in simple processors (such as the Parwan processor [13] and the DLX processor [17])
are functionally untestable but structurally testable.
3. A subset of long paths among the functionally testable paths is selected as targets for test generation.
A gate-level ATPG for path delay faults is extended to incorporate the extracted constraints
into the test generation process, where it is used to generate test vectors for each target path delay
fault. If the test is successfully generated, it not only sensitizes the path but also meets the extracted
constraints. Therefore, it is most likely to be deliverable by instructions (if the complete set of
constraints has been extracted, the delivery by instructions could be guaranteed).
4. In the test program synthesis process that follows, the test vectors specifying the bit values at internal
flip-flops are first mapped back to word-level values in registers and values at control signals.
These mapped value requirements are then justified at the instruction level. Finally, a predefined
propagating routine is used to propagate the fault effects captured in the registers/flip-flops of the
path delay fault to the memory. This routine compresses the contents of some or all registers in the
processor, generates a signature, and stores it in memory. The procedure is repeated until all target
faults have been processed. The test program, which is generated offline, will be used to test the
microprocessor at-speed.
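The register-compression step of the propagating routine could look like the following software MISR sketch; the word width and feedback polynomial here are arbitrary choices for illustration, not those used in [7,8]:

```python
def compress_registers(regs, width=16, poly=0x1021):
    """Fold a list of register values into one signature word, MISR-style:
    shift the running signature, XOR in the next register value, and
    apply the feedback polynomial whenever a bit falls off the top."""
    mask = (1 << width) - 1
    sig = 0
    for r in regs:
        sig = ((sig << 1) ^ (r & mask)) & ((1 << (width + 1)) - 1)
        if sig >> width:        # overflow bit set => apply feedback taps
            sig = (sig ^ poly) & mask
        else:
            sig &= mask
    return sig
```

Any single-bit corruption of a register value changes the signature, so comparing the stored signature against a golden value reveals whether a fault effect was captured.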
This test synthesis flow has been applied to the Parwan [13] and DLX [17] processors. On average,
5.3 and 5.9 instructions, respectively, were needed to deliver a test vector, and the achieved fault coverage
for testable path delay faults was 99.8% for Parwan and 96.3% for DLX.
28.5 Embedded Processor Self-Diagnosis
In addition to enabling at-speed self-test with low-cost testers, software-based self-test eliminates the use
of scan chains and the associated test overhead, making it an attractive solution for testing high-end
microprocessors. The elimination of scan chains, on the other hand, poses a significant challenge for fault
diagnosis. Though deterministic methods for generating diagnostic tests are available for combinational
circuits [18], sequential circuits are much too complex to be handled by the same approach. Consequently,
there have been several proposals on generating diagnostic tests for sequential circuits by modifying
existing detection tests [19,20]. A prerequisite for these methods is a high-coverage detection test set
for the sequential circuit under test. Thus, the success of these methods depends on the success of the
sequential test generation techniques.
Though current sequential ATPG techniques are not practical enough for handling large sequential
circuits, software-based self-test methods have the ability to successfully generate tests for a particular
type of sequential circuit: microprocessors. If properly modified, these tests might achieve
a high diagnostic capability. In addition, functional information (ISA and micro-architecture) can be
used to guide and facilitate diagnosis.
An initial investigation of the diagnostic potential of software-based self-test was reported in Reference 21,
which attempted to generate test programs geared toward diagnosis. Diagnosis is performed by analyz-
ing the combination of test responses to a large number of small diagnostic test programs. To achieve
high diagnostic resolution, the diagnostic test programs are generated in such a way that each test
program detects as few faults as possible, while the union of all test programs detects as many faults as
possible.
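The diagnostic resolution of such a program set can be quantified by grouping faults that produce identical pass/fail signatures across all programs; the sketch below (with hypothetical program and fault names in the test) is one way to perform that analysis:

```python
def diagnostic_classes(detects):
    """Group faults that are indistinguishable by the test-program set.
    `detects[p]` is the set of faults that program p detects; two faults
    with the same pass/fail signature across all programs fall into the
    same equivalence class (they cannot be told apart)."""
    programs = sorted(detects)
    faults = set().union(*detects.values())
    classes = {}
    for f in faults:
        signature = tuple(f in detects[p] for p in programs)
        classes.setdefault(signature, set()).add(f)
    return list(classes.values())
```

Smaller classes mean finer resolution; the ideal of one fault per class is exactly what generating many small, minimally overlapping test programs tries to approach.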
28.6 Self-Testing of Buses and Global Interconnects
In SoC designs, a large amount of core-to-core communication must be realized with long interconnects.
As we find ways to decrease gate delay, the performance of interconnect is becoming increasingly important
for achieving high overall performance [2]. However, owing to the increase of cross-coupling capacitance
and mutual inductance, signals on neighboring wires may interfere with each other, causing excessive delay
or loss of signal integrity. While many techniques have been proposed to reduce cross-talk, owing to the
limited design margin and unpredictable process variations, cross-talk must also be addressed in
manufacturing testing.
Owing to the nature of its timing, testing for cross-talk effects should be conducted at the operational
speed of the circuit under test. However, at-speed testing of GHz systems requires prohibitively expensive
high-speed testers. Moreover, with external testing, hardware access mechanisms are required for applying
tests to interconnects deeply embedded in the system. This may lead to unacceptable costs in area or
performance overhead.
A BIST technique in which a SoC tests its own interconnects for cross-talk defects using on-chip hard-
ware pattern generators and error detectors has been proposed in Reference 22. Although the amount of
area overhead may be amortized for large systems, for small systems, the amount of relative area overhead
may be unacceptable. Moreover, hardware-based self-test approaches, such as the one in Reference 22,
may cause over-testing and yield loss, as not all test patterns generated in the test mode are valid in the
normal operational mode of the system.
The problem of testing system-level interconnects in embedded processor-based SoCs has been
addressed in References 23 and 24. In such SoCs, most of the system-level interconnects, such as the
on-chip buses, are accessible to the embedded processor core(s). The proposed methodology, being
software-based, enables an embedded processor core in the SoC to test for cross-talk effects in these inter-
connects by executing a software program. The strategy is to let the processor execute a self-test program
with which the test vector pairs can be applied to the appropriate bus in the normal functional mode of
the system. In the presence of cross-talk-induced glitch or delay effects, the second vector in the vector
pair becomes distorted at the receiver end of the bus. The processor, however, can store this error effect to
the memory as a test response, which can be later unloaded by an external tester for off-chip analysis.
The maximum aggressor (MA) fault model proposed in Reference 25 is suitable for modeling cross-talk
defects on interconnects. It abstracts the cross-talk defects on global interconnects by a linear number
of faults. It defines faults based on the resulting cross-talk error effects, including positive glitch (g_p),
negative glitch (g_n), rising delay (d_r), and falling delay (d_f). For a set of N interconnects, the MA fault
model considers the collective aggressor effects on a given victim line Y_i, while all other N - 1 wires act as
aggressors. The required transitions on the aggressor/victim lines to excite the four error types are shown
in Figure 28.9. For example, the test for a positive glitch (g_p) at victim line Y_i, as shown in the first column
of Figure 28.9, requires that line Y_i holds a constant 0 value while the other N - 1 aggressor lines have a
rising transition. Under this pattern, the victim line Y_i would have a positive glitch owing to the cross-talk
effect. If excessive, the glitch would result in errors. These patterns, collectively called MA tests, excite the
worst-case cross-talk effects on the victim line Y_i. For a set of N interconnects, there are 4N MA faults,
requiring 4N MA tests. It has been shown in Reference 25 that these 4N faults cover all cross-talk defects
on any of the N interconnects.
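The 4N MA vector pairs follow mechanically from this definition; a sketch (abstracting each vector to a list of per-line logic values) might read:

```python
def ma_tests(n):
    """Enumerate the 4N maximal-aggressor vector pairs (v1, v2) for an
    n-line bus, following Figure 28.9: for each victim line i,
      gp: victim stays 0, aggressors rise (0 -> 1)
      gn: victim stays 1, aggressors fall (1 -> 0)
      dr: victim rises,  aggressors fall
      df: victim falls,  aggressors rise
    Vectors are bit-lists indexed by line number."""
    victim_bits = {"gp": (0, 0), "gn": (1, 1), "dr": (0, 1), "df": (1, 0)}
    aggressor_bits = {"gp": (0, 1), "gn": (1, 0), "dr": (1, 0), "df": (0, 1)}
    tests = []
    for i in range(n):
        for fault in ("gp", "gn", "dr", "df"):
            a1, a2 = aggressor_bits[fault]
            v1 = [a1] * n
            v2 = [a2] * n
            v1[i], v2[i] = victim_bits[fault]
            tests.append((i, fault, v1, v2))
    return tests
```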
In a core-based SoC, the address, data, and control buses are the main types of global interconnects
with which the embedded processors communicate with memory and other cores of the SoC via
FIGURE 28.9 Maximal aggressor tests for victim Y_i. (From M. Cuviello, S. Dey, X. Bai, and Y. Zhao, in Proceedings
of the IEEE International Conference on Computer-Aided Design (ICCAD), November 1999. With permission.)
FIGURE 28.10 Testing the address bus. (From X. Bai, S. Dey, and J. Rajski, in Proceedings of the ACM/IEEE Design
Automation Conference (DAC), June 2000. With permission.)
memory-mapped I/O. Li et al. [23] concentrate on testing the data and address buses in a processor-based
SoC. The cross-talk effects on the interconnects are modeled using the MA fault model:
Testing data bus. For a bidirectional bus such as a data bus, cross-talk effects vary as the bus is driven
from different directions. Thus cross-talk tests should be conducted in both directions [22]. However, to
apply a pair of vectors (v1, v2) in a particular bus direction, the direction of v1 is irrelevant, as long as
the logic value at the bus is held at v1. Only v2 needs to be applied in the specied bus direction. This is
because the signal transition triggering the cross-talk effect takes place only when v2 is being applied to
the bus.
To apply a test vector pair (v1, v2) for the data bus from a SoC core to the CPU, the CPU first exchanges
data v1 with the core. The direction of data exchange is irrelevant, for example, if the core is a memory,
the CPU may either read v1 from the memory or write v1 to the memory. The CPU then requests data
v2 from the core (a memory-read if the core is memory). Upon the arrival of v2, the CPU writes v2 to
memory for later analysis.
To apply a test vector pair (v1, v2) to the data bus from the CPU to a SoC core, the CPU first exchanges
data v1 with the core. Then, the CPU sends data v2 to the core (a memory-write if the core is memory).
If the core is memory, v2 can be directly stored to an appropriate address for later analysis. Otherwise, the
CPU must execute additional instructions to retrieve v2 from the core and store it to memory.
Testing address bus. To apply a test vector pair (v1, v2) to the address bus, which is a unidirectional bus
from the CPU to a SoC core, the CPU first requests data from two addresses (v1 and v2) in consecutive
cycles. In the case of a nonmemory core, since the CPU addresses the core via memory-mapped I/O,
v2 must be the address corresponding to the core. If v2 is distorted by cross-talk, the CPU would
be receiving data from a wrong address, v2'. As long as different data are stored at v2 and v2',
the CPU is able to observe the error and store it in memory for analysis. Figure 28.10 illustrates this
process. For example, in the case where the CPU is communicating with a memory core, to apply test
(0001, 1110) on the address bus from the CPU to the memory core, the CPU first reads data from address
0001 and then from address 1110. In a system with a faulty address bus, the latter address may become
1111. If different data are stored at addresses 1110 and 1111 (mem[1110] = 0100, mem[1111] = 1001),
the CPU would receive a faulty value from memory (1001 instead of 0100). This error response can later
be stored in memory for analysis.
The feasibility of this method has been demonstrated by applying it to test the interconnects of
a processor-memory system. The defect coverage was evaluated using a system-level cross-talk-defect
simulation method.
Functionally Maximal Aggressor (FMA) tests. Even though the MA tests have been proven to cover all
physical defects related to cross-talk between interconnects, Lai et al. [24] observe that many of them
can never occur during normal system operation owing to constraints imposed by the system. Therefore,
testing buses using MA tests might screen out chips that are functionally correct under any pattern
produced under normal system operation. Instead, functionally maximal aggressor (FMA) tests that meet
the system constraints and can be delivered under the functional mode are proposed in [24]. These tests
provide a complete coverage of all cross-talk-induced logical and delay faults that can cause errors during
the functional mode.
Given the timing diagrams of all bus operations, the spatial and temporal constraints imposed on the
buses can be extracted and FMA tests can be generated. A covering relationship between vectors extracted
from the timing diagrams of the bus commands is used during the FMA test generation process. Since the
resulting FMA tests are highly regular, they can be generated in an algorithmic way. Therefore, the FMA
tests are clustered and fit into a few groups. The tests in each group are highly similar except that the
victim lines are different. Therefore, as with a marching sequence (which is commonly used for testing
memory), the tests in each group can be synthesized by a software routine. The synthesized test program
is highly modularized and very small. Experimental results have shown that a test program as small as
3000 to 5000 bytes can detect all cross-talk defects on the bus from the processor core to the target core.
Next, the synthesized test program is applied to the bus from the processor core, and the input buffers
of the destination core capture the responses at the other end of the bus. Such responses should be read
back by the processor core to determine whether any faults occurred on the bus. However, because the
input buffers of a nonmemory core cannot be read by the processor core, a DfT scheme is suggested to
allow direct observability of the input buffers by the processor core. The DfT circuitry consists of bypass
logic added to each I/O core to improve its testability.
With the DfT support on the target I/O core, the test generation procedure first synthesizes instructions
to set the target core to the bypass mode, and then it continues with synthesizing instructions for the FMA
tests. The test generation procedure does not depend on the functionality of the target core.
28.7 Self-Testing of Other Nonprogrammable IP Cores
Testing nonprogrammable cores on a SoC is a complex problem with many unresolved issues [26].
Industry initiatives such as the IEEE P1500 Working Group [27] provide some solutions for IP core
testing. However, they do not address the requirements of at-speed testing.
A self-testing approach for nonprogrammable cores on a SoC has been proposed in Reference 26.
In this approach, a test program running on the embedded processor delivers test patterns to other IP
cores in the SoC at-speed. The test patterns can be generated on the processor itself or fetched from an
external ATE and stored in on-chip memory. This alleviates the need for dedicated test circuitry for pattern
generation and response analysis. The approach is scalable to large-size IP cores whose structural netlists
are available. Since the pattern delivery is done at the SoC operational speed, it supports delay test. A test
wrapper (shown in Figure 28.11) is placed around each core to support pattern delivery. It contains test
support logic needed to control shifting of the scan chain, buffers to store scan data, buffers to support
at-speed test, etc.
The test flow based on the embedded software self-testing methodology is illustrated in Figure 28.11.
It offers tremendous flexibility in the type of tests that can be applied to the IP cores as well as in the
quality of the test pattern set, without entailing significant hardware overhead. Again, the flow is divided
into a preprocessing phase and a testing phase.
In the preprocessing phase, a test wrapper is automatically inserted around the IP core under test.
The test wrapper is configured to meet the specific testing needs of the IP core. The IP core is then
fault-simulated with different sets of patterns. Weighted random patterns generated with multiple weight
sets are used in Reference 26. In Reference 5, multiple capture cycles are used after each scan sequence. Next, a high-level
test program is generated. This program synchronizes the software pattern generation, the start of the test,
the application of the test, and the analysis of the test response. The program can also synchronize testing multiple
cores in parallel. The test program is then compiled to generate processor-specific binary code.
Embedded Software-Based Self-Testing for SoC Design 28-15
FIGURE 28.11 The test flow for testing nonprogrammable IP cores. (From J.-R. Huang, M.K. Iyer, and K.-T. Cheng,
in Proceedings of the IEEE VLSI Test Symposium (VTS), April 2001. With permission.)
In the test phase, the test program is run on the processor core to test various IP cores. A test packet
is sent to the IP core test wrapper informing it about the test application scheme (single- or multiple-
capture cycle). Data packets are then sent to load the scan buffers and the PI/PO buffers. The test wrapper
applies the required number of scan shifts and captures the test response for the programmed number of
functional cycles. The results of the test are stored in the PI/PO buffers and the scan buffers; from there
they are read out by the processor core.
28.8 Instruction-Level DfT/Test Instructions
Several potential benefits can accrue from self-testing manufacturing defects in a SoC by running test
programs using a programmable core. These include at-speed testing, low DfT overhead (owing to elimination
of dedicated test circuitry), and better power and thermal management during testing. However,
such a self-test strategy might require a lengthy test program and still might not achieve sufficiently high
fault coverage. These problems can be alleviated by applying a DfT methodology based on adding test
instructions to an on-chip programmable core such as a microprocessor core. This methodology is called
instruction-level DfT.
Instruction-level DfT inserts test circuitry in the form of test instructions. It should be less intrusive
than gate-level DfT techniques, which attempt to create a separate test mode somewhat
orthogonal to the functional mode. If the test instructions are carefully designed such that their micro-instructions
reuse the datapath of the functional instructions and do not require any new datapath, then
the overhead, which occurs only in the controller, should be relatively low. This methodology is also more
attractive for applying at-speed tests and for power/thermal management during test, as compared with
existing logic BIST approaches.
Instruction-level DfT methods have been proposed in References 28 and 29. The approach in
Reference 28 adds instructions to control exceptions such as microprocessor interrupts and reset.
With the new instructions, the test program can achieve a fault coverage close to 90% for stuck-at faults.
However, this approach cannot achieve higher coverage because the test program is synthesized using
a random approach and cannot effectively control or observe some internal registers that have low
testability.
The DfT methodology proposed in Reference 29 systematically adds test instructions to an on-chip
processor core to improve the self-testability of the processor core, reduce the size of the self-test program,
and reduce its runtime (i.e., reduce the test application time). To decide which instructions to add, the
testability of the processor is analyzed first. If a register in the processor is identified as hard to access, a test
instruction allowing direct access to the register is added. The testability of a register can be determined
based on the availability of data movement instructions between registers and memory. A register is said to
be fully controllable if there exists a sequence of data movement instructions that can move the desired
data from memory to the register. Similarly, a register is said to be fully observable if there exists a sequence
of data movement instructions to propagate the register data to memory. Given the micro-architecture of
a processor core, it is possible to identify the fully controllable and fully observable registers. For registers
that are not fully controllable/observable, new instructions can be added to improve their accessibility.
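One way to mechanize this testability analysis is to model the data movement instructions as a directed graph over registers and memory, and compute reachability from and to memory. The sketch below is a minimal illustration under that assumption; the register names and the move list are invented, not drawn from the Parwan or DLX instruction sets.

```python
# Classify registers as fully controllable (reachable from memory via data
# movement instructions) and fully observable (memory reachable from them).
from collections import deque

def reachable(edges, start):
    """BFS over a directed graph given as (src, dst) edge pairs."""
    adj = {}
    for s, d in edges:
        adj.setdefault(s, set()).add(d)
    seen = {start}
    q = deque([start])
    while q:
        n = q.popleft()
        for m in adj.get(n, ()):
            if m not in seen:
                seen.add(m)
                q.append(m)
    return seen

def classify(registers, moves):
    """Return {register: (controllable, observable)} relative to 'MEM'."""
    controllable = reachable(moves, 'MEM')
    # Observability is reachability to memory, i.e., from memory in the
    # reversed graph.
    reversed_moves = [(d, s) for s, d in moves]
    observable = reachable(reversed_moves, 'MEM')
    return {r: (r in controllable, r in observable) for r in registers}
```

For instance, with moves `MEM→R1`, `R1→R2`, `R2→MEM`, and `R3→R1`, registers R1 and R2 come out fully controllable and observable, while R3 is observable but not controllable, so a new instruction writing R3 directly would be the candidate addition.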
In addition, test instructions can also be added to optimize the test program size and runtime. This
is based on the observation that in the synthesized self-test program some code segments (called hot
segments) appear repeatedly. Therefore, the addition of a few test instructions can reduce the size of hot
segments. Test instructions can also be added to speed up the preparation of test vectors by the
processor core, the retrieval of responses from the on-chip core under test, and the analysis of the responses
(by the processor core).
When adding new instructions, the existing hardware should be reused as much as possible to reduce
the area overhead. Adding extra buses or registers to implement new instructions should be avoided;
in most cases, a new instruction can be added by introducing new control signals to the
datapath rather than by adding hardware.
Adding test instructions to the programmable core does not improve the testability of other
nonprogrammable cores on the SoC. Therefore, instruction-level DfT cannot increase the fault coverage
of the nonprogrammable cores. However, the programs for testing the nonprogrammable cores can be
optimized by adding new instructions. In other words, the same set of test instructions added for self-testing
the programmable core can be used to reduce the size and runtime of the test programs for
testing other nonprogrammable cores. For pipelined designs, instructions can be added to manage the
difficult-to-control registers buried deep in the pipeline.
Experimental results for two processors (Parwan and DLX) show that test instructions can reduce
the program size and runtime by about 20% at the cost of about a 1.6% increase in area.
28.9 Self-Test of On-Chip ADC/DAC and Analog Components
Using DSP-Based Approaches
For mixed-signal systems that integrate both analog and digital functional blocks onto the same chip,
testing of analog/mixed-signal parts has become the bottleneck during production testing. Because most
analog/mixed-signal circuits are functionally tested, their testing needs expensive ATE for analog stimulus
generation and response acquisition. One promising solution to this problem is BIST that utilizes on-chip
resources (either shared with functional blocks or dedicated BIST circuitry) to perform on-chip stimulus
generation and response acquisition. Under the BIST approach, the demands on the external test equip-
ment are less stringent. Furthermore, stimulus generation and response acquisition is less vulnerable to
environmental noise during the test process.
With the advent of CMOS technology, DSP-based BIST becomes a viable solution for analog/mixed-signal
systems, as the required signal processing to make the pass/fail decision can be realized in the digital
domain with digital resources. In DSP-based BIST schemes [30,31], on-chip DA and AD converters are
used for stimulus generation and response acquisition, and DSP resources (such as CPU or DSP cores)
are used for the required signal synthesis and response analysis. The DSP-based BIST scheme is attractive
because of its flexibility: various tests, such as AC, DC, and transient tests, can be performed by modifying
the software routines without altering the hardware. However, on-chip AD and DA converters are not
always available in mixed-signal SoC devices. In Reference 32, the authors propose to use a one-bit first-order
delta-sigma modulator as dedicated BIST circuitry for on-chip response acquisition, in case an
on-chip AD converter is not available. Owing to its over-sampling nature, the delta-sigma modulator can
tolerate relatively high process variations and matching inaccuracy without causing functional failure, and
is therefore particularly suitable for VLSI implementation. This solution is suitable for low-to-medium
frequency applications (for example, audio signals).
FIGURE 28.12 DSP-based self-test for analog/mixed-signal parts. (From J.L. Huang and K.T. Cheng, in Proceedings
of the Asia and South Pacific Design Automation Conference, January 2000. With permission.)
Figure 28.12 illustrates the overall delta-sigma modulation-based BIST architecture. It employs the
delta-sigma modulation technique for both stimulus generation [33] and response analysis [32].
A software delta-sigma modulator converts the desired signal to a one-bit digital stream. The digital 1s
and 0s are then converted to two discrete analog levels by a one-bit DAC followed by a low-pass filter
that removes the out-of-band high-frequency modulation noise, thus restoring the original waveform. In
practice, we extract a segment from the delta-sigma output bit stream that contains an integer number of
signal periods. The extracted pattern is stored in on-chip memory, and then periodically applied to the
low-resolution DAC and low-pass filter to generate the desired stimulus. Similarly, for response analysis,
a one-bit modulator can be inserted to convert the analog DUT output response into a one-bit
stream, which is then analyzed by DSP operations performed by on-chip DSP/microprocessor cores.
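For intuition, the software delta-sigma modulator can be sketched in a few lines. The following is a generic textbook first-order formulation (integrator plus one-bit quantizer with feedback), not the specific modulator of Reference 32 or 33: for a DC input d in [-1, 1], the density of 1s in the output stream approaches (d + 1)/2, so the low-frequency content of the bit stream tracks the input.

```python
def first_order_sdm(samples):
    """First-order software delta-sigma modulator.

    Maps input samples in [-1, 1] to a one-bit stream (0/1) whose
    low-frequency content tracks the input signal.
    """
    integ = 0.0   # integrator state
    fb = 0.0      # one-bit DAC feedback (+1 or -1 after the first step)
    bits = []
    for x in samples:
        integ += x - fb            # accumulate the quantization error
        bit = 1 if integ >= 0 else 0
        fb = 1.0 if bit else -1.0  # feedback through the one-bit DAC
        bits.append(bit)
    return bits
```

For example, a constant input of 0.5 yields a bit stream whose average density of 1s converges to about 0.75, and an input of 0.0 yields a density of about 0.5; in the BIST scheme, a segment of such a stream covering an integer number of signal periods would be stored in on-chip memory and replayed through the one-bit DAC and low-pass filter.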
Among the one-bit modulation architectures, the first-order configuration is the most stable and
has the maximal input dynamic range. However, it is not practical for high-resolution applications
(as a rather high over-sampling rate would be needed), and it suffers from inter-modulation distortion (IMD).
Compared with the first-order configuration, the second-order configuration has a smaller dynamic range
but is more suitable for high-resolution applications.
Note that the software part of this technique, that is, the software modulator and the response analyzer,
can be executed by on-chip DSP/microprocessor cores, if abundant on-chip digital programmable
resources are available (as indicated in Figure 28.12), or by external digital test equipment.
28.10 Conclusions
Embedded software-based self-testing has the potential to alleviate problems with many of the current
external tester-based and hardware BIST testing techniques for SoCs. In this chapter, we give a summary
of the recently proposed techniques on this subject. One of the main tasks in applying these
techniques is extracting the functional constraints in the process of test program synthesis, that is,
deriving tests that can be delivered by processor instructions. Future research in this area must address the
problem of automating the constraint extraction process in order to make the proposed solutions fully
automatic for general embedded processors. The software-based self-testing paradigm can be further generalized
for analog/mixed-signal components through the integration of DSP-based testing techniques,
modulation principles, and some low-cost analog/mixed-signal DfT.
Acknowledgments
The authors wish to thank L. Chen and T.M. Mak of Intel, Angela Krstic of Cadence, Sujit Dey of UC
San Diego, Larry Lai of Novas, and L.-C. Wang and Charles Wen of UC Santa Barbara for their efforts and
contribution to this chapter.
References
[1] Semiconductor Industry Association, The National Technology Roadmap for Semiconductors, 1997.
[2] Semiconductor Industry Association, The International Technology Roadmap for Semiconductors,
2003.
[3] C.-J. Lin, Y. Zorian, and S. Bhawmik, Integration of Partial Scan and Built-In Self-Test, Journal of
Electronic Testing: Theory and Applications (JETTA), 7(1–2): 125–137, August 1995.
[4] K.-T. Cheng and C.-J. Lin, Timing-Driven Test Point Insertion for Full-Scan and Partial-Scan BIST,
in Proceedings of the IEEE International Test Conference (ITC), Washington D.C., October 1995.
[5] H.-C. Tsai, S. Bhawmik, and K.-T. Cheng, An Almost Full-Scan BIST Solution – Higher Fault Coverage
and Shorter Test Application Time, in Proceedings of the IEEE International Test Conference
(ITC), Washington, D.C., October 1998.
[6] L. Chen and S. Dey, Software-Based Self-Testing Methodology for Processor Cores, IEEE
Transactions on Computer-Aided Design (TCAD), 20(3): 369–380, March 2001.
[7] W.-C. Lai, A. Krstic, and K.-T. Cheng, On Testing the Path Delay Faults of a Microprocessor using
its Instruction Set, in Proceedings of the IEEE VLSI Test Symposium (VTS), Montreal, Canada, April
2000.
[8] W.-C. Lai, A. Krstic, and K.-T. Cheng, Test Program Synthesis for Path Delay Faults in Micro-
processor Cores, in Proceedings of the IEEE International Test Conference (ITC), Washington, D.C.,
October 2000.
[9] L. Chen, S. Ravi, A. Raghunathan, and S. Dey, A Scalable Software-Based Self-Test Methodology for
Programmable Processors, in Proceedings of the ACM/IEEE Design Automation Conference (DAC),
Anaheim, CA, June 2003.
[10] R.S. Tupuri and J.A. Abraham, A Novel Functional Test Generation Method for Processors using
Commercial ATPG, in Proceedings of the IEEE International Test Conference (ITC), Washington
D.C., September 1997.
[11] S. Hellebrand and H.-J. Wunderlich, Mixed-Mode BIST Using Embedded Processors, in Proceed-
ings of the IEEE International Test Conference (ITC), Washington, D.C., October 1996.
[12] R. Dorsch and H.-J. Wunderlich, Accumulator Based Deterministic BIST, in Proceedings of the
IEEE International Test Conference (ITC), Washington, D.C., October 1998.
[13] Z. Navabi, VHDL: Analysis and Modeling of Digital Systems. McGraw-Hill, New York, 1997.
[14] X. Bai, L. Chen, and S. Dey, Software-Based Self-Test Methodology for Crosstalk Faults in Processors,
in Proceedings of the IEEE High-Level Design Validation and Test Workshop, San Francisco,
CA, November 2003, pp. 11–16.
[15] K.-T. Cheng and H.-C. Chen, Classification and Identification of Nonrobustly Untestable Path
Delay Faults, IEEE Transactions on Computer-Aided Design (TCAD), 15(8): 845–853, August 1996.
[16] A. Krstic, S.T. Chakradhar, and K.-T. Cheng, Testable Path Delay Fault Cover for Sequential
Circuits, in Proceedings of the European Design Automation Conference, Geneva, Switzerland,
September 1996.
[17] M. Gumm, VHDL Modeling and Synthesis of the DLXS RISC Processor. VLSI Design Course
Notes, University of Stuttgart, Germany, December 1995.
[18] T. Grüning, U. Mahlstedt, and H. Koopmeiners, DIATEST: A Fast Diagnostic Test Pattern
Generator for Combinational Circuits, in Proceedings of the IEEE International Conference on
Computer-Aided Design (ICCAD), Santa Clara, CA, November 1991.
[19] X. Yu, J. Wu, and E.M. Rudnick, Diagnostic Test Generation for Sequential Circuits, in Proceedings
of the IEEE International Test Conference (ITC), Washington, D.C., October 2000.
[20] I. Pomeranz and S.M. Reddy, A Diagnostic Test Generation Procedure Based on Test Elimination
by Vector Omission for Synchronous Sequential Circuits, IEEE Transactions on Computer-Aided
Design (TCAD), 19(5): 589–600, May 2000.
[21] L. Chen and S. Dey, Software-Based Diagnosis for Processors, in Proceedings of the ACM/IEEE
Design Automation Conference (DAC), New Orleans, LA, June 2002.
[22] X. Bai, S. Dey, and J. Rajski, Self-Test Methodology for At-Speed Test of Crosstalk in Chip Inter-
connects, in Proceedings of the ACM/IEEE Design Automation Conference (DAC), Los Angeles, CA,
June 2000.
[23] L. Chen, X. Bai, and S. Dey, Testing for Interconnect Crosstalk Defects Using On-Chip Embedded
Processor Cores, in Proceedings of the ACM/IEEE Design Automation Conference (DAC), Las Vegas,
NV, June 2001.
[24] W.-C. Lai, J.-R. Huang, and K.-T. Cheng, Embedded-Software-Based Approach to Testing
Crosstalk-Induced Faults at On-Chip Buses, in Proceedings of the IEEE VLSI Test Symposium
(VTS), Marina Del Rey, CA, April 2001.
[25] M. Cuviello, S. Dey, X. Bai, and Y. Zhao, Fault Modeling and Simulation for Crosstalk in System-
on-Chip Interconnects, in Proceedings of the IEEE International Conference on Computer-Aided
Design (ICCAD), San Jose, CA, November 1999.
[26] J.-R. Huang, M.K. Iyer, and K.-T. Cheng, A Self-Test Methodology for IP Cores in Bus-Based
Programmable SoCs, in Proceedings of the IEEE VLSI Test Symposium (VTS), Marina Del Rey, CA,
April 2001.
[27] IEEE P1500 Web Site, http://grouper.ieee.org/groups/1500/
[28] J. Shen and J.A. Abraham, Native Mode Functional Test Generation for Processors with Applica-
tions to Self Test and Design Validation, in Proceedings of the IEEE International Test Conference
(ITC), Washington D.C., October 1998.
[29] W.-C. Lai and K.-T. Cheng, Instruction-Level DFT for Testing Processor and IP Cores in System-
on-a-Chip, in Proceedings of the ACM/IEEE Design Automation Conference (DAC), Las Vegas, NV,
June 2001.
[30] M.F. Toner and G.W. Roberts, A BIST Scheme for a SNR, Gain Tracking, and Frequency Response
Test of a Sigma–Delta ADC, IEEE Transactions on Circuits and Systems-II, 42: 1–15, January 1995.
[31] C.Y. Pan and K.T. Cheng, Pseudo-Random Testing and Signature Analysis for Mixed-Signal Circuits,
in Proceedings of the International Conference on CAD (ICCAD), San Jose, CA, November
1995, pp. 102–107.
[32] J.L. Huang and K.T. Cheng, A Sigma–Delta Modulation Based BIST Scheme for Mixed-Signal
Circuits, in Proceedings of the Asia and South Pacific Design Automation Conference, Yokohama,
Japan, January 2000.
[33] B. Dufort and G.W. Roberts, Signal Generation using Periodic Single and Multi-Bit Sigma–Delta
Modulated Streams, in Proceedings of the IEEE International Test Conference (ITC), Washington,
D.C., October 1997.
IV
Networked Embedded
Systems
29 Design Issues for Networked Embedded Systems
Sumit Gupta, Hiren D. Patel, Sandeep K. Shukla, and Rajesh Gupta
30 Middleware Design and Implementation for Networked Embedded Systems
Venkita Subramonian and Christopher Gill
29
Design Issues for
Networked Embedded
Systems
Sumit Gupta
Tallwood Venture Capital
Hiren D. Patel and
Sandeep K. Shukla
Virginia Tech
Rajesh Gupta
University of California
29.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-1
29.2 Characteristics of NES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-2
Functionality and Constraints • Distributed Nature • Usability, Dependability, and Availability
29.3 Examples of NES. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-5
Automobile: Safety-Critical Versus Telematics • Data Acquisition: Precision Agriculture and Habitat Monitoring • Defense Applications: Battle-Space Surveillance • Biomedical Applications • Disaster Management
29.4 Design Considerations for NES. . . . . . . . . . . . . . . . . . . . . . . . . 29-8
29.5 System Engineering and Engineering Trade-Offs
in NES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-10
Hardware • Software
29.6 Design Methodologies and Tools . . . . . . . . . . . . . . . . . . . . . . . 29-13
29.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-15
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29-16
29.1 Introduction
Rapid advances in microelectronic technology, coupled with the integration of microelectronic radios on the
same board or even on the same chip, have been a powerful driver of the proliferation of a new breed of
Networked Embedded Systems (NESs) over the last decade. NES are distributed computing devices with
wireline and/or wireless communication interfaces embedded in a myriad of products such as automobiles,
medical components, sensor networks, consumer products, and personal mobile devices. These systems
have been variously referred to as EmNets (Embedded Network Systems), NEST (Networked Embedded
System Technology), and NES [1–3].
NES are often distributed embedded systems that must interact not only with the environment and
the user, but also with each other to coordinate computing and communication. And yet, these devices
must often operate in very constrained environments related to their size, energy availability, network
connectivity, etc. The challenges posed by the design and deployment of NES have captured the imagination
of a large number of researchers and galvanized whole new communities into action. The design of
NES requires multidisciplinary, multilevel cooperation and development to address the diverse hardware
(processor cores, radios, security cores) and software (applications, middleware, operating systems, networking
protocols) needs. In this chapter, we briefly highlight some of the design concerns and challenges
in deploying NES.
Some examples of NES are wireless data acquisition systems such as habitat [4–7], agriculture and
weather monitoring [8], disaster management and civil monitoring, Cooperative Engagement Capability
(CEC) [9] for military use [10], fabric e-textiles [11,12], and consumer products such as cell phones and
Personal Digital Assistants (PDAs). Common to all these systems/applications is their ability to provide
interaction between the environment and humans through a medium of devices, such as sensors for data
collection, processors to perform computation on the data, and remote storage devices to preserve
and collate the information.
A good exposition of the characteristics, parameters, examples, and design challenges of NES is
presented in Reference 1. We draw heavily on this book for material and examples. Similar surveys
and expositions of challenges in the applications, design, and implementation of sensor networks are
given in References 13–17.
The rest of this chapter is organized as follows: in Section 29.2, we describe the characteristics of NES,
followed by some examples of such systems in Section 29.3. Based on these examples and characteristics,
we delve into the design considerations for NES in Section 29.4. In Section 29.5, we explore the engineering
trade-offs in designing and deploying NES. Finally, we discuss the design methodologies and design
tools available for designing NES in Section 29.6 and conclude the chapter with a discussion.
29.2 Characteristics of NES
The realm of possibilities where NES applications can be implemented makes characterizing these systems
an inherently difficult task. However, we attempt to characterize the basic functionality and
constraints, distributed nature, and usability, dependability, and availability of such systems. Then, we
describe NES through some examples.
29.2.1 Functionality and Constraints
Networked embedded systems are typically designed to interact with and react to the environment and
people around them. Thus, NES often have sensors that measure temperature, moisture, movement, light,
and so on. By definition, NES have a communication mechanism, either a wireline connection or a
wireless radio. Also, they typically have computation engines that can perform at least a minimal amount of
computing on the data they acquire.
The environment and user needs place constraints on NES such as small size, low weight, harsh
working conditions, safety and reliability concerns, low cost, and poor resource availability in terms of
low computational ability, and low energy availability (limited battery) [18]. NES devices have to be small
in size so that their deployment does not interfere with the environment; that is, they must function
almost invisibly to the environment. For example, animals must not be aware of the habitat monitoring
sensors that are embedded on them or around them. This example also demonstrates the need for these
systems to be low weight and be able to work under harsh conditions, that is, be tolerant of temperature
changes, physical abuse, vibration, shock, and corrosion. Since NES are frequently deployed in the eld
with little or no access to renewable energy sources, they have to live off a limited energy source or battery.
Owing to real-time and mission-critical requirements, NES have to frequently meet safety and reliability
constraints. For example, the cruise control, antilock braking, and airbag systems in automobiles have to
respond within the given real-time constraints to meet safety requirements. The small form factor and
wide distribution of NES also place a constraint on cost; price fluctuations of even a few cents on each
device have a big impact as the volume of deployed devices increases.
FIGURE 29.1 U.C. Berkeley NES device: the MICA processor and radio board. (From M. Horton, D. Culler,
K. Pister, J. Hill, R. Szewczyk, and Alec Woo. Sensors Magazine, 19, 2002. With permission.)
Figure 29.1 shows the Berkeley NES device that consists of a MICA processor and a radio board [19].
As technology advances, these devices are becoming smaller. In fact, the latest Berkeley mote is as
small as a coin. This has led to the notion of smart dust or a massively distributed sensor network that
is self-contained, networked, and provides multiple sensor and coordinated computational capabilities
[7,20–22].
29.2.2 Distributed Nature
The application spectrum of NES often means that these systems are physically distributed. In fact, the
distributed nature of NES extends to distributed functionality and communication as well. Distributed
in function refers to NES components that perform specific roles and work together with other NES
components to complete the system. Automotive electronics is a good example, where many different
function-specic components work in unison, such as the power control modules, the engine, airbag
deployment, cruise control, suspension, etc. Figure 29.2 shows the components from Kyocera that are
widely used in automotive electronics. Similarly, distributed communication refers to local and global
communication between the embedded systems distributed throughout the system. For example,
automotive systems have local wires from actuators/sensors to ECUs (Electronic Control Units) and global
wires/buses between ECUs [24].
29.2.3 Usability, Dependability, and Availability
Networked embedded systems are becoming an increasingly dominant part of a number of devices and
systems in all aspects of our daily lives, from entertainment, transportation, and personal communications to
biomedical devices.
The pervasiveness of these systems, however, raises concerns about dependability and availability.
Availability generally means access to the system. In scenarios where some of the components of the NES
fail, there must be a mechanism through which the users can interface and interact with the components
to investigate and rectify the problems. Access media can include PDAs, wired serial access points,
infrared technology, etc. Another dimension of availability is the long life expected from NES components.
Often, NES do not have access to a renewable energy source, and sometimes it is not possible to change
the battery either. For example, sensors deployed to measure traffic on roads or sensor tags placed
on animals are inaccessible or difficult to reach after deployment.
Dependability or reliability is also a major concern that goes hand in hand with the availability requirement.
The system must guarantee a certain level of service that the user can depend on. For example,
temperature sensors and smoke detectors are critical to the fire safety requirements of any building.
Availability and dependability characteristics are especially crucial to safety-critical systems such as
avionics and biomedical applications. The sensors used in an airplane to monitor cabin pressure, oxygen
FIGURE 29.2 Automotive electronic components by Kyocera. (From Kyocera Website http://global.kyocera.com/
application/automotive/auto_elec/. With permission.)
levels, elevation, relative speed, etc. are all important to maintain safety. For example, the release of oxygen
into the oxygen masks in airplanes is controlled via sensors that monitor the cabin oxygen levels. For
biomedical applications such as electronic heart pacemakers, the need for dependability is obvious since
malfunctioning components can be life threatening. Military personnel monitoring is another example
of a biomedical application of NES, where devices are used to transmit the location and vital statistics
of the personnel.
Since NES consist of diverse hardware and software components that interact with each other, component
interoperability becomes another important concern [1]. Today, cars have tens or sometimes even
a hundred embedded computing systems that are designed and manufactured by different contractors.
Such complex, distributed, interacting embedded systems raise difficult challenges in system integration,
component interoperability, and system testing and validation.
These NES characteristics present new challenges and constraints for systems engineers that have not
been fully addressed in the design of past networking and distributed systems. Software and hardware
tools and techniques are required to satisfy the need for low cost, low power, Quality-of-Service (QoS)
guarantees, and fast time-to-market. Formalization of methodologies to ensure functional correctness and
efficiency of design is paramount in meeting time-to-market requirements by reducing the number of iterations
in the design process.
In the following sections, we discuss some of the interesting applications of NES and then describe NES
tools and methodologies, such as programming languages, simulation environments, and performance
measurement tools, designed specifically to address the design challenges posed by NES.
29.3 Examples of NES
To demonstrate the characteristics, constraints, and design challenges of NES, we present several examples
from current and future NES. These examples are representative of the diverse application domains where
NES can be found. We start with three examples from Reference 1.
29.3.1 Automobile: Safety-Critical Versus Telematics
Cars today typically have tens to a hundred microprocessors controlling all aspects of the automobile,
from entertainment systems to the emergency airbag release mechanisms. Figure 29.3 shows some of
Communication
Infrared rays
Power electronics
Control tech.
Fuel cell
Super - conduction
Monitoring system
Radar tech.
Optical devices
Semiconductor
Computer and telephone
FIGURE 29.3 Telematics components in an automobile. (From Mitsubishi electric.
http://www.mitsubishielectric.ca/automotive/. With permission.)
2006 by Taylor & Francis Group, LLC
29-6 Embedded Systems Handbook
the telematic components in a Mitsubishi car [25]. The microprocessors in charge of the functionality fre-
quently communicate and interact with other processors. For example, the stereo volume is automatically
reduced when the driver receives (or answers) a call on his or her cell phone.
Thus, a range of devices that perform different tasks are beginning to be organized in sophisticated networks as distributed systems. Broadly speaking, there are two such distributed systems: safety-critical processing systems and telematics systems. Clearly, the safety-critical aspects cannot be sacrificed or compromised in any way. These two systems are an integral part of the design and construction of the automobile and dictate several design parameters. Since automobiles have a time-to-market that can span up to five years from concept to final product, the technology used for the telematics and safety-critical components is frequently already outdated by the time the automobile ships. This is a rising concern, especially for the safety-critical components, because upgrading or swapping out components is generally not performed and usually not even feasible. Note that systems that cannot be upgraded or altered after final production are considered to be closed systems.
Conversely, open systems allow newer components with more capabilities and features to be plugged in, similar to a plug-and-play environment. Thus, to make automobiles open systems, we have to develop technologies that enable automobile designers to construct the safety-critical and telematics systems in an abstracted manner, such that components with a standardized communication protocol and interface can simply be plugged into the final product. This would resolve the disparity between the long design cycles of automobiles and the rapid advances in the NES components used in them.
The increasing popularity of wireless technology has spawned interesting applications for automobiles. For example, the OnStar system from General Motors can monitor the location of a car, and customer service staff can remotely unlock the car, detect when airbags have been deployed, and so on. Wireless communication opens up many possibilities, such as automobile service requests and data collection for automobile users, dealers, and manufacturers.
29.3.2 Data Acquisition: Precision Agriculture and Habitat Monitoring
The use of sensor nodes for data acquisition is becoming a useful tool for agricultural and habitat monitoring. In Reference 7, the authors present a study in which they used wireless sensor networks in a real-world habitat monitoring project. The small footprint and weight of modern sensor nodes make them attractive for habitat and agricultural monitoring, since they cause minimal disturbance to the animals, plant population, and other natural elements of the habitats being monitored. This solution also automates some of the menial tasks, such as data collection, for researchers.
Precision agriculture is an important area of research where NES technology is likely to have a big impact [1,26,27]. Precision agriculture envisages sensor deployment to monitor and manage crop productivity, quality, and growth. Besides increased productivity, better crop quality and crop management are also key benefits of using NES in precision agriculture.
Crop management involves monitoring and adjusting the levels of fertilizer, pesticides, and water for particular areas, resulting in better yields with less pollution, fewer emissions, and lower costs. Automating these functions requires behavior that adapts to changing surroundings, such as water levels when it rains, or pesticide use in seasons when pests are more common. This adaptation is an integral aspect of precision agriculture. While there exist models that dictate the necessary amounts of fertilizer, water, and nutrient combinations, these models are not always accurate for a specific locale. So NES can also perform on-the-side data acquisition for the purpose of reconstructing appropriate models and recalibrating or reconfiguring the sensor metrics accordingly, to better suit the specific climate and locale (Figure 29.4).
Feedback into such systems is crucial to realizing truly automated precision agriculture. Fine-grained tuning of crop management can be done automatically based on these regularly updated models, which can also be monitored by researchers. Manual adjustments, however, require appropriate interfaces between the deployed NES and the end-user attempting to make the change.
Once again, wireless interfaces can be used for such manual fine-tuning.

FIGURE 29.4 System architecture of NES for common data acquisition scenarios. (From David Culler et al. Wireless sensor networks for habitat monitoring. ACM, 2002. With permission.)

Configuration and management of the network can be handled remotely via handheld devices or even desktops. A practical example of
such a deployment is described in Reference 27, and wireless sensor networks for precision agriculture in Reference 26. A similar two-tiered network coupling wireless and wired networks has been proposed for structural monitoring of civil structures that can be affected by natural disasters such as earthquakes [28]. The small size and wireless nature of NES sensor nodes enable field researchers to deploy these sensors in small and sensitive locations.
29.3.3 Defense Applications: Battle-Space Surveillance
Network embedded systems are projected to become crucial for future defense applications, particularly
in battle-space surveillance and condition and location monitoring of vehicles, equipment, and personnel.
A military application called Cooperative Engagement Capability (CEC), developed by Raytheon Systems
[9,29], acts as a force multiplier for naval air and missile defense systems by distributing sensor and
weapons data to multiple CEC ships and airborne units. Data from each unit is distributed to other CEC
units, after which the data is filtered and combined via an identical algorithm in each unit to construct a common radar-like aerial picture for missile engagements.
DARPA has funded research in several areas of defense systems under the aegis of its Future Combat Systems program [30]. Manipulating battle environments and critical threat identification are projected uses of NES in such systems. Manipulating battle environments refers to controlling the opposition by detecting their presence and either altering their route or constraining their advance. Threat identification involves identifying a threat early for force protection. A force-protection scenario involves deploying sensors around a perimeter that requires protection, so that forced entry can be identified and automated responses such as alarms can be triggered by a given event.
System deployment used to be a concern because sensors were bulky and required manual deployment. With advances in technology, however, entire sensor networks can now be deployed by airdrop, by personnel, or even via artillery. The small sizes have also enabled NES to be deployed for monitoring vehicles, much like the automobile example discussed earlier.
A relatively new technology called e-textiles has emerged, whereby sensors or other computation devices are integrated into wearable material [11]. Nakad et al. [11] are investigating the communication requirements between the sensing nodes of an e-textile and the computing elements embedded with them. One key and obvious application of e-textiles is data acquisition for human monitoring, where sensor nodes can be used to track the location and vital statistics of military personnel.
29.3.4 Biomedical Applications
A civilian application of e-textiles is monitoring people's health, particularly that of the elderly. A sensor node embedded in an e-textile worn by patients with heart problems can automatically alert doctors or emergency services when the patient suffers heart failure. We have already seen the value of heart pacemakers in helping millions of people around the world maintain a regular heartbeat. Work is in progress to make sensors small and body-friendly enough that they can either be surgically inserted or swallowed for temporary monitoring. These devices can be used to monitor, diagnose, and even correct anomalies in the health of a patient. Of course, surgical insertion or ingestion of microelectronic devices raises several concerns about safety and the body's ability to adapt to foreign objects, which are active areas of research.
29.3.5 Disaster Management
Scenarios that involve disaster management can be seen as data acquisition applications in which certain information is gathered, based on which a response is computed and performed. A good example of an implemented scenario is provided in Reference 31, where remote villages are monitored by four sensors measuring seismic activity and water levels for earthquakes and floods, respectively. These sensors are connected over wireless links to the nearest emergency rescue stations, signaling emergency events when the thresholds for maximum water level or seismic activity are crossed. As mentioned earlier, Kottapalli et al. [28] propose a sensor network for structural monitoring of civil structures in case of natural disasters such as earthquakes. Other applications of disaster management systems include severe cold (or heat) monitoring, fire monitoring (smoke detectors, heat sensors), volcano monitoring, etc.
29.4 Design Considerations for NES
The examples presented in Section 29.3 give an idea of the breadth of the application domains in which NES can be deployed. By studying these examples, we understand the various requirements, constraints, issues, and concerns involved in developing these kinds of systems. Furthermore, as NES proliferate, their true potential will be realized when they are deployed at a massive scale, on the order of thousands or more components. Such a large-scale deployment, however, raises some problems [13,15–17]:
Deployment. Deployment refers to the physical distribution of the nodes in the NES. The first concerns for deployment are safety, durability, and sturdiness: if devices are dropped from the air, they should not cause damage to other objects (people, animals, plants, or material) while landing, and should not be damaged themselves either. This is clearly important for defense applications, where surveillance sensors may be airdropped into the battle space.
Several deployment strategies are available; they can be classified as either random or strategic deployment. As the name suggests, random deployment refers to deploying NES nodes in an arbitrary fashion in the field. Random deployment is useful when the region being monitored is not accessible for precise placement of sensors [32]. The problem then becomes one of determining region coverage, and possibly redeploying or moving nodes to improve coverage. Strategic deployment refers to placing NES nodes at well-planned points so that coverage is maximized, or placing nodes in a small field of concentration such that they are not easily subject to natural damage (e.g., habitat monitoring).
The number of NES nodes deployed must be factored into the trade-off between cost and the performance/quality of monitoring: some nodes are bound to be destroyed by one means or another, so there should be sufficient reserves, or fault tolerance, in the network for monitoring to continue.
Environment interaction. NES components often need to interact with the environment without human intervention. Thus, a requirement of NES is the ability to work on their own, perhaps with a feedback loop so that nodes can adapt to changes in the environment (failure of nodes, movement of objects) and continue functioning correctly. Systems such as those used in precision agriculture and chemical and hazardous gas monitoring are designed to interact with and react to changes in the system. In agriculture, for example, the release of water can be tied to the moisture content in the air.
Life expectancy of nodes. As discussed earlier, an essential requirement for nodes in a NES is a long life expectancy, because once deployed, it is very difficult to access the nodes and replace their batteries. These nodes must also withstand environmental challenges, such as inclement weather, and the unexpected loss of nodes to animal interaction or component failure. Thus, a whole body of work has gone into identifying node failures and subsequently reconfiguring the network to provide some amount of fault tolerance [33].
Communication protocol between devices. A combination of wired and wireless links can be used to establish a NES. Furthermore, the nodes in the network may be stationary or mobile. Mobile nodes bring in a whole range of issues related to dynamic neighbor discovery, dynamic routing, etc. The NES should also be able to reconfigure itself to tolerate the loss of nodes from a communication point of view: if a node that serves as a relay point fails or dies, the network should be able to use other nodes for relaying instead.
Reconfigurability. In many scenarios it is not possible to physically reach nodes, yet NES frequently require nodes to be reconfigured after deployment. This may be to add, remove, or change functionality, or to adjust its parameters. For example, handheld devices or even desktops may be used to reconfigure nodes to fine-tune certain aspects of the system, such as increasing the water level in precision agriculture when the weather report predicts a sudden heat wave over the following few days [26,27].
Security. NES, particularly those that use wireless communication, are prone to malicious attack [34]. This is most evident in military equipment, where communication has to be secure from enemy eavesdropping. Security in handheld devices is also becoming an increasing concern with their widespread use in office environments for everything from checking email to exchanging sensitive documents and data. Running security protocols is computationally expensive, and hence power hungry, and several researchers are proposing ways to reduce these power requirements for sensor networks and handheld devices [35–37].
Energy constraints. The small form factor, low weight, and the deployment of NES nodes in inaccessible and remote regions imply that these nodes have access only to a limited, nonrenewable energy source. Thus, one major focus of the research community is to develop networking protocols, applications, operating
systems, etc. (besides devices) that are energy efficient and utilize robust, high-throughput, but low power communication schemes [13,17].
Operating system. Special or optimized operating systems are needed owing to the stringent hardware constraints (small form factor, limited energy source, limited memory space) and strict application requirements (real-time constraints, adaptability). Several Real-Time Operating Systems (RTOSs) have been proposed for embedded devices, such as eCos [38], LynxOS from LynuxWorks [39], the QNX RTOS [40], etc.
Adequate design methodologies. Standard design methodologies and design flows have to be modified, or new ones created, to address the special needs of NES. For example, there is a need for design methodologies for low power system-on-a-chip implementations, to enable integration of the large number of diverse components that form a NES device [41].
29.5 System Engineering and Engineering Trade-Offs in NES
The design considerations presented in Section 29.4 raise opportunities for interesting trade-offs between the hardware and software components of NES. Whereas area, power, and weight constraints limit the amount of hardware that can be put into a NES node, integration, debugging, and complexity issues hinder increased dependence on software.
29.5.1 Hardware
Rapid advances in silicon technology are ushering in an era of widespread use of smart dust, or very small sensor nodes with reasonably complex computational and communication abilities [20–22]. Besides their small size, these nodes are low power and carry a variety of actuators and sensors, along with radio/wireless communication devices and processors for computation. This enables the nodes to move beyond being mere data acquisition sensors that send their data to a central server. They can now also act as computation points that first collate and process the data before sending it to a server, or even coordinate computation among themselves, independent of a central server.
The power and area constraints on NES nodes mean that general-purpose microprocessors cannot be used in them. However, low power Application Specific Instruction Processors (ASIPs) augmented with Application Specific Integrated Circuits (ASICs) can provide the necessary computational ability at relatively low power. Whereas the ASIPs are easily programmable, the ASICs can be used to execute computationally expensive and/or time sensitive portions of applications. For example, target identification in defense systems and airbag release mechanisms in cars require ASICs to meet their timing and computational needs.
In fact, Henkel and Li [42] and Brodersen et al. [1] have shown that custom-made processors consume less power than general-purpose processors. The reason is that with custom chips, parallelism can be exploited effectively to reduce power consumption. Also, hardwiring the execution of each function eliminates the need for instruction storage and decoding, reducing power further.
On the other hand, applications such as habitat monitoring and precision agriculture do not have high timing or computational requirements, so generic microprocessors or ASIPs can be used. The compromise is speed and computational ability versus programmability. ASICs have high design and manufacturing costs and are inflexible compared with programmable processors: a change in applications or protocols leads to a large redesign effort. Programmable processors, on the other hand, can be reused for several generations of an application (provided computational requirements do not increase).
Reconfigurable hardware such as Field Programmable Gate Arrays (FPGAs) provides a middle path between programmable processors and hardwired ASICs. As the name suggests, FPGAs can be reprogrammed after being deployed in the field and hence provide the flexibility of microprocessors along with much of the hardwired speed of ASICs. In fact, FPGAs can be configured at runtime, as suggested by Nitsch and Kebschull [43]. They propose storing the functional behavior and structure of applications in an XML format; when a client wants to execute an application, the XML is analyzed and the appropriate mapping onto the FPGA is performed. The drawbacks of FPGAs are that they require large chip area and have a low clock frequency.
29.5.2 Software
Small memory devices and low computational power limit the size and complexity of the software that can run on NES nodes. Porting commonly used operating systems and applications to NES is difficult because of the limitations posed by the hardware. Hence, software development for NES nodes is another challenge that embedded systems designers have to overcome.
The Tiny microthreading operating system (TinyOS) [44] has been proposed to address the unique characteristics of NES nodes. TinyOS is a component-based, highly configurable embedded operating system with a small footprint. It has a highly efficient multithreading engine and maintains a two-level First In First Out (FIFO) scheduler. TinyOS consists of a set of interconnected modular components, each with tasks, events, and command handlers associated with it (Figure 29.5). Tasks are the processing units of components; they can signal events, issue commands, and execute other tasks. Each component is allocated a static area of memory to hold the state information of the thread associated with it. TinyOS does not provide dynamic memory allocation, owing to the restrictions imposed by the hardware. The component-based structure allows TinyOS to be a highly application-specific operating system that can be configured by altering configuration files (.comp and .desc files).
Volgyesi and Ledeczi [2] provide a model-based approach to the development of applications based on TinyOS. They present a graphical environment, called GRATIS, through which the application and operating system components are automatically glued together to produce an application. GRATIS
provides automatic code generation capability and a graphical user interface to construct the .comp and .desc files automatically, thus simplifying the task of component description and wiring for building TinyOS-based applications [2]. This increases design productivity and reconfigurability.

FIGURE 29.5 Events and commands in TinyOS. (From P. Volgyesi and A. Ledeczi. Component-based development of networked embedded applications, Vanderbilt Publication, 2002. With permission.)
Another effort to provide operating system functionality for sensor nodes is the development of a programming language framework called nesC [41]. nesC provides a programming paradigm based on event-driven execution, flexible concurrency, and component-based design. TinyOS is the prime example of the language in use: it has been employed to develop this commonly used operating system for sensor networks. The language successfully integrates concurrency, reactivity to the environment, and communication.
The distributed nature of NES means that these systems are inherently concurrent. For example, data processing and event arrival are two processes that need to execute concurrently on a NES node, and concurrency management has to ensure that race conditions do not occur. In the emergency airbag release system, for instance, the sensor needs to sense the impact while also reacting to it based on processing of the collected data. These types of real-time demands, along with the small size and low cost of NES nodes, make concurrency management a challenging task.
nesC addresses these issues by drawing upon several existing language concepts. Its three main contributions are: the definition of a component model, an expressive concurrency model, and program analysis to improve reliability and reduce code size. The component model supports event-driven systems such as sensor nodes, with bidirectional interfaces to ease event communication. It also provides flexible hardware/software boundaries and avoids dynamic component instantiation and the use of virtual functions. The concurrency model is tied to compile-time analysis, yielding data-race detection at compile time while still allowing comprehensive concurrent behaviors on NES nodes. Reducing code size and improving reliability are natural goals for any programming language.
TinyOS influenced the design of nesC through the specific features of the operating system. First, TinyOS provides a collection of reusable system components, well suited to component-based architectures. The channel interface connecting components is called the wiring specification, and it is independent of the specific implementation of each component. Tasks and events are inherent in TinyOS: tasks are nonpreemptive computation mechanisms, and events are similar to tasks except that they can preempt another task or event. This event/task concurrency scheme in TinyOS is closely mirrored in nesC's event-driven, expressive concurrency model.
Components in nesC are of either module or configuration type: the former consist of application code, and the latter provide interfaces for communication between components. Modules are written in C-style code, and a top-level configuration is used to wire components together. This resembles the VHDL component/architecture scheme, in which components are defined and the architecture is the top-level model that connects signals between them. The component-based architecture brings flexibility to application implementations and allows users to write highly concurrent programs for very small platforms with limited physical resources. With the aid of nesC and graphical configuration tools such as GRATIS, the construction of dedicated operating systems based on TinyOS is gradually becoming easier [2]. These tools allow designers to build their own operating system with relative ease, but application-specific functionality still requires implementation at the programming level.
Another area of NES software that has received considerable attention is network protocols [14,17]. Power and energy constraints in NES nodes necessitate efficient network protocols for the transmission of sensed data and intermediary communication. Sensor networks fall into two broad classes: proactive and reactive. A proactive network, as the word suggests, periodically sends the sensed attribute to the data collection location, or base station. The period is known a priori, allowing the sensors to migrate to their idle, sleep, or off modes to conserve energy. Applications that require periodic monitoring are best suited to this type of sensor network. Low Energy Adaptive Clustering Hierarchy (LEACH) [45] is one of the many proposed proactive protocols.
Reactive networks, on the other hand, continuously sense the environment but transmit data to the base station only upon sensing that the attribute has exceeded a specified threshold. This type of network is useful for time critical data, since the user or base station receives the sensitive information immediately. One such time sensitive protocol for reactive systems, proposed recently, is the Threshold-sensitive Energy Efficient sensor Network (TEEN) protocol [46].
Hybrid networks constitute a third type of sensor network: a combination of proactive and reactive networks that attempts to overcome the drawbacks of the other two [47]. In hybrid networks, sensor nodes send sensed data at periodic intervals and also whenever the sensed data exceeds the set threshold. The periodic interval is generally longer than those found in proactive networks, so that the functionality of both network types can be incorporated in one. Furthermore, hybrid systems can also be made to work in proactive-only or reactive-only modes.
29.6 Design Methodologies and Tools
Deploying large-scale distributed NES is inherently a complex and error-prone task. Designers of such systems rely on system-level modeling and simulation tools during the initial architectural definition phase: first for design space exploration, to come up with candidate architectures that satisfy the constraints and requirements, and then to verify the functionality of the system.
Design verification, from the highest levels of abstraction down to the final implementation, is an important concern with any complex system. With distributed systems, this need becomes even more acute owing to the inability of a system designer to foresee all possible events and sequences of events that may occur.
Several tools have been developed for simulating NES at the highest level of abstraction, as communicating networks built from basic network models. Network Simulator 2 (NS-2), OPNET, SensorSim, and NESLsim are popular network simulation tools widely used in the community [48–51]. SensorSim and NESLsim are simulation frameworks designed specifically for sensor networks.
SensorSim closely ties sensor networks to their power considerations by taking a two-pronged approach to model construction. The first prong is a sensor functional model that represents the software functions of the sensor, consisting of the network protocol stack, middleware, user applications, and the sensor protocol stack. The second prong is a power model that simulates hardware abstractions such as the CPU and radio module, on which the sensor functional model executes. The architecture of the SensorSim simulator is shown in Figure 29.6.
FIGURE 29.6 SensorSim architecture. (From S. Park, A. Savvides, and M. Srivastava. Sensorsim: a simulation
framework for sensor networks, ACM, 2000. With permission.)
In this two-pronged model, the sensor functional model dictates the execution of tasks to the power model, and the two models work in parallel with each other. An added feature is the sensor channel, which allows sensing devices to detect events. The sensor channel exposes external signals to the sensor modules, such as microphones, infrared detectors, etc. The signals transmitted through this channel can take any available form, such as infrared light, or sound waves for microphones. Every type of signal has different characteristics depending on the medium through which it travels, and the primary goal of the sensor channel is to simulate these characteristics accurately and to detect and monitor the events in a sensor network.
The use of a power model in SensorSim follows from the importance placed on designing low power NES devices [52]. Efficient power control is a basic requirement for the longevity of these devices. The basis of the power model is that there is a single power supplier, the battery, and all other components or models are energy consumers (as shown in Figure 29.6). Consumers such as the CPU model and radio model drain energy from the battery through events.
An attractive feature of SensorSim is its capability to perform hybrid simulations: SensorSim can behave as a network emulator and interact with real external components such as network nodes and user applications. However, network emulation for sensor networks differs from traditional network emulation. The large number and speed of input/output events in sensor networks mandates readjusting the real-time delays for events and reordering them, making the implementation of an emulator for such networks a much more difficult task.
SensorSim enables reprogramming the sensor channel to monitor external inputs, thus using real inputs instead of models for these channels. For example, instead of modeling waves traveling through a wired (e.g., coaxial) cable, a microphone can be connected through a sensor channel, sending waveforms over a wired link to the simulator.
NESLsim is another modeling framework for sensor networks, based on the PARSEC (Parallel
Simulation Environment for Complex Systems) simulator [51]. NESLsim abstracts a sensor node into
two entities: the node entity and the radio entity. The node entity is responsible for computation tasks
such as scheduling, traffic monitoring, and congestion control, whereas the radio entity maintains
communication between the sensor nodes in the NES. A third entity that is not part of the sensor node
is the channel entity, which models the wireless medium through which communication takes place.
NS-2 is an open-source, C++-based discrete event simulator developed by the Virtual InterNetwork
Testbed (VINT) collaborative research project at the University of Southern California and the
University of California, Berkeley. It provides substantial support for simulating routing and multicast
protocols in the TCP/IP networking stack (IP, TCP, UDP, etc.) over both wired and wireless channels.
OPNET performs similar tasks but is proprietary software developed by OPNET Technologies.
SystemC [53,54] is a system-level description language developed to model both the hardware and soft-
ware of a behavioral specification. Drago et al. [55] developed a methodology that combines the simulation
environments of NS-2 and SystemC to simulate and test the functionality of NES. They promote the use
of NS-2 for modeling the network topology and communication infrastructure, and SystemC for repres-
enting and simulating the hardware/software components of the embedded system. Using NS-2 relieves
the designer of writing detailed high-level network protocols that already exist and are available for
simulation in the network simulator, whereas SystemC allows modeling and simulation of embedded
system implementations. In unison, these simulation frameworks preserve simulation integrity and reduce
the modeling effort with an admissible degradation in simulation performance.
Integration of simulators is regarded as a valuable resource for system designers. However, to perform
such a link between simulators, the underlying development platforms must be similar. For example,
both NS-2 and SystemC are built on an underlying C++ framework. The basic simulation paradigm in
NS-2 and SystemC is also similar. NS-2 is a discrete event-driven simulator whose scheduler runs by
selecting the next event, executing it to completion, and looping back to execute the next event.
Similarly, the SystemC simulator also has a discrete-event kernel in which
Design Issues for Networked Embedded Systems 29-15
processes are executed and signals are updated at clocked transitions, following the evaluate-update
paradigm. Drago et al. [55] use a shared memory queue to pass tokens and packets for communication
between the NS-2 kernel and the SystemC kernel.
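The run-to-completion scheduling loop common to both kernels can be sketched as follows (illustrative Python; the real NS-2 and SystemC kernels are C++ and considerably more involved, and the names here are our own):

```python
import heapq

class EventScheduler:
    """Minimal discrete-event kernel: pop the earliest event,
    run its handler to completion, then loop back for the next event."""
    def __init__(self):
        self.now = 0.0
        self._queue = []   # (timestamp, sequence, handler)
        self._seq = 0      # tie-breaker for same-time events

    def schedule(self, delay, handler):
        heapq.heappush(self._queue, (self.now + delay, self._seq, handler))
        self._seq += 1

    def run(self):
        while self._queue:
            time, _, handler = heapq.heappop(self._queue)
            self.now = time      # advance simulated time to the event
            handler(self)        # execute the event to completion


log = []
sched = EventScheduler()

def send_packet(s):
    log.append(("send", s.now))
    s.schedule(2.5, recv_packet)   # packet arrives 2.5 time units later

def recv_packet(s):
    log.append(("recv", s.now))

sched.schedule(1.0, send_packet)
sched.run()
```

Because both kernels advance time only between events, coupling them through a shared queue amounts to agreeing on whose pending event is earliest before either side runs it.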
System-level design methodologies for embedded systems are based on a hardware–software codesign
approach [56–59]. Hardware–software codesign is a methodology in which the hardware and software are
designed and developed concurrently and in collaboration, leading to a more efficient and optimized
implementation of applications.
Ramanathan et al. [3,60] present a timing-driven design methodology for NES and explore the need for
temporal correctness when designing these systems. Determining the temporal correctness of system
models is difficult because these models are not cycle accurate and usually have no notion of the hardware
and/or software implementation. Determining the correctness of timing constraints after the hardware
has been manufactured naturally leads to costly redesign iterations for both the hardware and software
subsystems. Instead, the authors propose a solution whereby they specify, explore, and exploit temporal
information in high-level network models, bridging the gap between requirement analysis and system
design. At each stage of design refinement, timing information modeled in the higher-level network
models trickles down to finer, lower-level models. The authors used NS-2 as their modeling framework.
This timing-driven design methodology uses high-level network models generated using a rate deriv-
ation technique called RADHA-RATAN [60,61]. RADHA-RATAN works on generalized task graphs
whose nodes represent functionality or tasks and whose edges are asynchronous, unidirectional com-
munication channels between producers and consumers. RADHA-RATAN is a collection of algorithms
that generate timing budgets for the task graph nodes based on their preset execution (firing) and data
production and consumption rates.
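The rate-derivation idea can be illustrated with a toy sketch (Python; the graph representation and the propagation rule are our own simplification, not RADHA-RATAN's actual algorithms): given a source task's firing rate and per-edge token production/consumption counts, the firing rates of downstream tasks follow by propagation.

```python
def derive_rates(edges, source, source_rate):
    """Propagate firing rates through an acyclic task graph.
    edges: list of (producer, consumer, tokens_produced_per_firing,
                    tokens_consumed_per_firing).
    A consumer fires at rate = producer_rate * produced / consumed."""
    rates = {source: source_rate}
    changed = True
    while changed:                 # simple fixpoint over the DAG
        changed = False
        for prod, cons, p, c in edges:
            if prod in rates and cons not in rates:
                rates[cons] = rates[prod] * p / c
                changed = True
    return rates

# Invented example: a sensor task fires 100 times/s and produces one
# token per firing; a filter consumes 4 tokens per firing; a logger
# consumes 5 filter tokens per firing.
edges = [
    ("sensor", "filter", 1, 4),
    ("filter", "logger", 1, 5),
]
rates = derive_rates(edges, "sensor", 100.0)
```

Here the filter must fire 25 times/s and the logger 5 times/s; inverting such derived rates gives each task a per-firing timing budget.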
Along with RADHA-RATAN, network-level models at the highest level are used to represent function-
alities such as routing, congestion, and QoS. The designer can specify distributions, protocols, and similar
settings to generate the network graph in NS-2. The nodes of the network graph simulate the transfer of
packets and tokens among themselves, which enables testing of the protocols' functionality. Further
refinement of this high-level network graph by an experienced designer results in network subsystems
that capture timing requirements. In a process known as timing-driven task structuring, the designer
can then mutate the task graph until the desired timing behavior is achieved, after which partitioning
of hardware and software can be performed. The disconnect between requirement analysis and system
design is circumvented by this mutation, allowing the NS-2 modeling paradigm to be combined with
formal timing analysis techniques to provide a methodology whereby timing requirements seep from
high-level network models to low-level synthesis models.
Simulation and timing analysis are one part of the puzzle in hardware–software codesign. Automated
hardware synthesis and software synthesis and compilation techniques are the next step in generating
implementations from the system-level models. To this end, Gupta et al. [62,63] have proposed the SPARK
parallelizing high-level synthesis framework, which performs automated synthesis of behavioral descriptions
specified in C into synthesizable register-transfer-level VHDL. This framework can then be used in a system-
level codesign methodology to implement an application on a core-based platform target [64].
Such hardware–software codesign methodologies are crucial for the design and development of NES.
Automated or semiautomated methodologies are less error-prone, lead to faster time-to-market, and can
help realize hardware–software trade-offs and design exploration that may not be obvious in large systems.
29.7 Conclusions
The increasing interest in NES is quite timely, as evidenced by the several important application areas where
such systems can be used. The development process for these applications and systems remains ad hoc.
We need methodologies and tools that provide system designers with the flexibility and capability
to quickly construct efficient and optimized system designs. In this chapter, we have examined the range
of design and verification challenges faced by NES designers. Among existing solutions, we have
presented techniques that promise to increase the efficiency of the design process by raising the level of
design abstraction and by enhancing the scope of system models. These include design tools such as GRATIS
and nesC for operating system configuration, and NS-2, SystemC, SPARK, NESLsim, and SensorSim for
simulation, synthesis, and codesign of embedded systems. Several open research problems remain.
Further reducing device size and power and increasing device speed remain important objectives.
There is a need for distributed applications, along with middleware and operating system support and
support for network protocols for distributed, coordinated collaboration. Continued progress in these
technologies will fulfill the promise of NES as ubiquitous computing systems.
References
[1] R.W. Brodersen, A.P. Chandrakasan, and S. Cheng. Low-power CMOS digital design. IEEE
Journal of Solid-State Circuits, 27(4): 473–484, 1992.
[2] P. Volgyesi and A. Ledeczi. Component-based development of networked embedded applications.
In Proceedings of EuroMicro, 2002.
[3] D. Ramanathan, R. Jejurikar, and R. Gupta. Timing-driven co-design of networked embedded
systems. In Proceedings of ASPDAC, 2000, pp. 117–122.
[4] H. Wang, J. Elson, L. Girod, D. Estrin, and K. Yao. Target classification and localization in
habitat monitoring. In Proceedings of IEEE International Conference on Acoustics, Speech, and Signal
Processing (ICASSP 2003), 2003.
[5] S. Simic and S. Sastry. Distributed environmental monitoring using random sensor networks.
In Proceedings of the Second International Workshop, IPSN 2003, 2003.
[6] B. West, P. Flikkema, T. Sisk, and G. Koch. Wireless sensor networks for dense spatio-temporal
monitoring of the environment: a case for integrated circuit, system and network design.
In Proceedings of the IEEE CAS Workshop on Wireless Communications and Networking, 2001.
[7] A. Mainwaring, J. Polastre, R. Szewczyk, D. Culler, and J. Anderson. Wireless sensor networks for
habitat monitoring. In Proceedings of WSNA'02, 2002.
[8] P. Flikkema and B. West. Wireless sensor networks: from the laboratory to the field. In National
Conference for Digital Government Research, 2002.
[9] Cooperative engagement capability. http://www.fas.org/man/dod-101/sys/ship/weaps/cec.htm
[10] T. He, B.M. Blum, J.A. Stankovic, and T.F. Abdelzaher. AIDA: adaptive application-independent
data aggregation in wireless sensor networks. ACM Transactions on Embedded Computing
Systems (TECS), Special Issue on Dynamically Adaptable Embedded Systems, 3(2): 426–457, 2004.
[11] Z. Nakad, M. Jones, and T. Martin. Communications in electronic textile systems. In Proceedings
of the International Conference on Communications in Computing (CIC), 2003.
[12] D. Meoli and T.M. Plumlee. Interactive electronic textile. Journal of Textile and Apparel, Technology
and Management, 2: 1–12, 2002.
[13] D. Estrin, R. Govindan, J. Heidemann, and S. Kumar. Next century challenges: scalable
coordination in sensor networks. In Proceedings of the International Conference on Mobile
Computing and Networking (MobiCom), 1999.
[14] D. Estrin, A. Sayeed, and M. Srivastava. Wireless sensor networks. In Proceedings of the International
Conference on Mobile Computing and Networking (MobiCom), 2002.
[15] J. Kahn, R. Katz, and K. Pister. Emerging challenges: mobile networking for smart dust. Journal
of Communication Networks, 2: 188–196, 2000.
[16] I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. A survey of wireless sensor networks.
IEEE Communications Magazine, 38(4): 393–422, August 2002.
[17] C.E. Jones, K.M. Sivalingam, P. Agrawal, and J.C. Chen. A survey of energy-efficient network
protocols for wireless networks. Wireless Networks, 7: 343–358, 2001.
[18] P. Koopman. Embedded system design issues (the rest of the story). In Proceedings of the
International Conference on Computer Design, 1996.
[19] M. Horton, D. Culler, K. Pister, J. Hill, R. Szewczyk, and A. Woo. Mica: the commercialization
of microsensor motes. Sensors Magazine, 19(4): 40–48, 2002.
[20] K.S.J. Pister, J.M. Kahn, and B.E. Boser. Smart dust: wireless networks of millimeter-scale sensor
nodes. Technical report, Highlight Article in 1999 Electronics Research Laboratory Research
Summary, 1999.
[21] J.M. Kahn, R.H. Katz, and K.S.J. Pister. Mobile networking for smart dust. In Proceedings of the
ACM/IEEE International Conference on Mobile Computing and Networking (MobiCom), 1999.
[22] Crossbow: smarter sensors in silicon. http://www.xbow.com/
[23] Kyocera. http://global.kyocera.com/application/automotive/auto_elec/
[24] G. Leen and D. Heffernan. Expanding automotive electronic systems. IEEE Computer, 35:
88–93, 2002.
[25] Mitsubishi Electric. http://www.mitsubishielectric.ca/automotive/
[26] Y. Li and R. Wang. Precision agriculture: smart farm stations. IEEE 802 Plenary Meeting Tutorials.
[27] Board on Agriculture and Natural Resources. Precision Agriculture in the 21st Century: Geospatial
and Information Technologies in Crop Management. National Academy Press, Washington, 1998.
[28] V. Kottapalli, A. Kiremidjian, J. Lynch, E. Carryer, T. Kenny, K. Law, and Y. Lei. Two-tiered wireless
sensor network architecture for structural health monitoring. In Proceedings of the SPIE's 10th
Annual International Symposium on Smart Structures and Materials, 2003.
[29] Raytheon Systems Co. http://www.raytheon.com/
[30] DARPA. http://www.darpa.mil/fcs/index.html
[31] N. Sarwabhotla and S. Seetharamaiah. Intelligent disaster management system for remote villages
in India. In Development by Design, Bangalore, India, 2002.
[32] T. Clouqueur, V. Phipatanasuphorn, P. Ramanathan, and K. Saluja. Sensor deployment strategy
for target detection. In Proceedings of WSNA 02, 2002.
[33] F. Koushanfar, M. Potkonjak, and A. Sangiovanni-Vincentelli. Fault tolerance in wireless ad hoc
sensor networks. In Proceedings of the IEEE International Conference on Sensors, 2002.
[34] C. Karlof and D. Wagner. Secure routing in wireless sensor networks: attacks and counter-
measures. In Proceedings of the IEEE International Workshop on Sensor Network Protocols and
Applications, 2003.
[35] A. Perrig, R. Szewczyk, J.D. Tygar, V. Wen, and D.E. Culler. SPINS: security protocols for sensor
networks. Wireless Networks, 8(5): 521–534, 2002.
[36] H. Cam, S. Ozdemir, D. Muthuavinashiappan, and P. Nair. Energy-efficient security
protocol for wireless sensor networks. In Proceedings of the IEEE VTC Fall 2003 Conference, 2003.
[37] N.R. Potlapally, S. Ravi, A. Raghunathan, and N.K. Jha. Analyzing the energy consumption of
security protocols. In Proceedings of the International Symposium on Low Power Electronics and
Design, 2003.
[38] eCos: open-source real-time operating system for embedded systems. http://sources.redhat.com/ecos/
[39] LynxOS real-time operating system for embedded systems. http://www.lynuxworks.com/
[40] QNX real-time operating system for embedded systems. http://www.qnx.com/
[41] D. Gay, P. Levis, R. von Behren, M. Welsh, E. Brewer, and D. Culler. The nesC language: a holistic
approach to networked embedded systems. In Proceedings of the ACM SIGPLAN 2003 Conference
on Programming Language Design and Implementation, 2003.
[42] J. Henkel and Y. Li. Energy-conscious HW/SW partitioning of embedded systems: a case study
on an MPEG-2 encoder. In Proceedings of the International Workshop on Hardware/Software
Codesign, 1998.
[43] C. Nitsch and U. Kebschull. The use of runtime configuration capabilities for network embedded
systems. In Proceedings of Design, Automation and Test in Europe Conference and Exhibition, 2002.
[44] J. Hill, R. Szewczyk, A. Woo, S. Hollar, D. Culler, and K. Pister. System architecture directions
for networked sensors. In Proceedings of the International Conference on Architectural Support for
Programming Languages and Operating Systems, 2000.
[45] W. Ye, J. Heidemann, and D. Estrin. An energy-efficient MAC protocol for wireless sensor networks.
In Proceedings of INFOCOM 2002: The 21st Annual Joint Conference of the IEEE Computer and
Communications Societies, 2002.
[46] A. Manjeshwar and D.P. Agrawal. TEEN: a protocol for enhanced efficiency in wireless sensor
networks. In Proceedings of the International Workshop on Parallel and Distributed Computing
Issues in Wireless Networks and Mobile Computing, 2001.
[47] A. Manjeshwar and D.P. Agrawal. APTEEN: a hybrid protocol for efficient routing and comprehensive
information retrieval in wireless sensor networks. In Proceedings of the International Workshop on
Parallel and Distributed Computing Issues in Wireless Networks and Mobile Computing, 2002.
[48] OPNET. http://www.opnet.com/
[49] Network simulator 2. http://www.isi.edu/nsnam/ns/
[50] SensorSim. http://nesl.ee.ucla.edu/projects/sensorsim/
[51] NESLsim. http://www.ee.ucla.edu/saurabh/NESLsim/
[52] S. Park, A. Savvides, and M. Srivastava. SensorSim: a simulation framework for sensor networks.
In Proceedings of MSWiM 2000, 2000.
[53] R.K. Gupta and S.Y. Liao. Using a programming language for digital system design. IEEE Design
and Test of Computers, 14(2): 72–80, April 1997.
[54] SystemC. http://www.systemc.org
[55] N. Drago, F. Fummi, and M. Poncino. Modeling network embedded systems with NS-2 and
SystemC. In Proceedings of ICCSC: Circuits and Systems for Communication, 2002, pp. 240–245.
[56] R.K. Gupta and G. De Micheli. Hardware–software cosynthesis for digital systems. IEEE Design
and Test of Computers, 10(3): 29–41, July 1993.
[57] G. De Micheli and R.K. Gupta. Hardware/software co-design. Proceedings of the IEEE, 85: 349–365, 1997.
[58] R. Ernst and J. Henkel. Hardware–software codesign of embedded controllers based on hardware
extraction. In Proceedings of the International Workshop on Hardware/Software Codesign, 1992.
[59] J. Henkel and R. Ernst. A hardware–software partitioner using a dynamically determined
granularity. In Proceedings of the Design Automation Conference, 1997.
[60] A. Dasdan, D. Ramanathan, and R.K. Gupta. A timing-driven design and validation methodology
for embedded real-time systems. ACM Transactions on Design Automation of Electronic Systems,
3: 533–553, 1998.
[61] A. Dasdan, D. Ramanathan, and R.K. Gupta. Rate derivation and its applications to reactive,
real-time embedded systems. In Proceedings of the Design Automation Conference, 1998.
[62] S. Gupta, R.K. Gupta, N.D. Dutt, and A. Nicolau. SPARK: A Parallelizing Approach to the High-Level
Synthesis of Digital Circuits. Kluwer Academic Publishers, Dordrecht, 2004.
[63] S. Gupta, N.D. Dutt, R.K. Gupta, and A. Nicolau. SPARK: a high-level synthesis framework for
applying parallelizing compiler transformations. In Proceedings of the International Conference on
VLSI Design, 2003.
[64] M. Luthra, S. Gupta, N.D. Dutt, R.K. Gupta, and A. Nicolau. Interface synthesis using memory
mapping for an FPGA platform. In Proceedings of the International Conference on Computer Design,
October 2003.
30
Middleware Design and Implementation for Networked Embedded Systems
Venkita Subramonian
and Christopher Gill
Washington University
30.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30-1
Multiple Design Dimensions • Networked Embedded Systems
Middleware • Example Application: Ping-Node Scheduling
for Active Damage Detection • Engineering Life-Cycle •
Middleware Design and Implementation Challenges
30.2 Middleware Solution Space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30-5
30.3 ORB Middleware for Networked Embedded
Systems – A Case Study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30-7
Message Formats • Object Adapter • Message Flow
Architecture • Time-Triggered Dispatching • Priority
Propagation • Simulation Support
30.4 Design Recommendations and Trade-Offs . . . . . . . . . . . . 30-12
30.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30-13
30.6 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30-13
Acknowledgments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30-14
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30-14
30.1 Introduction
Networked embedded systems support a wide variety of applications, ranging from temperature
monitoring to battlefield strategy planning [1]. Systems in this domain are characterized by the following
properties:
1. Highly connected networks.
2. Numerous memory-constrained end-systems.
3. Stringent timeliness requirements.
4. Adaptive online reconfiguration of computation and communication policies and mechanisms.
This work was supported in part by the DARPA NEST (contract F33615-01-C-1898) and PCES (contract
F33615-03-C-4111) programs.
Networked embedded systems challenge assumptions about resource availability and scale made by
classical approaches to distributed computing, and thus represent an active research area with many open
questions. For example, advances in Micro-Electro-Mechanical Systems (MEMS) hardware technology
have made it possible to move software closer to physical sensors and actuators to make more intelligent use
of their capabilities. To realize this possibility, however, new networked embedded systems technologies
are needed. For example, hardware infrastructure for such systems may consist of a network of hundreds
or even thousands of small microcontrollers, each closely associated with local sensors and actuators.
30.1.1 Multiple Design Dimensions
The following four dimensions drive the design choices for development of many networked embedded
systems:
1. Temporal predictability
2. Distribution
3. Feature richness
4. Memory constraints
There is often a contravariant relationship between some of these design forces. For example, the left
side of Figure 30.1 illustrates that feature richness may suffer when footprint is reduced. Similarly, a real-
time embedded system's temporal performance must be maintained even when more or fewer features
are supported, as illustrated by the right side of Figure 30.1.
Significant research has gone into each of these individual design dimensions and has resulted in a
wide range of products and technologies. Research on the Embedded Machine [2] and Kokyu [3] mainly
addresses the real-time dimension. The CORBA Event service [4], Real-time Publish/Subscribe [5], and
Distributable Threads [6] provide alternative programming models that support both one-to-many and
one-to-one communication and hence address the distribution dimension. Small-footprint middleware
is the main focus of eORB [7] and UCI-Core [8]. TAO [9] and ORBexpress RT [10] are general-purpose
CORBA implementations that provide real-time and distribution features for a wide variety of application
domains.
30.1.2 Networked Embedded Systems Middleware
General-purpose middleware is increasingly taking the role that operating systems held three decades ago.
Middleware based on standards such as CORBA [11], EJB [12], COM [13], and Java RMI [14] now caters to
the requirements of a broad range of distributed applications such as banking transactions [15,16], online
stock trading [17], and avionics mission computing [18]. Different kinds of general-purpose middleware
have thus become key enabling technologies for a variety of distributed applications.
FIGURE 30.1 Features, footprint, and performance.
To meet the needs of diverse applications, general-purpose middleware solutions have tended to support
a breadth of features. In large-scale applications, layers of middleware have been added to provide different
kinds of services [18].
However, simply adding features breaks down for certain kinds of applications. In particular, features are
rarely innocuous in applications with requirements for real-time performance or small memory footprint.
Instead, every feature of an application and its supporting middleware is likely either to contribute to or
detract from the application in those dimensions. Therefore, careful selection of features is crucial for
memory-constrained and real-time networked embedded systems.
As middleware is applied to a wider range of networked embedded systems, a fundamental tension
between breadth of applicability and customization to the needs of each application becomes increasingly
apparent. To resolve this tension, special-purpose middleware must be designed to address the following
two design forces:
1. The middleware should provide common abstractions that can be reused across different
applications in the same domain.
2. It should then be possible to make fine-grained modifications to tailor the middleware to the
requirements of each specific application.
In the following section, we describe a motivating example application and the design constraints it
imposes. In Section 30.1.4, we describe additional design constraints imposed by the engineering life-cycle
for this application.
30.1.3 Example Application: Ping-Node Scheduling for
Active Damage Detection
To illustrate how application domain constraints drive the design of special-purpose middleware, we now
describe a next-generation aerospace application [19], in which a number of MEMS sensor/actuator nodes
are mounted on a surface of a physical structure, such as an aircraft wing. The physical structure may
be damaged during operation, and the goal of this application is to detect such damage when it occurs.
Vibration sensor/actuator nodes are arranged in a mesh with (wired or wireless) network connectivity
to a xed number of neighboring nodes. To detect possible damage, selected actuators called ping nodes
generate vibrations that propagate across the surface of the physical structure. Sensors within a dened
neighborhood can then detect possible damage near their locations by measuring the frequencies and
strengths of these induced vibrations. The sensors convey their data to other nodes in the system, which
aggregate data from multiple sensors, process the data to detect damage, and issue alerts or initiate
mitigating actions accordingly.
Three restrictions on the system make the problem of damage detection difficult. First, the
sensor/actuator nodes are resource-constrained. Second, two vibrations whose strengths are above a
certain threshold at a given sensor location will interfere with each other. Third, sensor/actuator nodes
may malfunction over time. These constraints, therefore, require that the actions of two overlapping ping
nodes be synchronized so that no interfering vibrations will be generated at a sensor location at any time.
This damage detection problem can be captured by a constraint model. Scheduling the activities of
the ping nodes can be formulated as a distributed graph coloring problem. A color in the graph coloring
problem corresponds to a specic time slot in which a ping node vibrates. Thus two adjacent nodes in the
graph, each representing an actuator, cannot have the same color since the vibrations from these actuators
would then interfere with each other. The number of colors is therefore the length (in distinct time slots)
of a schedule. The problem is to find a shortest schedule such that the ping nodes do not interfere with
one another, in order to minimize damage detection and response times. Distributed algorithms [20] have
been shown to be effective for solving the distributed constraint satisfaction problem in such large-scale
and dynamic¹ networks.
¹For example, with occasional reconfiguration due to sensor/actuator failures online.
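A centralized greedy coloring gives the flavor of the problem (illustrative Python; the actual system uses distributed constraint-satisfaction algorithms [20], and the mesh below is invented): adjacent ping nodes, whose vibrations would interfere, must receive different time slots, and the number of distinct slots used is the schedule length.

```python
def greedy_color(adjacency):
    """Assign each ping node the smallest time slot (color) not used
    by any already-colored neighbor. adjacency: node -> set of nodes."""
    slots = {}
    for node in sorted(adjacency):        # deterministic visiting order
        taken = {slots[n] for n in adjacency[node] if n in slots}
        slot = 0
        while slot in taken:
            slot += 1
        slots[node] = slot
    return slots

# Invented 2x2 mesh of ping nodes; edges connect interfering neighbors.
mesh = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A", "D"},
    "D": {"B", "C"},
}
slots = greedy_color(mesh)
schedule_length = max(slots.values()) + 1   # distinct time slots needed
```

For this mesh, the diagonal pairs {A, D} and {B, C} can ping simultaneously, so two time slots suffice; minimizing the slot count in general is the (NP-hard) graph coloring problem.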
30.1.4 Engineering Life-Cycle
Large-scale networked embedded systems are often expensive and time consuming to develop, deploy,
and test. Allowing separate development and testing of the middleware and the target system hardware
can reduce development costs and cycle times. However, this separation imposes additional design and
implementation challenges for special-purpose middleware.
For example, to gauge performance of the distributed ping-scheduling algorithm in the actual system,
physical, computational, and communication processes must be simulated for hundreds of nodes at
once. For physical processes, tools such as Matlab or Simulink must be integrated within the simulation
environment. Computation should be performed using the actual software that will be deployed in the
target system. However, that software may be run on signicantly different, and often fewer, actual
end-systems in the simulation environment than in the target system. Similarly, communication in the
simulation environment will often occur over conventional networks, such as switched Ethernet, which
may not be representative of the target system's network.
The following issues must be addressed in the design and implementation of middleware that is suitable
for both the simulation and target system environments:
We need to use as much of the software that will be used in the target system as possible in the
simulation environment. This helps us to obtain relatively faithful metrics about the application
and middleware that will be integrated with the target system.
We need to allow arbitrary configurations for the simulation. The hardware and software configuration
may be different for each machine used to run the simulation, and different kinds and
numbers of target system nodes may be simulated on each machine.
Simple time scaling will not work since it does not guarantee that the nodes are synchronized. First,
it is not practical to require that all the computation and communication times are known a priori,
since one function of the simulation may be to gauge those times. Moreover, even if we could scale
the time to a safe upper bound, the wall-clock time it takes to run the simulation would likely be
prohibitively large.
Because of the heterogeneous configuration of the simulation environment, some simulated nodes
might run faster than others, leading to causal inconsistencies in the simulation [21,22].
Additional infrastructure is thus necessary to encapsulate the heterogeneity of different simulation
environments and to simulate real-time performance on top of general-purpose operating systems
and networks, with simulation of physical processes in the loop.
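One standard remedy for such causal inconsistencies is to order events by logical rather than wall-clock time. A minimal Lamport-clock sketch (Python; illustrative only, not the infrastructure described in this chapter) shows how a message can never be processed "before" it was sent, even when hosts run at different speeds:

```python
class LamportNode:
    """Logical clock per simulated node: ticks on local events, and on
    receive jumps past the sender's timestamp, preserving causal order."""
    def __init__(self, name):
        self.name = name
        self.clock = 0

    def local_event(self):
        self.clock += 1
        return self.clock

    def send(self):
        self.clock += 1
        return self.clock            # timestamp carried by the message

    def receive(self, msg_timestamp):
        # Jump past the sender's timestamp, then tick for the receive.
        self.clock = max(self.clock, msg_timestamp) + 1
        return self.clock


fast = LamportNode("fast")           # node simulated on a fast host
slow = LamportNode("slow")           # node simulated on a slow host

for _ in range(10):                  # the fast host races ahead locally
    fast.local_event()
ts = fast.send()                     # fast sends at logical time 11
recv_time = slow.receive(ts)         # slow's clock jumps to 12
```

The slow node's clock is forced past the sender's timestamp on delivery, so the receive is causally ordered after the send regardless of how far the hosts' physical clocks have drifted apart.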
30.1.5 Middleware Design and Implementation Challenges
To facilitate exchanges of information between nodes as part of the distributed algorithm, a middleware
framework that provides common services, such as remote object method invocation, is needed. Two key
factors that motivate the development of ORB (Object Request Broker)-style middleware for networked
embedded systems are (1) remote communication and (2) location independence.
Remote communication: Even though a fixed physical topology may connect a group of sensor/actuator
components, the logical grouping of these components may not strictly follow the physical grouping.
Location independence: The behavior of communicating components should be independent of their
location to the extent possible. True location independence may not be achievable in all cases,
for example, due to timing constraints or explicit coupling to physical sensors or actuators. However,
the implementation of object functionality should be decoupled from the question of whether it
accesses other objects remotely or locally where appropriate. The programming model provided
to the object developer should thus provide a common programming abstraction for both remote
and local access.
In summary, the key challenges we faced in the design and implementation of special-purpose
middleware to address the application domain constraints described in Sections 30.1.3 and 30.1.4 are to:
Reuse existing infrastructure: We want to avoid developing new middleware from scratch. Rather, we
want to reuse prebuilt infrastructure to the extent possible.
Provide real-time assurances: The performance of middleware itself must be predictable to allow
application-level predictability.
Provide a robust DOC middleware: We chose the DOC communication paradigm since it offers direct
communication among remote and local components, thus increasing location independence.
Reduce middleware footprint: The target for this middleware is memory-constrained embedded
microcontroller nodes.
Support simulation environments: Simulations should be done with the same application software and
middleware intended for deployment on the target. The middleware should also be able to deal
with heterogeneous simulation testbeds, that is, different processor speeds, memory resources, etc.
30.2 Middleware Solution Space
General-purpose CORBA implementations, such as TAO [23], offer feature sets that are determined
a priori. Furthermore, faithful implementation of the entire CORBA standard increases the number of
features supported by ORBs and hence results in increased footprint for the application. In the case of
memory-constrained networked embedded applications, this can become prohibitively expensive.
We instead want to get only the features that we need. The selection of features for our special-purpose
middleware implementation was strictly driven by the unique requirements of the application domain.
Two approaches to developing special-purpose middleware must then be considered:
Top-down: Subdividing existing general-purpose middleware frameworks, for example, TAO [9].
Bottom-up: Composing special-purpose middleware from lower-level infrastructure, for example,
ACE [24].
Both approaches seek to balance reuse of features with customization to application-specific requirements.
The top-down approach is preferred when the number and kinds of features required are close to those
offered by a general-purpose middleware implementation. In this case, the policy and mechanism
options provided by the general-purpose middleware can be adjusted to fit the requirements of the application.
In general, this has been the approach used to create and refine features for real-time performance in TAO.
On the other hand, if the number or kinds of middleware features required differ significantly from
those available in general-purpose middleware, as is the case with many networked embedded systems
applications, then a bottom-up approach is preferable. This preference rests largely on the observation that,
in our experience, lower-level infrastructure abstractions are less interdependent and thus more easily
decoupled than higher-level ones. It is therefore easier to achieve highly customized solutions by composing
middleware from primitive infrastructure elements [25,26] than by trying to extract the appropriate subset
directly from a general-purpose middleware implementation.
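To make the bottom-up approach concrete, consider composing a service from one primitive infrastructure element: a Reactor-style event demultiplexer, the kind of low-level abstraction ACE provides. Because the primitive carries no CORBA baggage, only the features actually needed end up in the composition. The sketch below is illustrative (Python for brevity, not ACE's real C++ API):

```python
import select
import socket

class Reactor:
    """Demultiplexes socket-readiness events to registered handlers."""
    def __init__(self):
        self._handlers = {}   # fd -> (socket, callback)

    def register(self, sock, callback):
        self._handlers[sock.fileno()] = (sock, callback)

    def handle_events(self, timeout=0.1):
        socks = [s for s, _ in self._handlers.values()]
        readable, _, _ = select.select(socks, [], [], timeout)
        for sock in readable:
            _, callback = self._handlers[sock.fileno()]
            callback(sock)

def make_echo_pair():
    """Compose a tiny echo service from the reactor primitive alone."""
    a, b = socket.socketpair()
    reactor = Reactor()
    reactor.register(b, lambda s: s.sendall(s.recv(1024)))
    return a, reactor
```

A custom ORB core would register its transport endpoints with such a reactor and layer request demarshaling and dispatching on top, pulling in only the mechanisms the application domain demands.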
Modern software development relies heavily on reuse. Given a problem and a space of possible solutions,
we first try to see whether the problem can be solved directly from an existing solution to a similar problem.
Taking this view, we compared the challenges described in Section 30.1.5 to existing middleware solutions,
as shown in Table 30.1.
TAO [9,23] and eORB [27,28] appeared to be the most suitable candidate solutions based on the
requirements of our target application described in Section 30.1.3. TAO is a widely used standards-compliant
ORB built using the Adaptive Communication Environment (ACE) framework [24,29]. In addition
to a predictable and optimized [30,31] ORB core [32], protocols [33,34], and dispatching [35,36]
infrastructure, TAO offers a variety of higher-level services [37,38]. eORB [7] is a commercial CORBA
ORB developed for embedded systems, especially in the telecommunications domain.
[TABLE 30.1 Mapping challenges to candidate middleware solutions; only partially recoverable here. Legible rows include: reduced middleware footprint — UCI-Core, eORB; simulated real-time behavior — TAO?, Kokyu?]
[FIGURE 30.2 Reuse from existing frameworks: nORB builds on ACE (network programming primitives, patterns, portability), Kokyu (dispatching model, real-time QoS assurance, priority lanes), TAO (IDL compilation strategies, ORB concurrency patterns, ORB core mechanisms), and UCI-Core (minimum ORB feature set).]
The Time-Triggered Architecture (TTA) [54] is designed for fault-tolerant distributed real-time systems.
Within the TTA, all system activities are initiated by the progression of a globally synchronized time base.
This stands in contrast to event-driven systems, in which system activity is triggered by events. The Time-
Triggered Message-Triggered Object (TMO) [55,56] architecture facilitates the design and development of
real-time systems with syntactically simple but semantically powerful extensions of conventional object-
oriented real-time approaches.
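The distinction between time-triggered and event-driven activation can be illustrated with a toy dispatch table: in a time-triggered design, which task runs is a pure function of the (globally synchronized) time base, never of event arrival order. The task names below are hypothetical:

```python
# A static cyclic schedule: slot index -> task, repeated forever.
# In a real TTA system the slot boundaries come from the globally
# synchronized time base; here a tick counter stands in for it.
SCHEDULE = ["read_sensors", "run_control", "send_frame", "idle"]

def task_for_tick(tick: int) -> str:
    """Purely a function of time: no queues, no event arrival order."""
    return SCHEDULE[tick % len(SCHEDULE)]

def trace(n_ticks: int):
    """The activation sequence for the first n_ticks slots."""
    return [task_for_tick(t) for t in range(n_ticks)]
```

Because every node derives the same schedule from the same time base, communication and computation are fully deterministic, which is the basis of the TTA's fault-tolerance arguments.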
30.6 Concluding Remarks
We have described how meeting the constraints of networked embedded systems requires careful analysis
of a representative application, as an essential tool for the development of the special-purpose middleware
itself. In addition, discovering which settings and features are best for an application requires careful
design a priori. It is therefore important to adopt an iterative approach to middleware development
that starts with specic application requirements and takes simulation and experimentation results into
consideration.
By integrating both real-time middleware dispatching and a virtual clock mechanism used for simulation
environments with distribution middleware features, we have shown how to develop special-purpose
middleware solutions that address multiple stages of a networked embedded systems engineering lifecycle.
We have also empirically verified [57] that with nORB the footprint of a statically linked executable
memory image for the ping-node-scheduling application was 30% of the footprint for the same application
built with TAO, while still retaining real-time performance similar to TAO's.
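A virtual clock of the kind mentioned above can be sketched as a logical time source that middleware timers consult instead of the hardware clock; in simulation, time is advanced explicitly, so heterogeneous testbed machines replay an identical timeline regardless of their real speed. This is an illustrative sketch (Python for brevity), not nORB's actual implementation:

```python
import heapq

class VirtualClock:
    """Logical time source shared by middleware timers in simulation."""
    def __init__(self):
        self.now = 0.0
        self._timers = []      # min-heap of (deadline, seq, callback)
        self._seq = 0          # tie-breaker so callbacks never compare

    def schedule(self, delay, callback):
        """Arm a timer relative to the current logical time."""
        heapq.heappush(self._timers, (self.now + delay, self._seq, callback))
        self._seq += 1

    def advance(self, dt):
        """Advance logical time, firing expired timers in deadline order."""
        end = self.now + dt
        while self._timers and self._timers[0][0] <= end:
            deadline, _, callback = heapq.heappop(self._timers)
            self.now = deadline
            callback()
        self.now = end
```

The same application and middleware code runs unchanged on the target, where the clock object simply delegates to the hardware timer.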
Acknowledgments
We gratefully acknowledge the support and guidance of the Boeing NEST OEP Principal Investigator
Dr. Kirby Keller and Boeing Middleware Principal Investigator Dr. Doug Stuart. We also wish to
thank Dr. Weixiong Zhang at Washington University in St. Louis for providing the initial algorithm
implementation used in ping scheduling.
References
[1] D. Estrin, D. Culler, K. Pister, and G. Sukhatme. Connecting the physical world with pervasive
networks. IEEE Pervasive Computing, 1: 59–69, 2002.
[2] T. Henzinger, C. Kirsch, R. Majumdar, and S. Matic. Time safety checking for embedded programs.
In Proceedings of the Second International Workshop on Embedded Software (EMSOFT). LNCS,
Springer-Verlag, Heidelberg, 2002.
[3] C.D. Gill, R. Cytron, and D.C. Schmidt. Middleware scheduling optimization techniques for
distributed real-time and embedded systems. In Proceedings of the Seventh Workshop on Object-
Oriented Real-Time Dependable Systems. IEEE, San Diego, CA, January 2002.
[4] T.H. Harrison, D.L. Levine, and D.C. Schmidt. The design and performance of a real-time CORBA
event service. In Proceedings of OOPSLA '97. ACM, Atlanta, GA, October 1997, pp. 184–199.
[5] D.C. Schmidt and C. O'Ryan. Patterns and performance of real-time publisher/subscriber
architectures. Journal of Systems and Software, Special Issue on Software Architecture Engineering
Quality Attributes, 66(3): 213–223, 2002.
[6] Y. Krishnamurthy, C. Gill, D.C. Schmidt, I. Pyarali, L.M.Y. Zhang, and S. Torri. The design and
implementation of real-time CORBA 2.0: dynamic scheduling in TAO. In Proceedings of the 10th
Real-Time and Embedded Technology and Applications Symposium (RTAS '04). IEEE, Toronto, Canada, May 2004.
[7] PrismTech. eORB. URL: http://www.prismtechnologies.com/English/Products/CORBA/eORB/
[8] Manuel Roman. Ubicore: Universally Interoperable Core. www.ubi-core.com/Documentation/
Universally_Interoperable_Core/universally_interoperable_core.html
[9] Institute for Software Integrated Systems. The ACE ORB (TAO), Vanderbilt University.
www.dre.vanderbilt.edu/TAO/
[10] O. Interface. ORBExpress, 2002. www.ois.com
[11] Object Management Group. The Common Object Request Broker: Architecture and Specification,
3.0.2 ed. December 2002. http://www.omg.org/technology/documents/formal/corba_iiop.htm
[12] Sun Microsystems. Enterprise JavaBeans Specification, August 2001. java.sun.com/products/ejb/
docs.html
[13] D. Rogerson. Inside COM. Microsoft Press, Redmond, WA, 1997.
[14] Sun Microsystems, Inc. Java Remote Method Invocation Specification (RMI), October 1998.
http://java.sun.com//j2se/1.3/docs/guide/rmi/spec/rmi-title.html
[15] L.R. David. Online banking and electronic bill presentment payment are cost effective. Published
online by Online Financial Innovations at www.onlinebankreport.com
[16] K. Kang, S. Son, and J. Stankovic. Star: secure real-time transaction processing with timeliness
guarantees. 23rd IEEE Real-Time Systems Symposium, Austin, Texas, 2002, pp. 3–12.
[17] X. Defago, K. Mazouni, and A. Schiper. Highly available trading system: experiments with CORBA.
IFIP International Conference on Distributed Systems Platforms and Open Distributed Processing
(Middleware '98), The Lake District, England, September 15–18, 1998.
[18] D. Corman. WSOA-Weapon systems open architecture demonstration using emerging open
system architecture standards to enable innovative techniques for time critical target (TCT)
prosecution. In Proceedings of the 20th IEEE/AIAA Digital Avionics Systems Conference (DASC),
October 2001.
[19] C. Gill, V. Subramonian, J. Parsons, H.-M. Huang, S. Torri, D. Niehaus, and D. Stuart. ORB
middleware evolution for networked embedded systems. In Proceedings of the Eighth International
Workshop on Object-Oriented Real-time Dependable Systems (WORDS '03). Guadalajara, Mexico,
January 2003.
[20] W. Zhang, G. Wang, and L. Wittenburg. Distributed stochastic search for constraint satisfaction and
optimization: parallelism, phase transitions and performance. In Proceedings of AAAI Workshop
on Probabilistic Approaches in Search, 2002.
[21] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Communications of
the ACM, 21(7): 558–565, 1978.
[22] Nancy A. Lynch. Distributed Algorithms. Morgan Kaufmann Publishers, Inc., San Mateo, California,
1996.
[23] D.C. Schmidt, D.L. Levine and S. Mungee. The Design of the TAO Real-Time Object Request
Broker. Computer Communications, 21(4): 294–324, 1998.
[24] Institute for Software Integrated Systems. The ADAPTIVE Communication Environment (ACE),
Vanderbilt University. www.dre.vanderbilt.edu/ACE/
[25] F. Hunleth, R. Cytron, and C. Gill. Building customizable middleware using aspect oriented
programming. In The OOPSLA 2001 Workshop on Advanced Separation of Concerns in Object-
Oriented Systems. ACM, Tampa Bay, FL, October 2001. www.cs.ubc.ca/kdvolder/Workshops/
OOPSLA2001/ASoC.html
[26] F. Hunleth and R.K. Cytron. Footprint and feature management using aspect-oriented
programming techniques. In Proceedings of the Joint Conference on Languages, Compilers and Tools for
Embedded Systems. ACM Press, 2002, pp. 38–45.
[27] S. Aslam-Mir. Experiences with real-time embedded CORBA in Telecom. In Proceedings of
the OMG's First Workshop on Real-time and Embedded Distributed Object Computing. Object
Management Group, Falls Church, VA, July 2000.
[28] J. Garon. Meeting performance and QoS requirements with embedded CORBA. In Proceedings
of the OMG's First Workshop on Embedded Object-based Systems. Object Management Group,
Santa Clara, CA, January 2001.
[29] D.C. Schmidt. ACE: an object-oriented framework for developing distributed applications. In
Proceedings of the USENIX C++ Technical Conference. USENIX Association, Cambridge, MA,
April 1994.
[30] I. Pyarali, C. O'Ryan, D.C. Schmidt, N. Wang, V. Kachroo, and A. Gokhale. Applying optimization
patterns to the design of real-time ORBs. In Proceedings of the Fifth Conference on Object-Oriented
Technologies and Systems. USENIX, San Diego, CA, May 1999, pp. 145–159.
[31] N. Wang, D.C. Schmidt, and S. Vinoski. Collocation optimizations for CORBA. C++ Report, 11,
47–52, 1999.
[32] D.C. Schmidt, S. Mungee, S. Flores-Gaitan, and A. Gokhale. Alleviating priority inversion and
non-determinism in real-time CORBA ORB core architectures. In Proceedings of the Fourth IEEE
Real-Time Technology and Applications Symposium. IEEE, Denver, CO, June 1998.
[33] A. Gokhale and D.C. Schmidt. Principles for optimizing CORBA internet inter-ORB protocol
performance. In Proceedings of the Hawaiian International Conference on System Sciences. Hawaii,
USA, January 1998.
[34] A. Gokhale and D.C. Schmidt. Optimizing a CORBA IIOP protocol engine for minimal footprint
multimedia systems. Journal on Selected Areas in Communications, Special Issue on Service Enabling
Platforms for Networked Multimedia Systems, 17: 1673–1699, 1999.
[35] A. Gokhale and D.C. Schmidt. Evaluating the performance of demultiplexing strategies for real-
time CORBA. In Proceedings of GLOBECOM '97. IEEE, Phoenix, AZ, November 1997.
[36] I. Pyarali, C. O'Ryan, and D.C. Schmidt. A pattern language for efficient, predictable, scalable,
and flexible dispatching mechanisms for distributed object computing middleware. In Proceedings
of the International Symposium on Object-Oriented Real-Time Distributed Computing (ISORC).
IEEE/IFIP, Newport Beach, CA, March 2000.
[37] T.H. Harrison, C. O'Ryan, D.L. Levine, and D.C. Schmidt. The design and performance of a
real-time CORBA event service. In Proceedings of the 12th ACM SIGPLAN Conference on Object-
Oriented Programming Systems, Languages, and Applications (OOPSLA '97), October 5–9, 1997,
Atlanta, Georgia.
[38] C.D. Gill, D.L. Levine, and D.C. Schmidt. The design and performance of a real-time CORBA
scheduling service. Real-Time Systems, The International Journal of Time-Critical Computing
Systems, Special Issue on Real-Time Middleware, 20: 117–154, 2001.
[39] C. Gill, D.C. Schmidt, and R. Cytron. Multi-paradigm scheduling for distributed real-time
embedded computing. Proceedings of the IEEE, Special Issue on Modeling and Design of Embedded Software, 91:
183–197, 2003.
[40] I. Pyarali and D.C. Schmidt. An overview of the CORBA portable object adapter. ACM
StandardView, 6: 30–43, 1998.
[41] M. Henning and S. Vinoski. Advanced CORBA Programming with C++. Addison-Wesley, Reading,
MA, 1999.
[42] D.C. Schmidt and C. Cleeland. Applying a pattern language to develop extensible ORB
middleware. In Design Patterns in Communications, L. Rising, Ed. Cambridge University Press, London,
2000.
[43] D.C. Schmidt, D.L. Levine, and C. Cleeland. Architectures and patterns for developing high-
performance, real-time ORB endsystems. In Advances in Computers, M. Zelkovitz, Ed., Academic
Press, New York, 1999.
[44] D.C. Schmidt and C.D. Cranor. Half-sync/half-async: an architectural pattern for efficient and
well-structured concurrent I/O. In Proceedings of the Second Annual Conference on the Pattern
Languages of Programs. Monticello, IL, September 1995, pp. 1–10.
[45] D.C. Schmidt, M. Stal, H. Rohnert, and F. Buschmann. Pattern-Oriented Software Architecture:
Patterns for Concurrent and Networked Objects, Vol. 2. John Wiley & Sons, New York, 2000.
[46] C. Liu and J. Layland. Scheduling algorithms for multiprogramming in a hard-real-time
environment. Journal of the ACM, 20, 46–61, 1973.
[47] D.B. Stewart and P.K. Khosla. Real-time scheduling of sensor-based control systems. In Real-
Time Programming, W. Halang and K. Ramamritham, Eds. Pergamon Press, Tarrytown,
NY, 1992.
[48] D.C. Schmidt, S. Mungee, S. Flores-Gaitan, and A. Gokhale. Software architectures for reducing
priority inversion and nondeterminism in real-time object request brokers. Journal of Real-
Time Systems, Special Issue on Real-Time Computing in the Age of the Web and the Internet,
21: 77–125, 2001.
[49] K.M. Chandy and L. Lamport. Distributed snapshots: determining global states of distributed
systems. ACM Transactions on Computer Systems, 3, 63–75, 1985.
[50] I. Pyarali, C. O'Ryan, D.C. Schmidt, N. Wang, V. Kachroo, and A. Gokhale. Using principle patterns
to optimize real-time ORBs. IEEE Concurrency Magazine, 8: 16–25, 2000.
[51] V. Subramonian and C. Gill. A generative programming framework for adaptive middleware.
In Proceedings of the Hawaii International Conference on System Sciences, Software Technology
Track, Adaptive and Evolvable Software Systems Minitrack, HICSS 2003. HICSS, Honolulu, HI,
January 2003.
[52] D. McKinnon, D. Bakken et al. A configurable middleware framework with multiple quality of
service properties for small embedded systems. In Proceedings of the Second IEEE International
Symposium on Network Computing and Applications. IEEE, April 2003.
[53] M. Roman, R.H. Campbell, and F. Kon. Reflective middleware: from your desk to your
hand. IEEE Distributed Systems Online, 2, 2001. http://csdl.computer.org/comp/megs/ds/2001/05/
o5001abs.htm
[54] H. Kopetz. Real-Time Systems: Design Principles for Distributed Embedded Applications. Kluwer
Academic Publishers, Norwell, MA, 1997.
[55] K. Kim. APIs enabling high-level real-time distributed object programming. IEEE Computer
Magazine, Special Issue on Object-Oriented Real-time Computing, 33(6), June 2000.
[56] K. Kim. Object structures for real-time systems and simulators. IEEE Computer Magazine, 30(8),
August 1997.
[57] V. Subramonian, G. Xing, C. Gill, C. Lu, and R. Cytron. Middleware specialization for memory-
constrained networked embedded systems. In Proceedings of the 10th IEEE Real-Time and
Embedded Technology and Applications Symposium (RTAS), 2004.
V
Sensor Networks
31 Introduction to Wireless Sensor Networks
S. Dulman, S. Chatterjea, and P. Havinga
32 Issues and Solutions in Wireless Sensor Networks
Ravi Musunuri, Shashidhar Gandham, and Maulin D. Patel
33 Architectures for Wireless Sensor Networks
S. Dulman, S. Chatterjea, T. Hoffmeijer, P. Havinga, and J. Hurink
34 Energy-Efcient Medium Access Control
Koen Langendoen and Gertjan Halkes
35 Overview of Time Synchronization Issues in Sensor Networks
Weilian Su
36 Distributed Localization Algorithms
Koen Langendoen and Niels Reijers
37 Routing in Sensor Networks
Shashidhar Gandham, Ravi Musunuri, and Udit Saxena
38 Distributed Signal Processing in Sensor Networks
Omid S. Jahromi and Parham Aarabi
39 Sensor Network Security
Guenter Schaefer
40 Software Development for Large-Scale Wireless Sensor Networks
Jan Blumenthal, Frank Golatowski, Marc Haase, and Matthias Handy
31
Introduction to
Wireless Sensor
Networks
S. Dulman,
S. Chatterjea, and
P. Havinga
University of Twente
31.1 The Third Era of Computing 31-1
31.2 What Are Wireless Sensor Networks? 31-2
31.3 Typical Scenarios and Applications 31-3
31.4 Design Challenges 31-5
Locally Available Resources • Diversity and Dynamics • Needed Algorithms • Dependability
31.5 Conclusions 31-9
References 31-9
Wireless Sensor Networks have gained a lot of attention lately. Due to technological advances, building
small-sized, energy-efficient, reliable devices, capable of communicating with each other and organizing
themselves in ad hoc networks, has become possible. These devices have brought a new perspective to
the world of computers as we know it: they can be embedded into the environment in such a way that the
user is unaware of them. There is no need for reconfiguration and maintenance, as the network organizes
itself to inform the users of the most relevant events detected or to assist them in their activity.
This chapter will give a brief overview of the whole area by introducing wireless sensor network
concepts to the reader. Then, a number of applications as well as possible typical scenarios will be presented
in order to better understand the field of application of this new emerging technology. Up to this moment,
several main areas of application have been identified. New areas of application are still to be discovered
as the research and products grow more mature.
Wireless sensor networks bring many challenges and often contradictory demands from the design
point of view. The last part of the chapter will be dedicated to highlighting the main directions of research
involved in this field. It will serve as a brief introduction to the problems to be described in the following
chapters of the book.
31.1 The Third Era of Computing
Things are changing continuously in the world of computers. Everything started with the mainframe
era: some 30 years ago, these huge devices were widely deployed, for example, within universities.
Lots of users made use of a single mainframe computer which they had to share among themselves. The
computation power came together with a high cost and a huge machine requiring a lot of maintenance.
Technology advanced as predicted by Moore's Law and we stepped into the second era of
computers. It is a period that is still present today, but which is slowly approaching its final part. It is the
era of the personal computers, cheaper and smaller, and increasingly affordable. Quite often, the average
user has access to and makes use of more than one computer, these machines being present now in almost
any home and workplace.
But in this familiar environment, things are starting to change and the third era of computing gains
more and more terrain each day. Let us take a look at the main trends today. The technology advancements
cause the personal computers to become smaller and smaller. The desktop computers tend to be replaced
by laptops and other portable devices.
The main factor influencing the new transition is the availability of wireless communication
technology. People are rapidly getting used to wireless communicating devices due to their independence
from fixed machines. The success and availability of the Internet brought even more independence to the
user: the data could now be available regardless of the physical location of its owner.
The advancements in technology did not stop here: the processors became small and cheap enough to
be found now in almost any familiar device around us, starting with an everyday watch and ending with
(almost) any home appliance we own. The new efforts nowadays are to make these devices talk to each
other and organize themselves into ad hoc networks to accomplish their design goal as fast and reliably as
possible.
This is, in fact, the third computer age envisioned two decades ago by Mark Weiser [1]. Several names,
such as ubiquitous computing, pervasive computing, ambient intelligence, invisible computing,
disappearing computer, etc., were created to indicate different aspects of the new computing age (Mark Weiser
himself defined it as the "calm technology" that recedes into the background of our lives).
The ubiquitous computing world brings a reversed view on the usage of computing power: instead of
having lots of users gathered around the mainframe computer, now, each user will be using the services of
several embedded networks. The user will be in the middle of the whole system, surrounded by an invisible
intelligent infrastructure. The original functionality of the objects and applications will be enhanced, and
a continuous interaction will be present in a large variety of areas of daily life.
31.2 What Are Wireless Sensor Networks?
So what are wireless sensor networks and where is their place in this new environment that starts growing
around us?
Wireless sensor networks is the generic name under which a broad range of devices hide. Basically, any
collection of devices equipped with a processor, having sensing and communication capabilities and being
able to organize themselves into a network created in an ad hoc manner falls into this category.
The addition of the wireless communication capabilities to sensors increased their functionality
dramatically. Wireless sensor networks bring monitoring capabilities that will forever change the way
in which data is collected from the ambient environment. Let us take, for example, the traditional
monitoring approach of a remote location for a given phenomenon, such as recording the geological
activity, monitoring the chemical or biological properties of a region, or even monitoring the weather at
a certain place.
The old approach was the following: rather big and robust devices needed to be built. They should
have contained, besides the sensor pack itself, a big power supply and local data storage capabilities.
A team of scientists would have to travel together to the destination to be monitored, place these
expensive devices at predened positions and calibrate all the sensors. Then, they would come back
after a certain amount of time in order to collect the sensed data. If, by misfortune, some hardware
failed, nothing could be done about it, and the information about the phenomenon itself would
be lost.
The new approach is to construct inexpensive, small-sized, energy-efficient sensing devices. As hundreds,
thousands, or even more of these devices will be deployed, the reliability constraints on each of them are
relaxed. No local data storage is needed anymore, as the nodes process data locally and then transmit
wirelessly the observed characteristics of the phenomenon to one or more access points connected to
a computer network. Individual calibration of each sensor node is no longer needed, as it can be performed
by localized algorithms [2]. Deployment also becomes easier, by randomly placing the nodes (e.g., simply
throwing them from a plane) onto the monitored region.
Having this example in mind, we can give a general description of a sensor node. The name sensor node
will be used to describe a tiny device that has short-range wireless communication capability, a small
processor, and several sensors attached to it. It may be powered by batteries, and its main function is to collect
data from a phenomenon, collaborate with its neighbors, and forward its observations (a preprocessed
version of the data, or even decisions) to the endpoint if requested. This is possible because its processor
additionally contains the code that enables internode communication and the setting up, maintenance, and
reconfiguration of the wireless network. When referring to wireless communication, we have in mind
mainly radio communication (other means such as ultrasound, visible or infrared light, etc., are also
being used [3]). A sensor network is a network made up of large numbers of sensor nodes. By a large
number we understand at this moment hundreds or thousands of nodes, but there are no exact limits for
the upper bound of the number of sensors deployed.
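The "preprocessed version of the data" forwarded by a node is typically an in-network aggregate: each node merges its own samples with the summaries received from neighbors and forwards only a compact summary toward the access point. A toy illustration (the summary format here is an assumption made for the example):

```python
def aggregate(own_readings, neighbor_summaries):
    """Combine local samples with neighbor summaries into one summary.

    Each summary is a (count, total) pair: the mean is recoverable at
    the sink, yet only two numbers travel per hop instead of every
    raw sample, saving radio energy.
    """
    count = len(own_readings)
    total = sum(own_readings)
    for n_count, n_total in neighbor_summaries:
        count += n_count
        total += n_total
    return count, total

def mean_at_sink(summary):
    """Recover the network-wide mean from the final summary."""
    count, total = summary
    return total / count
```

Richer aggregates (min/max, histograms, event detections) follow the same pattern: the node transmits a digest whose size is independent of the number of contributing sensors.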
Wireless sensor networks are one of the most important tools of the third era of computing. They are the
simplest intelligent devices around, their main purpose being monitoring the environment surrounding
us and alerting us of the main events happening. Based on the observations reported by these instruments,
humans and machines can make decisions and act on them.
31.3 Typical Scenarios and Applications
At this moment, a large variety of sensors exist. Sensors have been developed to monitor almost every aspect
of the ambient world: lighting conditions, temperature, humidity, pressure, the presence or absence of
various chemical or biological products, detection of presence and movement, etc. By networking large
numbers of sensors and deploying them inside the phenomenon to be studied, we obtain a sensing tool
far more powerful than any single sensor, one able to sense at a superior level.
A first classification of wireless sensor networks can be made based on the complexity of the networks
involved [4]:
Intelligent warehouse. Each item contained inside the warehouse will have a tag attached that will be
monitored by the sensor nodes embedded into the walls and shelves. Based on the data read, knowledge
of the spatial positioning of the sensors, and time information, the sensor network will offer information
about the traffic of goods inside the building, create automatic inventories, and even perform long-term
correlations on the data read. The need for manual product scanning will thus disappear.
In this category we can include the scenario of the modern supermarket, where the selected products of
the customers will automatically be identified at the exit of the supermarket. This scenario also has the
minimum complexity. The sensor nodes are placed at fixed positions, in a more or less random manner.
The deployment area is easily accessible and some infrastructure (e.g., power supplies and computers)
already exists. At the same time, the nodes are operating in a safe environment, meaning that there are
no major external factors that can influence or destroy them.
Environmental monitoring. This is the widest area of application envisioned up to now. A particular
application in this category is disaster monitoring. The sensor nodes deployed in the affected areas will
help humans estimate the effects of the disaster, build maps of the safe areas, and direct the human actions
toward the affected regions. A large number of applications in this category address monitoring of the
wild life. This scenario has an increased complexity. The area of deployment is no longer accessible in an
easy manner and no longer safe for the sensor nodes. There is hardly any infrastructure present, nodes
have to be scattered around in a random manner and the network might contain moving nodes. Also
a larger number of nodes will have to be deployed.
Very-large-scale sensor network applications. Consider the scenario of a large city where all the cars have
integrated sensors. These sensor nodes will communicate with each other, collecting information about the traffic,
routes, and special traffic conditions. On the one hand, new information will be available to the driver of each
car. On the other hand, a global view of the whole picture will also be available. The two main constraints
that characterize this scenario are the large number of nodes and their high mobility. The algorithms
employed will have to scale well and deal with a network with a continuously changing topology.
On the other hand, the authors of Reference 5 present a classication of sensor networks based on their
area of application. It takes into consideration only the military, environment, health, home, and other
commercial areas and can be extended with additional categories, such as space exploration, chemical
processing, and disaster relief.
Military applications. Factors such as rapid deployment, self-organization, and increased fault
tolerance make wireless sensor networks a very good candidate for use in the military field. They
are suited for deployment in battlefield scenarios due to the large size of the network and the
automatic self-reconfiguration at the moment of destruction/unavailability of some sensor nodes [6]. Typical
applications are: the monitoring of friendly forces, equipment, and ammunition; battlefield surveillance;
reconnaissance of opposing forces and terrain; targeting and battle damage assessment; and nuclear,
biological, and chemical attack detection and reconnaissance. A large number of projects have already
been sponsored by the Defense Advanced Research Projects Agency (DARPA) [7].
Environmental applications. Several aspects of wildlife are being studied with the help of sensor
networks. Existing applications include the following: monitoring the presence and the movement of
birds, animals, and even insects; agriculture-related projects observing the conditions of crops and
livestock; environmental monitoring of soil, water, and atmosphere, and pollution studies; etc.
Other particular examples include forest fire monitoring, biocomplexity mapping of the environment,
and flood detection. Ongoing projects at this moment include the monitoring of birds on Great Duck
Island [8], the zebras in Kenya [9], and the redwoods in California [10]. The number of these applications is
continuously increasing as the first deployed sensor networks show the benefits of easy remote monitoring.
Healthcare applications. An increasing interest is being shown in the elderly population [11]. Sensor
networks can help in several areas of the healthcare field. The monitoring can take place both at home and
in hospitals. At home, patients can be under permanent monitoring, and the sensor networks will trigger
alerts whenever there is a change in the patient's state. Systems that can detect patients' movement behavior at
home, detect any fall, or remind them to take their prescriptions are being studied. Inside hospitals,
sensor networks can be used to track the position of doctors and patients (their status or even
errors in the medication), expensive hardware, etc. [12].
Home applications. The home is the perfect application domain for the pervasive computing field.
Imagine all the electronic appliances forming a network and cooperating to fulfill the needs of
the inhabitants [13]. They will have to identify each user correctly, remember their preferences and their
habits, and at the same time, monitor the entire house for unexpected events. The sensor networks also
have an important role here, being the eyes and the ears that will trigger the actuator systems.
Other commercial applications. This category includes all the other commercial applications, envisioned
or already built, that do not fit in the previous categories. Basically, they range from simple systems, such as
environmental monitoring within an office, to more complex applications, such as managing inventory
control and vehicle tracking and detection. Other examples include incorporating sensors into toys and
thus detecting the position of the children in smart kindergartens [14]; monitoring the material fatigue
and the tensions inside the walls of a building; etc.
The number of research projects dedicated to wireless sensor networks has increased dramatically over
the last few years. A great deal of effort has been invested in studying all possible aspects of wireless sensor networks.
2006 by Taylor & Francis Group, LLC
Introduction to Wireless Sensor Networks 31-5
TABLE 31.1 List of Sensor Networks Related Research Projects
Project name Research area
CoSense [15] Collaborative sensemaking (target recognition, condition monitoring)
EYES [16] Self-organizing, energy-efficient sensor networks
PicoRadio [17] Low-cost, energy-efficient transceivers
SensoNet [18] Protocols for sensor networks
Smart Dust [19] Cubic-millimeter sensor nodes
TinyDB [20] Query processing system
WINS [21] Distributed network access to sensors, controls, and processors
TABLE 31.2 Current Sensor Networks Companies List
Company name Headquarters location HTTP address
Ambient Systems The Netherlands http://www.ambient-systems.net
Crossbow San Jose, CA http://www.xbow.com
Dust Networks Berkeley, CA http://dust-inc.com
Ember Boston, MA http://www.ember.com
Millennial Net Cambridge, MA http://www.millennial.net
Sensoria Corporation San Diego, CA http://www.sensoria.com
Xsilogy San Diego, CA http://www.xsilogy.com
Please refer to Table 31.1 for a few examples. Also, a number of companies have been created, most of them
start-ups from universities that perform research in the field. Some of the names in the field, valid at
the time of writing, are listed in Table 31.2.
31.4 Design Challenges
When designing a wireless sensor network, one faces, on the one hand, the simplicity of the underlying
hardware and, on the other hand, the requirements that have to be met. In order to satisfy them, new
strategies and new sets of protocols have to be developed [22-24]. In the following paragraphs we will
address the main challenges present in the wireless sensor network field. The research directions
involved and the open questions that still need to be answered will be presented as well.
To begin with, a high-level description of the current goals for sensor networks can be
synthesized as follows:
Long life. The sensor node should be able to live as long as possible using its own batteries. This
constraint can be translated to a power consumption below 100 µW. The condition arises from the assumption
that the sensor nodes will be deployed in a harsh environment where maintenance is either impossible or
has a prohibitively high price. It makes sense to maximize the battery lifetime (unless the sensor nodes
use some form of energy scavenging). The targeted lifetime of a node powered by two AA batteries is
a couple of years. This goal can be achieved only by applying a strict energy policy that makes use of
power-saving modes and dynamic voltage scaling techniques.
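As a sanity check on the power budget above, a back-of-the-envelope sketch; the cell capacity and count are assumptions (typical AA alkaline figures), not values given in the text:

```python
# Ideal lifetime of a node drawing 100 uW average from two AA cells.
capacity_mAh = 2500          # assumed per-cell capacity
cells = 2
voltage = 1.5                # nominal cell voltage
energy_J = cells * capacity_mAh / 1000 * 3600 * voltage  # mAh -> C -> J
power_W = 100e-6             # the 100 uW budget from the text
lifetime_years = energy_J / power_W / (365 * 24 * 3600)
print(f"{lifetime_years:.1f} years")  # ~8.6 years, ignoring overheads and self-discharge
```

Real deployments fall well short of this ideal, which is consistent with the "couple of years" target stated above.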
Small size. The size of the device should be below 1 mm³. This constraint gave the sensor nodes the name of
smart dust, a name that gives a very intuitive idea about the final design. Recently, the processor and the radio
were integrated in a chip having a size of 1 mm³. What is left is the antenna, the sensors themselves,
and the battery. Advances are required in each of these three fields in order to be able to meet this design
constraint.
Inexpensive. The third high-level design constraint concerns the price of these devices. In order to
encourage large-scale deployment, this technology must be very cheap, meaning that the targeted prices
are in the range of a couple of cents per node.
31.4.1 Locally Available Resources
Wireless sensor networks consist of thousands of devices working together. Their small size also comes with
the disadvantage of very limited resource availability (limited processing power, low-rate unreliable wireless
communication, small memory footprint, and low energy). This raises the issue of designing a new
set of protocols across the whole system.
Energy is of special importance and can by far be considered the most important design constraint.
The sensor nodes will be mainly powered by batteries. In most of the scenarios, due to the environment
where they will be deployed, it will be impossible to have a human change their batteries. In some designs,
energy-scavenging techniques will also be employed. Still, the amount of energy available to the nodes
can be considered limited, and this is why the nodes will have to employ energy-efficient algorithms to
maximize their lifetime.
By taking a look at the characteristics of the sensor nodes, we notice that energy is spent on three
main functions: environment sensing, wireless communication, and local processing. Each of these three
components will have to be optimized in order to obtain minimum energy consumption. For the
environment-sensing component, the most energy-efficient available sensors have to be used. From this
point of view, we can regard this component as a function of a specific application and a given sensor
technology.
The energy needed for transmitting data over the wireless channel dominates by far the energy
consumption inside a sensor node. Moreover, it was previously shown that it is more efficient to use
a short-range multihop transmission scheme than to send data over large distances [5]. A new strategy,
based on a trade-off between the last two components, was developed and is, in fact, one of the
defining characteristics of sensor networks (see, e.g., the techniques developed in
References 25 and 26). Instead of blindly routing packets through the network, the sensor nodes will act
based on the content of the packet [27].
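The short-range multihop claim above can be illustrated with the widely used first-order radio model; the constants `e_elec` and `e_amp`, the path-loss exponent, and the distances below are illustrative assumptions, not figures from the text:

```python
# Energy for one long hop vs. several short hops under a d**alpha path-loss model.
def tx_energy(d, bits=1000, e_elec=50e-9, e_amp=100e-12, alpha=2):
    """Transmit energy (J) for `bits` over distance d meters:
    per-bit electronics cost plus amplifier cost growing as d**alpha."""
    return bits * (e_elec + e_amp * d**alpha)

direct = tx_energy(100)        # one 100 m hop
multihop = 4 * tx_energy(25)   # four 25 m hops (relay receive cost ignored here)
print(direct, multihop)
```

With alpha = 2 the four short hops already win; the advantage grows quickly for alpha = 3 or 4, although the receive cost at each relay narrows the gap in practice.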
Let us suppose that a certain event took place. All nodes that sensed it will characterize the event with
some piece of data that needs to be sent to the interested nodes. There will be many similar data packets,
or at least, some redundancy will exist in the packets to be forwarded. In order to reduce the traffic,
each node on the communication path will examine the contents of the packets it has to forward. It will
then aggregate all the data related to a particular event into one single packet, eliminating the redundant
information. The reduction of traffic by using this mechanism is substantial. Another consequence of
this mechanism is that the user will not receive any raw data, but only high-level characterizations of the
events. This makes us think of the sensor network as a self-contained tool, a distributed network that
collects and processes information.
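The aggregation step described above can be sketched as follows; the (event id, value) packet format and the averaging rule are hypothetical simplifications of what a real protocol such as directed diffusion [27] does:

```python
from collections import defaultdict

def aggregate(packets):
    """Merge reports about the same event into one summary packet.
    packets: list of (event_id, value) tuples received from downstream nodes.
    Returns one (event_id, mean_value, report_count) tuple per event."""
    groups = defaultdict(list)
    for event_id, value in packets:
        groups[event_id].append(value)
    # Forward one packet per event instead of one per reporting node.
    return [(eid, sum(vs) / len(vs), len(vs)) for eid, vs in groups.items()]

print(aggregate([("fire", 80.0), ("fire", 84.0), ("door", 1.0)]))
```

Three inbound packets become two outbound ones here; in a dense network where many neighbors sense the same event, the traffic reduction is far larger.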
From an algorithmic point of view, the local strategies employed by sensor nodes have as a global goal
to extend the overall lifetime of the network. The notion of network lifetime usually hides one
of the following interpretations: the time elapsed between power-on and a particular
event, such as the energy depletion of the first node or of 30% of the nodes, or even the moment when
the network is split into several subnetworks. No matter which of these concepts is used, the nodes
will choose to participate in the collaborative protocols following a strategy that maximizes the overall
network lifetime.
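The first two lifetime interpretations can be written directly as functions of per-node battery-depletion times; the input list below is made up for illustration (the partition-based definition needs topology information and is omitted):

```python
def first_node_death(death_times):
    """Lifetime = time at which the first node depletes its battery."""
    return min(death_times)

def fraction_dead(death_times, fraction=0.3):
    """Lifetime = time at which `fraction` of the nodes have died."""
    k = max(1, int(len(death_times) * fraction))
    return sorted(death_times)[k - 1]

times = [120, 300, 310, 450, 500, 520, 800, 900, 950, 1000]  # hypothetical hours
print(first_node_death(times), fraction_dead(times))  # 120 310
```

The two metrics can differ enormously, which is why a protocol tuned for one definition may look poor under another.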
To be able to meet the goal of prolonged lifetime, each sensor node should:
Spend all the idle time in a deep power-down mode, thus using an insignificant amount of energy.
When active, employ scheduling schemes that take into consideration voltage and frequency scaling.
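The first point can be quantified with a simple duty-cycling model; the active and sleep power draws below are assumed ballpark figures for early mote-class hardware, not values from the text:

```python
# Average power of a node that sleeps most of the time.
def avg_power_W(active_fraction, p_active_W=30e-3, p_sleep_W=3e-6):
    """Time-weighted average of active and deep-sleep power draw."""
    return active_fraction * p_active_W + (1 - active_fraction) * p_sleep_W

for duty in (1.0, 0.01, 0.001):
    print(f"{duty:>6.1%}: {avg_power_W(duty) * 1e6:.0f} uW")
# Only sub-1% duty cycles reach the <100 uW regime targeted earlier.
```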
It is interesting to note, at the same time, the contradiction between the wireless industry trends and the
requirements of wireless sensor nodes. The industry currently focuses on achieving more bits/sec/Hz, while
sensor nodes need more bits per euro and per nJ. From the transmission-range point of view, sensor nodes
need only a limited transmission range in order to operate at an optimally calculated energy consumption, while
the industry is interested in delivering higher transmission ranges for the radios. Similarly, the radios
designed nowadays tend to be as reliable as possible, while a wireless sensor network is based on the
assumption that failures are regular events.
Energy is not the only resource the sensor nodes have to worry about. Processing power and
memory are also limited. Large local data stores cannot be employed, so strategies need to be developed
to store the most important data in a distributed fashion and to report the important events
to the outside world. A feature that helps in dealing with these issues is the heterogeneity of the network.
There might be several types of devices deployed. Resource-poor nodes will be able to ask more powerful
nodes to perform complicated computations. At the same time, several nodes could associate themselves
in order to perform the computations in a distributed fashion.
Bandwidth is also a constraint when dealing with sensor networks. The low-power communication
devices used (most of the time, radio transceivers) can only work in simplex mode. They offer low data rates,
due also to the fact that they operate in the free unlicensed bands, where traffic is strictly regulated.
31.4.2 Diversity and Dynamics
As we already suggested, there may be several kinds of sensor nodes present inside a single sensor network.
We can talk of heterogeneous sensor nodes from the points of view of both hardware and software. From the
hardware point of view, it seems reasonable to assume that the number of devices of a certain kind will
be inversely proportional to the capabilities they offer. We may see a tiered architecture
design, where the resource-poor nodes ask more powerful or specialized nodes to make more accurate
measurements of a certain detected phenomenon, to perform resource-intensive operations, or even to
help in transmitting data over a longer distance.
Diversity can also refer to sensing several parameters and then combining them into a single decision,
or, in other words, performing data fusion. This means assembling information from
different kinds of sensors, such as light, temperature, sound, smoke, etc., to detect, for example, whether a fire
has started.
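A minimal sketch of such a fusion rule; the thresholds, the weighting of the smoke cue, and the two-vote decision rule are entirely illustrative assumptions:

```python
# Combine heterogeneous readings into one fire/no-fire decision.
def fire_detected(temp_c, smoke_ppm, light_lux):
    """Weighted vote over three sensing modalities (thresholds are made up)."""
    votes = 0
    votes += 1 if temp_c > 60 else 0
    votes += 2 if smoke_ppm > 300 else 0   # smoke is treated as the strongest cue
    votes += 1 if light_lux > 2000 else 0  # brightness as a crude flame proxy
    return votes >= 2                      # require agreement of at least two cues

print(fire_detected(72.0, 450.0, 150.0))  # True
print(fire_detected(72.0, 40.0, 150.0))   # False: temperature alone is not enough
```

Requiring agreement between modalities is exactly what makes the fused decision more robust than any single sensor reading.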
Sensor nodes will be deployed in the real world, most probably in harsh environments. This puts them
in contact with an environment that is dynamic in many senses and has a big influence on the algorithms
that the sensor nodes should execute. First of all, the nodes will be deployed in a random fashion in the
environment and, in some cases, some of them will be mobile. Second, the nodes will be subject to failures
at random times, and they will also be allowed to change their transmission range to better suit their energy
budget. This leads to the full picture of a network topology in continuous change. One characteristic of the
algorithms for wireless sensor networks is that they do not require a predefined,
well-known topology.
One more consequence of the real-world deployment is that there will be many factors influencing the
sensors in contact with the phenomenon. Individual calibration of each sensor node will not be feasible,
and probably would not help much, as the external conditions will be in continuous change. The sensor
network will calibrate itself in response to changes in the environmental conditions. More than that, the
network will be capable of self-configuration and self-maintenance.
Another issue we need to address is the dynamic nature of the wireless communication medium.
Wireless links between nodes can periodically appear or disappear due to the particular position of each
node. Bidirectional links will coexist with unidirectional ones, a fact that the algorithms for
wireless sensor networks need to take into account.
31.4.3 Needed Algorithms
For a sensor network to work as a whole, some building blocks need to be developed and deployed in
the vast majority of applications. Basically, they are: a localization mechanism, a time synchronization
mechanism, and some form of distributed signal processing. A simple justification is that data hardly
have any meaning if position and time values are not available with them. Full, complex signal processing
done separately at each node will not be feasible due to the resource constraints.
The self-localization of sensor nodes has gained a lot of attention lately [28-31]. This came as a response to the
fact that global positioning systems are not a solution, due to their high cost (in terms of money and resources),
and because they are unavailable, or provide imprecise positioning information, in special environments such as
indoors. Information such as connectivity, distance estimation based on radio signal strength, sound
intensity, time of flight, angle of arrival, etc., has been used successfully to determine the position of each
node within degrees of accuracy using only localized computation.
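As one example of the radio-signal-strength approach, a log-distance path-loss model can be inverted to estimate range from a received signal strength indicator (RSSI); the reference power, reference distance, and path-loss exponent below are assumptions chosen for illustration:

```python
# Invert P(d) = P(d0) - 10*n*log10(d/d0) to estimate distance from RSSI.
def rssi_to_distance(rssi_dbm, rssi_d0=-40.0, d0=1.0, n=2.7):
    """rssi_d0: assumed received power (dBm) at reference distance d0 (m);
    n: assumed path-loss exponent (~2 free space, 2.7-4 indoors)."""
    return d0 * 10 ** ((rssi_d0 - rssi_dbm) / (10 * n))

print(round(rssi_to_distance(-67.0), 1))  # 10.0 (meters)
```

Such ranges are noisy in practice, which is why localization algorithms combine many pairwise estimates rather than trusting any single one.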
The position information, once obtained, was used not only for characterizing the data, but also in
designing the networking protocols, for example, leading to more efficient routing schemes based on the
estimated positions of the nodes [32].
The second important building block is the timing and synchronization block. Nodes will be allowed
to function in a sleep mode for long periods of time, so periodic wake-up intervals need to be computed
with a certain precision. However, a notion of local time and synchronization with the neighbors is
needed for the communication protocols to perform well. Lightweight algorithms have been developed
that allow fast synchronization between neighboring nodes using a limited number of messages. Loose
synchronization will be used, meaning that each pair of neighboring nodes is synchronized within a certain
bound, while nodes situated multiple hops away might not be synchronized at all.
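One common pairwise scheme estimates clock offset and propagation delay from a single two-way message exchange (the formula used in TPSN-style protocols); the timestamps below are made-up values:

```python
# Two-way exchange: A sends at t1, B receives at t2, B replies at t3,
# A receives at t4. t1/t4 are on A's clock, t2/t3 on B's clock.
def estimate_offset(t1, t2, t3, t4):
    """Return (offset of B's clock relative to A's, one-way delay)."""
    offset = ((t2 - t1) - (t4 - t3)) / 2
    delay = ((t2 - t1) + (t4 - t3)) / 2
    return offset, delay

print(estimate_offset(100.0, 152.0, 153.0, 105.0))  # (50.0, 2.0)
```

A single exchange of two short messages suffices, which is what makes this approach attractive under the energy constraints discussed earlier.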
A global notion of time might not be needed at all in most applications. Because many
applications measure natural phenomena, such as temperature, where delays on the order of seconds
can be tolerated, trading latency for energy is preferred.
The last important block is the signal processing unit. A new class of algorithms has to be developed
due to the distributed nature of wireless sensor networks. In their vast majority, existing signal processing
algorithms are centralized algorithms that require large computation power and the availability of all
the data at the same time. Transmitting all the recorded data to all nodes is impossible in a dense network
even from a theoretical point of view, not to mention the energy needed for such an operation. The new
distributed signal processing algorithms have to take into account the distributed nature of the network,
the possible unavailability of data from certain regions due to failures, and the time delays that might be
involved.
31.4.4 Dependability
More than any other sort of computer network, wireless sensor networks are subject to failures.
Unavailability of services will be considered a feature of these networks, a regular event rather than
a sporadic and highly improbable one. The probability of something going wrong is at least several
orders of magnitude higher than in all other computer networks.
All the algorithms have to employ some form of robustness against the failures that might affect
them. On the other hand, robustness comes at the cost of energy, memory, and computation power, so it has to
be kept at a minimum. An interesting issue is that of the system architecture from the protocols' point
of view. In traditional computer networks, each protocol stack is designed for the worst-case scenario.
This scenario hardly ever happens simultaneously for all the layers, and a combination of lower-layer
protocols could eliminate such a scenario. This leads to a lot of redundancy in the sensor node, redundancy
that costs important resources. The preferred approach is that of cross-layer design, studying the
sensor node as a whole rather than as separate building blocks. This opens a discussion on the
topic of what the right architecture for all sensor networks is, and whether a solution that fits all scenarios
makes sense at all.
Let us summarize the sources of errors the designer will be facing: nodes will stop functioning, starting
even with the (rough) deployment phase. The harsh environment will continuously degrade the
performance of the nodes, making them unavailable as time passes. The wireless communication
medium will be an important factor disturbing message communication and affecting the links and,
implicitly, the network topology. Even in a perfect environment, collisions will occur due to imprecise
local time estimates and lack of synchronization. Furthermore, probabilistic scheduling policies and
protocol implementations can themselves be considered sources of errors.
Another issue that can be addressed as a dependability attribute is security. The communication
channel is open and cannot be protected. This means that others are able to intercept and disrupt
the transmissions, or even to transmit their own data. In addition to accessing private information, a third
party could also act as an attacker who wants to disrupt the correct functioning of the network.
Security in a sensor network is a hard problem that still needs to be solved. Like almost any other protocol
in this sort of network, it has contradictory requirements: the schemes employed should be as lightweight as
possible while achieving the best results. The usual protection schemes require too much memory and
computation power to be employed (the keys themselves are sometimes too big to fit into the limited
available memory).
A real problem is how to control the sensor network itself. The sensor nodes will be too many to
be individually accessible to a single user and might also be deployed in an inaccessible environment.
By control we mean issues such as deployment and installation, configuration, calibration and
tuning, maintenance, discovery, and reconfiguration. Debugging the code running in the network is
completely infeasible, as at any point inside, the user has access only to high-level aggregated results.
The only real debugging and testing can be done with simulators, which prove to be invaluable resources in
the design and analysis of sensor networks.
31.5 Conclusions
This chapter was a brief introduction to the new field of wireless sensor networks. It provided a short
overview of the main characteristics of this new set of tools that will soon enhance our perception
capabilities regarding the ambient world.
The major challenges have been identified, some initial steps have been taken, and early prototypes are
already working. The following chapters of the book will focus on particular issues, giving more insight
into the current state of the art in the field. The research in this area will certainly continue, and there may
come a time when sensor networks will be deployed all around us and will become regular instruments
available to everyone.
References
[1] Weiser, M. The computer for the 21st century. Scientific American, 265, 66-75, 1991.
[2] Whitehouse, K. and Culler, D. Calibration as parameter estimation in sensor networks.
In Proceedings of the ACM International Workshop on Wireless Sensor Networks and Applications
(WSNA'02). Atlanta, GA, 2002.
[3] Want, R., Hopper, A., Falcao, V., and Gibbons, J. The active badge location system. ACM
Transactions on Information Systems, 10, 91-102, 1992.
[4] Estrin, D., Govindan, R., Heidemann, J., and Kumar, S. Next century challenges: scalable
coordination in sensor networks. In Proceedings of the International Conference on Mobile Computing and
Networking. ACM/IEEE, Seattle, WA, 1999, pp. 263-270.
[5] Akyildiz, I., Su, W., Sankarasubramaniam, Y., and Cayirci, E. Wireless sensor networks: a survey.
Computer Networks Journal, 38, 393-422, 2002.
[6] Brooks, R.R., Ramanathan, P., and Sayeed, A.M. Distributed target classification and tracking in
sensor networks. Proceedings of the IEEE, 91, 1163-1171, 2003.
[7] DARPA. http://www.darpa.mil/body/off_programs.html.
[8] Polastre, J., Szewczyk, R., and Culler, D. Analysis of wireless sensor networks for habitat monitoring.
In Wireless Sensor Networks, C.S. Ragavendra, K.M. Sivalingam, and T. Znati, Eds. Kluwer Academic
Publishers, Dordrecht, 2004.
[9] Juang, P., Oki, H., Wang, Y., Martonosi, M., Peh, L., and Rubenstein, D. Energy-efficient computing
for wildlife tracking: design tradeoffs and early experiences with ZebraNet. In Proceedings of the
Tenth International Conference on Architectural Support for Programming Languages and Operating
Systems (ASPLOS-X). San Jose, CA, 2002.
[10] Yang, S. Redwoods go high-tech: researchers use wireless sensors to study California's state tree.
UC Berkeley News, 2003.
[11] IEEE Computer Society. Pervasive Computing, 3, Successful Aging, 2004.
[12] Baldus, H., Klabunde, K., and Muesch, G. Reliable set-up of medical body-sensor networks.
In Proceedings of the First European Workshop on Wireless Sensor Networks (EWSN 2004). Berlin,
Germany, 2004.
[13] Basten, T., Geilen, M., and Groot, H. Omnia fieri possent. In Ambient Intelligence: Impact on
Embedded System Design. Kluwer Academic Publishers, Dordrecht, 2003, pp. 1-8.
[14] Srivastava, M., Muntz, R., and Potkonjak, M. Smart kindergarten: sensor-based wireless networks
for smart developmental problem-solving environments (challenge paper). In Proceedings of the
Seventh Annual International Conference on Mobile Computing and Networking. ACM, Rome, Italy,
2001, pp. 132-138.
[15] CoSense. http://www2.parc.com/spl/projects/ecca.
[16] EYES. http://eyes.eu.org.
[17] PicoRadio. http://bwrc.eecs.berkeley.edu/research/pico_radio.
[18] SensoNet. http://users.ece.gatech.edu/ weilian/sensor/index.html.
[19] Smart Dust. http://robotics.eecs.berkeley.edu/pister/smartdust.
[20] TinyDB. http://telegraph.cs.berkeley.edu/tinydb.
[21] WINS. http://www.janet.ucla.edu/wins.
[22] Estrin, D., Culler, D., Pister, K., and Sukhatme, G. Connecting the physical world with pervasive
networks. IEEE Pervasive Computing, 1, 59-69, 2002.
[23] Akyildiz, I., Su, W., Sankarasubramaniam, Y., and Cayirci, E. A survey on sensor networks. IEEE
Communications Magazine, 40, 102-114, 2002.
[24] Pottie, G.J. and Kaiser, W.J. Wireless integrated network sensors. Communications of the ACM, 43,
51-58, 2000.
[25] Chlamtac, I., Petrioli, C., and Redi, J. Energy-conserving access protocols for identification
networks. IEEE/ACM Transactions on Networking, 7, 51-59, 1999.
[26] Schurgers, C., Raghunathan, V., and Srivastava, M.B. Power management for energy-aware
communication systems. ACM Transactions on Embedded Computing Systems, 2, 431-447, 2003.
[27] Intanagonwiwat, C., Govindan, R., Estrin, D., Heidemann, J., and Silva, F. Directed diffusion for
wireless sensor networks. IEEE/ACM Transactions on Networking, 11, 2003.
[28] Bulusu, N., Heidemann, J., and Estrin, D. GPS-less low-cost outdoor localization for very small
devices. IEEE Personal Communications, 2000, pp. 28-34.
[29] Doherty, L., Pister, K., and Ghaoui, L. Convex position estimation in wireless sensor networks.
In IEEE INFOCOM. Anchorage, AK, 2001.
[30] Langendoen, K. and Reijers, N. Distributed localization in wireless sensor networks: a quantitative
comparison. Computer Networks, Special Issue on Wireless Sensor Networks, 2003.
[31] Evers, L., Dulman, S., and Havinga, P. A distributed precision-based localization algorithm for
ad hoc networks. In Proceedings of Pervasive Computing (PERVASIVE 2004), 2004.
[32] Zorzi, M. and Rao, R. Geographic random forwarding (GeRaF) for ad hoc and sensor networks:
energy and latency performance. IEEE Transactions on Mobile Computing, 2(4), 337-348, 2003.
32
Issues and Solutions in Wireless Sensor Networks
Ravi Musunuri,
Shashidhar Gandham,
and Maulin D. Patel
University of Texas at Dallas
32.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-1
Sensor Networks versus Mobile ad hoc Networks
32.2 System Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-3
Operational Model • Radio Propagation Model
32.3 Design Issues in Sensor Networks . . . . . . . . . . . . . . . . . . . . . . 32-4
32.4 MAC Layer Protocols . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-5
32.5 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-5
Flat Routing Protocols • Cluster-Based Routing Protocols
32.6 Other Important Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-7
Security • Location Determination • Lifetime Analysis •
Power Management • Clock Synchronization • Reliability •
Sensor Placement and Organization for Coverage and
Connectivity • Topology Control
32.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-13
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32-14
32.1 Introduction
Due to advances in integrated circuit (IC) fabrication technology and microelectromechanical systems
(MEMS) [1, 2], it is now commercially feasible to manufacture ICs with sensing, signal processing,
memory, and other relevant components built into them. Such ICs, enabled with RF communication,
bring forth a new kind of network, which is self-organizing and application specific. These networks are
referred to as wireless sensor networks.
A sensor network is a static ad hoc network consisting of hundreds of sensor nodes deployed on the
fly for unattended operation. Each node consists of [3-5] sensors, a processor, memory, a radio, a limited
power battery, and software components, such as an operating system and protocols. The architecture of
a sensor node is completely dependent on the purpose of the deployment, but we can generalize the
architecture [2] as shown in Figure 32.1.
FIGURE 32.1 Sensor node architecture. [Block diagram: a processor block (CPU, memory, operating system and other software), sensors, a radio, and a battery with an AC/DC convertor.]
Sensor nodes are expected to monitor some surrounding environmental phenomena, process the data
obtained, and forward this data toward a base station located on the periphery of the sensor network.
Wireless sensor networks have numerous applications in fields such as surveillance, security,
environmental monitoring, habitat monitoring, smart spaces, precision agriculture, inventory tracking, and
healthcare [4].
The main advantage of sensor networks is their ability to be deployed in almost any kind of remote
terrain. Their unattended mode of operation makes them a preferable choice over ground-based radar
systems [5]. The spatial distribution of sensor nodes ensures a greater signal-to-noise ratio (SNR), obtained by
combining signals from various sensors. Furthermore, the higher level of redundancy allows greater fault
tolerance. As sensor nodes are expected to be manufactured at a very low price, they can be deployed in large
numbers. As a result, sensor networks can provide a large coverage area through the union of the individual
nodes' coverage areas. Since sensor nodes are expected to be deployed close to the object of interest,
obstruction of the line of sight for sensing activity is ruled out.
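The SNR claim can be checked numerically: averaging N independent noisy readings of the same signal reduces the noise standard deviation by roughly the square root of N. A small simulation, where the signal value, noise level, and sample counts are arbitrary choices:

```python
import random
import statistics

random.seed(1)
signal = 5.0

def reading():
    """One sensor reading: true signal plus unit-variance Gaussian noise."""
    return signal + random.gauss(0, 1.0)

# Error spread of a single sensor vs. the average of 16 sensors.
single_err = statistics.pstdev([reading() - signal for _ in range(10000)])
avg16_err = statistics.pstdev(
    [statistics.fmean(reading() for _ in range(16)) - signal
     for _ in range(10000)])
print(round(single_err, 2), round(avg16_err, 2))  # noise std shrinks roughly 4x
```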
To illustrate the above-mentioned advantages, consider an example of seismic detection [4]. The earth
generates seismic noise, which becomes attenuated and distorted with distance. Hence, to increase the
probability of detection, it is advisable to have sensors closer to the source. To accomplish this, we would need
to know the exact location and time of the seismic activity in advance, which happens to be the goal of deploying
the sensors in the first place. If a distributed network of sensors were deployed across the entire geographical area of interest,
then there would be no requirement to pinpoint the locations where sensors need to be deployed.
32.1.1 Sensor Networks versus Mobile ad hoc Networks
Wireless sensor networks are significantly different from mobile ad hoc networks (MANETs) [6] for the
following reasons:
Mode of communication. In MANETs, potentially any node can send data to any other node. But in
sensor networks, the mode of communication is restricted. In general, the base station broadcasts commands
to all sensor nodes in its network, and the sensor nodes send sensed data back to the base station. Sometimes,
sensor nodes may need to forward sensed data to other sensor nodes if the base station is not reachable
directly. Depending on the application, some sensor networks will employ data aggregation at designated
nodes to reduce the bandwidth usage. Most sensor network messages are routed to base stations;
hence, sensor nodes need not maintain explicit routing tables.
Node mobility. In MANETs, every node can move. In general, sensor nodes are static, although some
architectures have mobile base stations [7].
Energy. Nodes in MANETs have a rechargeable source of energy; thus, energy conservation is of
secondary importance. However, sensor networks consist of several hundreds of nodes, which need to
operate in remote terrain. Hence, battery replacement is not possible, which makes energy efficiency
critical for sensor networks.
Apart from the above mentioned differences, sensor nodes have low computational power, less cost as
compared to MANETS nodes. Protocols designed for sensor networks should be more scalable. Since they
are expected to be deployed in hundreds.
The remainder of this chapter is organized as follows. In Section 32.2, we describe system
models used in the literature. Section 32.3 presents design issues in sensor networks. In Section 32.4,
medium access layer issues and a few solutions proposed in the literature are described. In Section 32.5, we
move on to flat routing protocols and hierarchical routing protocols. We then describe other important
issues, such as security, location determination, lifetime analysis, power management, and clock
synchronization.
32.2 System Models
Various system models proposed in the literature can be classified based on the following factors:
Mobility of base stations
Number of base stations
Method of organization (hierarchical/flat)
System models considered by researchers until now consist of static sensor nodes randomly deployed
in a geographical area of interest. This geographical area of interest is often referred to as the sensor field.
Most of the models considered have a single, static base station [6, 8–11]. In Reference 12, the author
evaluates the best position at which to locate a base station and proposes to split large sensor networks into
small squares and move the base station to the center of each square to collect the data. In Reference 7,
the authors propose to deploy multiple, intermittently mobile base stations to increase the lifetime of the
sensor network.
32.2.1 Operational Model
Research on sensor networks has so far considered various operational models for the sensor nodes.
These models can be broadly classified as follows:
Active. In active sensor networks [6, 8–11, 13], each sensor node senses its environment continuously.
Based on how frequently the sensed data is forwarded toward the base station, such sensor networks can
be further classified as
Periodic: Based on the application for which the sensor network is deployed, it might be required to
gather data from every sensor node periodically [8, 10].
Event driven: Sensor networks that are deployed for monitoring specific events gather data only
when the event of interest occurs [11, 13, 14]. For example, sensor nodes deployed to monitor
seismic activity in a region need to route data only when they detect seismic currents in their
proximity.
Passive. In the case of passive sensor networks, data forwarding is triggered by a query from the
base station. Passive sensor networks can be further classified as follows:
Energized on query: In this mode of operation, sensor nodes switch off their sensors most of the time.
Only when a query for data is generated does a sensor node switch on its sensor and record
the data to be forwarded.
Always sensing: Sensor nodes in this category keep their sensors running all the time. As soon
as a query for data arrives, a sensor node generates a data packet based on the observations made
so far and forwards it.
32.2.2 Radio Propagation Model
Most researchers have assumed that the energy spent in transmission over the wireless medium follows
the first-order radio model [8, 11]. In this model, the energy required to transmit a signal has a fixed part
and a variable part; the variable part is directly proportional to the square of the distance. A constant
amount of energy is required by the receiving antenna to receive a signal.
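The first-order radio model can be sketched in a few lines. The constants below are illustrative values of the kind used in the literature, not figures given in this chapter.

```python
# First-order radio model: energy to send or receive k bits over distance d.
# The constants are illustrative values of the kind used in the literature.
E_ELEC = 50e-9      # J/bit: electronics energy (the fixed part, paid by TX and RX)
EPS_AMP = 100e-12   # J/bit/m^2: amplifier energy (the distance-dependent part)

def tx_energy(k_bits, d_meters):
    """Transmit energy: fixed part plus a part proportional to d squared."""
    return E_ELEC * k_bits + EPS_AMP * k_bits * d_meters ** 2

def rx_energy(k_bits):
    """Receive energy: fixed part only."""
    return E_ELEC * k_bits

# Doubling the distance quadruples the amplifier (variable) part of the energy.
```

Because the variable part grows with the square of the distance, several short hops can cost less amplifier energy than one long hop, which motivates multihop forwarding.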
32.3 Design Issues in Sensor Networks
Most sensor networks encounter operational challenges [15], such as ad hoc deployment, limited energy
supply, dynamic environmental conditions, and unattended mode of operation. Any solution proposed
for sensor networks should consider the following design issues:
Energy. Each sensor node is equipped with a limited supply of battery energy. Sensor nodes spend more
energy on communication than on local computation.¹ As sensor nodes are deployed in large numbers,
it is not feasible to manually recharge their batteries. Thus, sensor nodes should conserve energy by
minimizing the number of messages that are transmitted. Based on the energy source, sensor nodes can
be classified as follows:
Rechargeable: Sensor nodes equipped with solar cells can recharge their batteries when sunlight is
available. For such sensor nodes, the main design criterion would be to maximize the number of
nodes operational during the periods when no sunlight is available.
Nonrechargeable: Sensor nodes equipped with nonrechargeable batteries cease to operate once
they drain their energy. Thus, the main design issue in such sensor networks would be to maximize
the operational time of every sensor node.
Bandwidth. Sensor nodes need to communicate over the ISM (industrial, scientific, and medical)
band. When many nodes attempt to use the same communication frequency, the available bandwidth
must be used optimally.
Limited computation power and memory. As the processing power at each sensor node is limited,
proposed solutions for sensor networks should not expect sensor nodes to carry out computationally
intensive tasks.
Unpredictable reliability, failure models. Sensor networks are expected to be deployed in inaccessible and
hostile environments. As a result, it is possible for sensor nodes to crash or malfunction due to external
environmental factors. The proposed solutions should be based on failure models that account for such
possibilities. Furthermore, the failure of a few nodes should not bring down the network.
Scalability. Sensor nodes are expected to be deployed in the thousands. As a result, scalability is a critical
issue in the design of sensor networks. Any solution proposed should scale to large sensor networks.
Timeliness of action (latency). Latency is an important issue in sensor networks deployed for critical
applications, such as security and surveillance. Hence, the time elapsed between when an event is
detected and when it is reported at the base station must be minimized.
To address these design challenges, several strategies, such as cooperative signal processing, exploiting
redundancy, adaptive signal processing, and hierarchical architectures, are going to be key building blocks
for sensor networks [3].
We believe that in the near future sensor networks will find acceptance in day-to-day activities as wide
as that of computers. To attain such wide-scale acceptance, sensor nodes should be affordable, easily
available, easily configurable (plug and play), and easily deployable. To accomplish these objectives we need
to come up with suitable Medium Access Control (MAC) layer protocols, routing protocols, location
discovery algorithms, power-management strategies, and solutions to other relevant problems. Some of
these design problems have been well studied by researchers. In the next section, we present a brief overview
of existing solutions for each of these design problems.
¹ To take an example for ground-to-ground communication [6]: it takes 3 J of energy to transmit 1 Kb of data over
a distance of 100 m. A general-purpose processor with a processing capability of 100 million instructions per second
would execute 300 million instructions for the same amount of energy.
32.4 MAC Layer Protocols
The Medium Access Control (MAC) layer provides topology information and channel allocation to the
higher layers in the protocol stack. Channel allocation is critical for energy-efficient functioning of the link
layer. Energy efficiency and scalability [16] are the main issues in developing MAC protocols for sensor
networks. Fairness, latency, and throughput are also important performance measures for channel allocation
algorithms. A channel could be a time slot in Time Division Multiple Access (TDMA), a frequency band
in Frequency Division Multiple Access (FDMA), or a code in Code Division Multiple Access (CDMA).
Channel allocation algorithms should try to avoid energy wastage through:
Collisions: when two or more nodes within direct transmission range of each other transmit
packets in the same channel.
Overhearing: when nodes receive data destined for other nodes.
Idle listening: unnecessarily listening to the channel when there are no packets to be received.
Control packet overhead: bandwidth wasted through the exchange of too many control packets.
The existing solutions to channel allocation in ad hoc networks can be divided into two categories:
contention-based and contention-free methods. In contention-based solutions, the sender continuously
senses the medium. IEEE 802.11 Distributed Coordination Function (DCF), MACAW [17], and
PAMAS [18] are examples of contention-based protocols. Contention-based schemes are not suitable
for sensor networks because of the energy wasted in collisions and idle listening [19].
Sensor networks should instead use organized methods for channel allocation. Organized methods of
channel allocation determine the network topology first and then assign channels to the links. A channel
assignment should avoid co-channel interference, that is, it should avoid assigning two consecutive links to
the same channel. A sensor network channel allocation algorithm should be distributed, because network-wide
synchronization for the calculation of a schedule would be an energy-intensive procedure. Another reason for
distributed algorithms is that they scale well with increasing network size and are robust to
network partitions and node failures.
In Reference 6, the authors proposed the Self-organizing MAC for Sensor networks (SMACS) protocol.
SMACS is a distributed protocol that enables nodes to discover their neighbors and build a network
topology for communication. SMACS builds a flat topology, that is, there are no clusters or cluster heads.
In SMACS, each node allocates channels to the links between itself and its neighbors within a TDMA frame
referred to as a super frame. In a given time slot, every node communicates with only one neighbor to avoid
interference. Nodes communicate intermittently and hence can power themselves off when they have
no data to send. The super frame schedule is divided into two periods. In the first, bootup period, nodes try
to discover neighbors and rebuild severed links. The second period is reserved for communication
between nodes. The authors of Reference 6 also proposed an Eavesdrop and Register (EAR) protocol to handle
channel allocation with moving base stations. In Piconet [20], the authors used a periodic sleep cycle to save
energy; here, if a node wants to communicate with a neighbor, it has to wait until it receives a broadcast
message from that neighbor. Wei et al. [16] proposed an energy-efficient MAC protocol known as S-MAC.
S-MAC saves energy by avoiding collisions, overhearing, and idle listening, at the cost of increased latency.
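The energy/latency trade-off behind such periodic listen/sleep schemes can be made concrete with some duty-cycle arithmetic. This is an illustrative sketch, not taken from Reference 16; the power figures and frame length are hypothetical.

```python
# Duty-cycle arithmetic behind S-MAC-style periodic listen/sleep.
# Power figures and frame length are hypothetical, for illustration only.
P_LISTEN = 15e-3   # W with the radio on
P_SLEEP = 15e-6    # W with the radio off

def avg_power(duty_cycle):
    """Average power when a node listens for a fraction `duty_cycle` of each frame."""
    return duty_cycle * P_LISTEN + (1 - duty_cycle) * P_SLEEP

def worst_case_wait(frame_s, duty_cycle):
    """A sender may wait almost a whole sleep period for the receiver to wake."""
    return frame_s * (1 - duty_cycle)

# A 10% duty cycle cuts average power by roughly 10x, but with a 1 s frame
# a message can wait up to ~0.9 s per hop: energy savings are bought with latency.
```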
32.5 Routing
As stated earlier, each sensor node is expected to monitor some environmental phenomenon and forward
the corresponding data toward the base station. To forward the data packets, each node needs routing
information. Note that the flow of packets is mostly directed from sensor nodes toward the base station;
as a result, sensor nodes need not maintain explicit routing tables. Routing protocols can in general be
divided into flat routing and cluster-based routing protocols.
32.5.1 Flat Routing Protocols
In flat routing protocols the nodes in the network are considered homogeneous. Each node in
the network participates in route discovery, route maintenance, and forwarding of data packets. Here,
we describe a few existing flat routing protocols for sensor networks.
Sequential Assignment Routing (SAR) [6] takes into consideration the energy and Quality of Service
(QoS) of each path, and the priority level of each packet, when making routing decisions. Every node
maintains multiple paths to the sink to avoid the overhead of route recomputation after a node or link failure.
Estrin et al. [21] proposed a diffusion-based scheme for routing queries from the base station to sensor
nodes and forwarding the corresponding replies. In directed diffusion, attribute-based naming is used
by the sensor nodes: each sensor names the data it generates using one or more attributes. A sink may
query for data by disseminating interests, which intermediate nodes propagate. Interests establish
gradients that draw data toward the sink that expressed the interest.
The minimum cost forwarding approach proposed by Ye et al. [9] exploits the fact that the data flow in
sensor networks is in a single direction, always toward the fixed base station. Their method requires sensor
nodes neither to have unique identities nor to maintain routing tables in order to forward messages. Each
node maintains the least-cost estimate from itself to the base station, and each message to be forwarded is
broadcast by the node. On receiving a message, a node checks whether it is on the least-cost path between
the source sensor node and the base station; if so, it forwards the message by broadcasting it.
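The forwarding test just described can be sketched as follows; the function and variable names are our own illustration, not code from Reference 9.

```python
# Sketch of a minimum-cost forwarding check in the spirit of Ye et al. [9].
# Each node knows only its own least-cost estimate to the base station; a
# message carries the source's optimal cost and the cost consumed so far.

def on_least_cost_path(source_cost, consumed_cost, my_cost, tol=1e-9):
    """A node rebroadcasts a message only if the cost already consumed plus its
    own remaining cost to the base station equals the source's optimal cost."""
    return abs(consumed_cost + my_cost - source_cost) <= tol

# Example: the source's least cost is 5; the message arrives having consumed 2.
assert on_least_cost_path(5.0, 2.0, 3.0)      # we lie on a least-cost path
assert not on_least_cost_path(5.0, 2.0, 4.0)  # detour: drop the message
```

Note how the check needs no node identities and no routing table, only the node's own cost estimate, which is the point of the scheme.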
In Reference 7, the authors model the sensor network as a flow network and propose an
ILP (Integer Linear Program)-based routing method. The objective of this ILP-based method is to
minimize the maximum energy spent by any sensor node during a period of time. Through simulation
results the authors have shown that their ILP-based routing heuristic increases the lifetime of the sensor
network significantly.
Kulik and coworkers [22] proposed a set of protocols to disseminate sensed data from a sensor to
the other sensor nodes. Sensor Protocols for Information via Negotiation (SPIN) overcome information
implosion and overlap by using negotiation and information descriptors (metadata). The authors proposed
different protocols for both point-to-point and broadcast channels.
32.5.2 Cluster-Based Routing Protocols
In cluster-based routing protocols, special nodes referred to as cluster heads discover and maintain routes,
and noncluster-head nodes join one of the clusters. All data packets originating in a cluster are
forwarded to the cluster head, which in turn forwards them toward the destination using its routing
information. Here, we describe some cluster-based routing protocols from the literature.
Chandrakasan et al. [23] proposed Low-Energy Adaptive Clustering Hierarchy (LEACH) as an energy-
efficient communication protocol for wireless sensor networks. In LEACH, self-elected cluster heads
collect data from all the sensor nodes in their cluster, aggregate the collected data using data fusion methods,
and transmit the result directly to the base station.
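The cluster-head self-election step in LEACH can be sketched with the published threshold rule; the code below is an illustrative rendering with our own names, not the authors' implementation.

```python
import random

# LEACH-style randomized cluster-head self-election. P is the desired fraction
# of cluster heads per round; a node that has already served as cluster head in
# the current epoch of 1/P rounds sits out until the epoch ends. The threshold
# rule follows the published formula; the variable names are our own.

def leach_threshold(P, r, was_head_this_epoch):
    """Probability threshold T(n) for round r."""
    if was_head_this_epoch:
        return 0.0
    return P / (1 - P * (r % round(1 / P)))

def elects_itself(P, r, was_head_this_epoch):
    """Each node draws independently; no coordination is needed."""
    return random.random() < leach_threshold(P, r, was_head_this_epoch)

# The threshold rises as the epoch progresses, reaching 1 in the final round,
# so cluster-head duty rotates over every node in the network.
```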
In Reference 11, the authors classified sensor networks into proactive and reactive networks.
Nodes in proactive networks continuously monitor the environment and thus have data to send at a
constant rate; LEACH suits such networks for transmitting data efficiently to the base station. In reactive
sensor networks, nodes need to transmit data only when an event of interest occurs, so the nodes do not all
have equal amounts of data to transmit. Manjeshwar et al. proposed the Threshold-sensitive
Energy-Efficient sensor Network (TEEN) protocol [11] for routing in reactive sensor networks.
Estrin et al. [21] proposed a two-level clustering algorithm that can be extended to build a cluster
hierarchy.
32.6 Other Important Issues
In this section we discuss other important issues: security, location determination, lifetime
analysis, power management, and clock synchronization. We describe why these issues are paramount for
the functioning of sensor networks and outline some solutions proposed in the literature.
32.6.1 Security
Security is a very critical issue for the envisioned mass deployment of sensor networks. In particular,
a strong security framework is a must in battlefield and border monitoring applications. The security
framework in sensor networks should meet the following objectives:
Authentication/nonrepudiation: Each sensor should be able to identify the sender of a message
correctly, and no node should be able to deny its previous actions.
Integrity: Messages sent over the wireless medium should not be altered by unauthorized entities.
Confidentiality: Messages should be kept secret from unauthorized entities.
Freshness: Messages received by sensors should be current.
32.6.1.1 Sensor Networks versus ad hoc Networks: Security Perspective
Sensor networks share some similarities with ad hoc networks, but security in sensor networks differs
from that in ad hoc networks for the following reasons:
Node power. Sensor nodes have a limited power supply and low computational capabilities compared to
ad hoc nodes. Asymmetric key encryption [24] schemes require far more computational power than
symmetric key encryption [24]; thus, sensor networks can only use symmetric key encryption. To use
symmetric key encryption mechanisms we need to address the key distribution problem.
Mode of communication. As stated earlier, most of the communication in sensor networks is from the
sensor nodes to the base station; at times, the base station issues commands to the sensor nodes. In this
mode of communication, every node may not need to share keys with every other node in its network.
Moreover, it is not practical for every node to store a key shared with every other node.
Node mobility. In ad hoc networks every node can move. In general, sensor nodes are static, although
some architectures have mobile base stations, as in Reference 7.
These differences make the security protocols of ad hoc networks, or of any other traditional networks,
impractical for sensor networks.
32.6.1.2 Proposed Security Protocols
Recently, there has been some work on sensor network security. Perrig et al. [25] proposed
SPINS: Security Protocols for Sensor Networks. The SPINS framework consists of two protocols that
together satisfy the security objectives: the Secure Network Encryption Protocol (SNEP) provides data
integrity, two-party authentication, and data freshness, and the micro Timed Efficient Streaming
Loss-tolerant Authentication protocol (µTESLA) provides authenticated broadcast. In SNEP, each sensor
node and the base station share a unique key, which is bootstrapped. This shared key and an incremental
message counter, maintained at both the sensor node and the base station, are used to derive new keys
using the RC5 [24] algorithm. In µTESLA, the sender generates a chain of keys using a one-way function
such as MD5 [24]. The important property of the key chain is that if the sender authenticates the initial
key, then the other keys in the chain are self-authenticating. The sender divides time into equal intervals
and assigns each interval a key from the chain. Sender and receiver agree upon a key disclosure schedule.
The first key from the chain is authenticated using unicast authentication. Thereafter, the receiver
authenticates packets after receiving the corresponding symmetric key from the sender as per the disclosure
schedule. Thus, µTESLA employs delayed disclosure of symmetric keys to authenticate packets once one
key in the chain has been authenticated.
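The self-authenticating key chain at the heart of µTESLA can be sketched as follows. The chapter mentions MD5 as the one-way function; this sketch uses SHA-256 purely for illustration, and all names are our own.

```python
import hashlib

# One-way key chain of the kind used by µTESLA. The sender generates the chain
# backwards with a one-way function F and discloses keys in forward order, one
# per time interval; a receiver holding an authenticated K_i can verify any
# later-disclosed K_j by hashing it forward j - i times.

def F(key):
    return hashlib.sha256(key).digest()

def make_chain(seed, n):
    """Return [K_0, ..., K_n] with K_i = F(K_{i+1}); the random seed is K_n."""
    chain = [seed]
    for _ in range(n):
        chain.append(F(chain[-1]))
    chain.reverse()
    return chain

def verify(disclosed_key, j, trusted_key, i):
    """Self-authentication: F applied (j - i) times to K_j must yield K_i."""
    key = disclosed_key
    for _ in range(j - i):
        key = F(key)
    return key == trusted_key

chain = make_chain(b"random-seed", 10)
# K_0 is authenticated once (by unicast); every later key then self-authenticates.
```

Because F is one-way, an attacker who sees the disclosed keys cannot compute a key for a future interval, which is what makes delayed disclosure safe.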
In Reference 26, the authors proposed a security framework based on broadcast with end-to-end
encryption of the data. This scheme avoids traffic analysis and also removes compromised and dead nodes
from the network. Sasha et al. [27] divided the messages in sensor networks into three classes depending
on the security required; each class of messages is encrypted using a different encryption key. They showed
that this multilevel scheme saves resources at the nodes.
In general, the base stations broadcast commands to all the sensor nodes; hence, secure broadcast is
a very important part of the security framework. In µTESLA, the authentication of the first key in the chain
is done using a unicast mechanism, and this unicast authentication mechanism has a scalability problem.
The authors of Reference 28 replaced the unicast-based mechanism with a broadcast-based mechanism that
avoids denial-of-service [24] attacks. In Reference 29, the authors proposed a routing-aware broadcast key
distribution algorithm. Karlof and Wagner [30] described possible attacks on different routing protocols
in the literature and suggested countermeasures.
The asymmetric key mechanism requires large computational power, bandwidth, and memory.
Therefore, sensor networks employ symmetric key encryption to satisfy the security objectives. Key
distribution [31–35] in symmetric key encryption mechanisms is another important issue in sensor
networks. Eschenauer and Gligor [31] proposed a probabilistic key-predistribution scheme. In this
scheme, every sensor node is given a small set of m keys out of a large set of available keys such
that every two sensor nodes share one common key with a given probability p. This scheme dramatically
reduces the number of keys stored in each sensor compared with storing a separate key for every
node in the network. In Reference 31, the authors proposed three extensions to this basic key distribution
scheme. In the first, the q-composite keys extension, sensor nodes share q common keys instead of one
key with a given probability p. This extension improves security against small-scale attacks, such as
eavesdropping on one link. The second, the multi-path extension, deals with setting up end-to-end
path keys between two communicating nodes. In this extension, a path key between two nodes is
established by sending random keys through every available path between them; the receiver uses all the
random keys received along all the paths to establish the path key. This improves security against large-scale
attacks, such as eavesdropping on many links. The third, the random pairwise keys scheme, provides
node-to-node authentication. In this scheme unique node identities are generated randomly. Every
node is randomly paired with m other nodes and m corresponding keys. Every node is aware of the
other node's identity in each pair and the corresponding key. This node identity information is used for
node-to-node authentication.
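The probability that two neighbors share at least one key under this kind of random predistribution follows from a simple counting argument. The pool and key-ring sizes below are illustrative, not parameters taken from Reference 31.

```python
import math

# Probability that two nodes share at least one key when each independently
# draws `ring_size` distinct keys from a pool of `pool_size` keys
# (Eschenauer-Gligor-style random predistribution).

def share_probability(pool_size, ring_size):
    """1 - Pr[the two key rings are disjoint]."""
    disjoint = (math.comb(pool_size - ring_size, ring_size)
                / math.comb(pool_size, ring_size))
    return 1 - disjoint

# A ring of just 75 keys from a 10,000-key pool already gives two neighbors
# a better than 40% chance of sharing a key directly.
p = share_probability(10_000, 75)
```

Links without a directly shared key can still be secured through a path of nodes that do share keys, which is why a connection probability well below 1 suffices in practice.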
32.6.2 Location Determination
Sensor nodes monitor surrounding phenomena, such as temperature, light, seismic currents, chemical
leaks, radiation, and other parameters of interest. After detecting an event, sensor nodes forward the
sensed data toward the nearest base station. In order to process any message reported by the sensor
network, the base station needs to know the sender's location. For example, if the sensor network
is deployed to detect forest fires, the base station should know the reporting sensor's location. Hence,
the base station needs to be aware of the location of every sensor node deployed in the network. In this
section we explain different solutions proposed in the literature for location determination in sensor
networks. The performance of locationing algorithms can be measured [36] by the following parameters:
Resolution: The smallest distance between nodes that can be distinguished by the locationing system.
Accuracy: The probability of the locationing system finding the correct location.
Robustness: The ability of the locationing system to find the correct location when subjected to node
failures and link failures.
The Global Positioning System (GPS) [37] has been used to locate outdoor nodes, but due to reflection
and multi-path fading GPS is not a viable option for indoor locationing. Since sensor nodes can be deployed
at indoor locations or on other planets, a GPS-based locationing system is not advisable. Many
non-GPS-based locationing solutions have been proposed by the research community. Most of these solutions
are either proximity based or beacon based. In proximity-based solutions, some nodes act as special nodes
whose locations are known. We can divide proximity-based solutions into two types. In the first type [38],
beacons are sent by the special nodes, from which the other nodes can approximate their locations. In the
second type [39], beacons are sent by the nonspecial nodes, from which the special nodes can approximate
the locations of the nonspecial nodes. Cricket [38] uses the difference in arrival times from
known beacons as the basis for finding the location. In RADAR [40], the authors used SNR as the basis for
finding the location of nodes. The SpotON [41] system finds the location of nodes in three-dimensional
space. These solutions can be adopted for location detection in sensor networks. In Reference 42,
the authors proposed a location detection scheme consisting of local positioning and global positioning.
In local positioning, nodes approximate their relative locations from anchor nodes, whose
locations are assumed known, using a triangulation method. Global positioning finds the global location
using a cooperative ranging approach, in which nodes iteratively converge to their global positions by
interacting with each other.
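The triangulation step in local positioning can be sketched as 2-D trilateration from three anchors. The linearization below is a standard textbook construction, not the specific algorithm of Reference 42, and the coordinates are illustrative.

```python
# Minimal 2-D trilateration: estimate a node's position from distance
# measurements to three anchors with known positions. Subtracting the circle
# equation of the first anchor from the other two yields a 2x2 linear system,
# solved here by Cramer's rule.

def trilaterate(anchors, dists):
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = dists
    # Linear system A [x, y]^T = b, from (x - xi)^2 + (y - yi)^2 = di^2 pairs.
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21   # zero iff the three anchors are collinear
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

# A node at (3, 4), measured from anchors at (0,0), (10,0), and (0,10):
x, y = trilaterate([(0, 0), (10, 0), (0, 10)], [5.0, 65 ** 0.5, 45 ** 0.5])
```

With noisy range measurements one would use more than three anchors and a least-squares fit, but the linearization is the same.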
Saikat et al. [36] proposed a robust location detection algorithm for emergency applications; none
of the solutions explained above dealt with robustness. Robustness is an important issue in emergency
scenarios, such as building collapses. Their location detection scheme improves robustness by finding
identifying codes. Estrin et al. [21] gave an interesting application of their clustering algorithm:
pinpointing the location of an illegitimate object. This algorithm is robust to link or node failures, and its
overhead is proportional to the local population density and a sublinear function of the total number of nodes.
32.6.3 Lifetime Analysis
Lifetime refers to the time period during which a sensor network is capable of sensing and transmitting the
sensed data to the base station(s). In sensor networks, thousands of nodes are powered by a very limited
supply of battery power. As a result, lifetime analysis becomes an important tool for using the
available energy efficiently. In sensor networks using rechargeable energy sources, such as solar energy,
lifetime analysis helps the nodes use their energy efficiently between recharges. Lifetime analysis may include
an upper bound on the lifetime and the factors influencing this upper bound.
A theoretical upper bound on the lifetime of a sensor network helps in understanding the efficiency of
other protocols. Bhardwaj et al. [14] proposed a theoretical upper bound on the lifetime of a sensor
network deployed for tracking the movement of external objects. In Reference 43, the authors found the
lifetime of a sensor network with hybrid automata modeling; hybrid automata are a mathematical method
for analyzing systems with both discrete and continuous behaviors. The authors used trace data to
analyze the power consumption and to estimate the lifetime of a sensor network.
32.6.4 Power Management
Sensor networks should operate with the minimum possible energy to increase the lifetime of the sensor
nodes. This requires power-aware computation/communication component technology, low-energy
signaling and networking, and a power-aware software infrastructure.
Design challenges encountered in building wireless sensor networks can be broadly classified
into hardware, wireless networking, and OS/applications. All three categories should minimize power
usage to increase the life of a sensor node. Hardware includes the design activities related to all the hardware
platforms that make up sensor networks; MEMS, digital circuit design, system integration, and RF are
important categories in hardware design. The second aspect covers the design of power-efficient
algorithms and protocols. In previous sections, we described a few energy-efficient protocols for MAC and
routing. Next, we present a few OS/application-level strategies related to power management in sensor
nodes.
Once the system is designed, additional power savings can be obtained by using Dynamic Power
Management (DPM) [44]. The basic idea behind DPM is to shut down devices (sleep mode) when they are
not needed and bring them back when required. This needs an embedded operating system [45] that is able
to support DPM. Switching a node from the sleep state to the active state takes finite time
and resources. Each sensor node could be equipped with multiple devices. The number of devices switched
off determines the level of the sleep state. Each sleep state is characterized by its latency and power
consumption: the deeper the sleep state, the lower the power consumption and the higher the latency. This
requires careful use of DPM to maximize the life of a sensor node. In many cases, however, it is not known
beforehand when a particular device will be required; hence, stochastic analysis should be applied to predict
future events.
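The sleep-state selection problem can be sketched as a break-even computation: a deeper state only pays off if its lower power offsets the extra wake-up cost over the predicted idle time. The state table below is hypothetical, not drawn from Reference 44.

```python
# Break-even reasoning behind DPM sleep-state selection.
# States, powers, and wake-up costs are hypothetical, for illustration only.
SLEEP_STATES = [
    # (name, power_W, wakeup_energy_J, wakeup_latency_s)
    ("idle",       5e-3, 0.0,   0.0),
    ("doze",       1e-3, 2e-3,  0.005),
    ("deep_sleep", 1e-4, 20e-3, 0.05),
]

def best_state(predicted_idle_s, max_latency_s):
    """Pick the state with the lowest total energy over the predicted idle
    period, among those whose wake-up latency is tolerable."""
    best, best_energy = None, float("inf")
    for name, power, wake_e, wake_lat in SLEEP_STATES:
        if wake_lat > max_latency_s:
            continue                      # waking would miss the deadline
        energy = power * predicted_idle_s + wake_e
        if energy < best_energy:
            best, best_energy = name, energy
    return best
```

Since the idle time is not known in advance, a real DPM policy feeds a predicted idle time (from the stochastic analysis mentioned above) into such a selection rule.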
Energy can also be conserved by using Dynamic Voltage Scaling (DVS) [44, 46]. DVS minimizes idle
processor cycles by using a feedback control system. Energy savings can be obtained by optimizing the
sensor node's performance in the active state, and DVS is an effective tool for achieving this goal. The main
idea behind DVS is to adjust the supply voltage to match the workload. This requires tuning the processor to
deliver the required throughput while avoiding idle cycles. The crux of the problem lies in the fact that future
workloads are nondeterministic, so the efficiency depends on predicting the future workload.
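The energy argument for DVS can be sketched numerically, under the first-order assumption that the supply voltage scales linearly with clock frequency; all constants are illustrative.

```python
# Back-of-the-envelope DVS arithmetic: switching energy per cycle scales with
# V^2, and (to first order) supply voltage scales with clock frequency, so
# stretching a task to its deadline beats racing at full speed and idling.
F_MAX = 100e6   # Hz, maximum clock frequency (hypothetical)
V_MAX = 3.3     # V, supply voltage at F_MAX (hypothetical)
C_EFF = 1e-9    # F, effective switched capacitance per cycle (hypothetical)

def energy(cycles, freq):
    v = V_MAX * freq / F_MAX          # first-order voltage/frequency scaling
    return C_EFF * v * v * cycles     # E ~ C * V^2 per cycle

def dvs_energy(cycles, deadline_s):
    freq = min(F_MAX, cycles / deadline_s)  # just fast enough to finish on time
    return energy(cycles, freq)

full = energy(1e6, F_MAX)        # race at full speed, then sit idle
scaled = dvs_energy(1e6, 0.02)   # 1e6 cycles in 20 ms needs only 50 MHz
# Halving the frequency (and hence the voltage) quarters the switching energy.
```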
Efficient link layer strategies can be used to conserve energy at each sensor node. In Reference 47, the
authors propose to conserve energy by compromising on the quality of the established link. This is
possible by maintaining the bit error rate (BER) just below the user requirements. Different error control
algorithms, such as Bose-Chaudhuri-Hocquenghem (BCH) coding, convolutional coding, and turbo coding,
can be employed for error control; the algorithm with the lowest power consumption that supports the
predetermined BER and latency should be chosen.
Local computation and processing [23, 45] of sensor data in wireless networks can be made highly energy
efficient. Partitioning the computation among multiple sensor nodes and performing it in parallel
permits greater control over latency and conserves energy through frequency scaling and voltage scaling.
Biomedical wireless sensor networks can use power-efficient topologies [48] to save the energy spent
in communication. Biomedical sensor nodes include monitors and implantable devices intended for
long-term placement in the human body, and the topology is predetermined in these sensor networks.
Ayad et al. proposed the Directional Source-Aware routing Protocol (DSAP) for this class of sensor
networks. DSAP incorporates power considerations into routing tables, and the authors explored various
topologies to determine the most energy-efficient topology for biomedical sensor networks.
32.6.5 Clock Synchronization
Some of the communication algorithms for wireless sensor networks proposed in the literature make an inherent assumption that there exists some mechanism through which the local clocks of all the sensor nodes are synchronized. For this assumption to hold, an explicit way of synchronizing the local clocks of all sensor nodes is needed. Apart from the implementation of the communication algorithms, clock synchronization is required for accurate time stamps in cryptographic schemes, for recognizing duplicate detections of the same event from different sensor nodes, for data aggregation algorithms such as beam forming, for the ordering of logged events, and for many other similar applications. In this section, a post facto clock synchronization algorithm proposed by Elson and Estrin [49] is described.
The post facto clock synchronization algorithm discussed here is suitable for applications such as beam forming, duplicate event detection, and other similar localized methods. This algorithm is expected to be implemented on systems similar to the WINS (Wireless Integrated Network Sensors) platform, where a processor has various sleep modes and is capable of powering down high-energy peripherals. Because the sensor node processor can power down a device and power it up only when there is a requirement to sense and transmit data, existing clock synchronization methods for distributed systems are not applicable.
The basic idea behind the post facto clock synchronization algorithm is that for certain applications, such as data fusion and beam forming, it is sufficient to order the events in a localized fashion. In this scheme, the nodes' clocks are normally unsynchronized. When a stimulus arrives (time to sense and transmit data), each node records the stimulus with respect to its local clock. Immediately following this event, a third party broadcasts a synchronization pulse. Every node receiving this pulse normalizes its stimulus time stamp with respect to the broadcast synchronization pulse. It is essential to note that the time elapsed
2006 by Taylor & Francis Group, LLC
Issues and Solutions in Wireless Sensor Networks 32-11
between recording the stimulus and the arrival of the synchronization pulse needs to be measured accurately. For this reason, the algorithm is inappropriate for systems that need to communicate a time stamp over a long distance.
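The ordering step can be sketched as follows; the clock readings are hypothetical, and the sketch ignores propagation delay and receive-time uncertainty, which the full algorithm must bound:

```python
# Minimal sketch of post facto synchronization. Each node keeps an
# unsynchronized local clock; what is compared across nodes is only the
# interval between the stimulus and the shared synchronization pulse.

def normalize(stimulus_local, pulse_local):
    """Time elapsed between stimulus and sync pulse, in the node's own clock."""
    return pulse_local - stimulus_local

# Node A's clock reads 1000.0 at the stimulus; node B's reads 5230.0.
# The absolute readings are meaningless because the clocks are unsynchronized.
elapsed_a = normalize(stimulus_local=1000.0, pulse_local=1012.5)
elapsed_b = normalize(stimulus_local=5230.0, pulse_local=5240.0)

# The node with the larger elapsed time observed the stimulus earlier
# relative to the shared pulse, which is enough to order the two events.
first = "A" if elapsed_a > elapsed_b else "B"
print(first)  # -> A
```

This is why only the interval, not the absolute time stamp, has to be measured accurately at each node.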
32.6.6 Reliability
Reliable transfer of critical sensed data is a very important issue in sensor networks. Reliability can be achieved at the MAC layer, the transport layer, or the application layer. In Reference 50, the authors concluded that reliability at both the MAC and transport layers is important.
In sensor networks, the base station uses the data sensed by different sensors to infer the occurrence of events. Hence, reliable delivery of data from the sensors to the base station is critical. In ESRT [51], the sink maintains an application-specific target reliability value, which depends on the reporting frequency of the sensor nodes. The ESRT protocol adaptively adjusts the reporting frequency of the sensors based on the required reliability.
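A minimal sketch of this adaptive loop follows; the function name and the simple proportional update rule are illustrative, not ESRT's exact state machine, which distinguishes several congestion/reliability regimes:

```python
# Hedged sketch of ESRT-style reporting-frequency adaptation.
# `observed` and `target` are reliability values (fraction of required
# event reports actually received at the sink).
def update_reporting_freq(freq, observed, target, congested):
    if congested:
        return freq * 0.8  # back off to relieve congestion
    if observed < target:
        # under-reliable: ask sensors to report proportionally more often
        return freq * target / max(observed, 1e-9)
    return freq            # target met: leave the frequency alone

f = update_reporting_freq(freq=10.0, observed=0.5, target=1.0, congested=False)
print(f)  # -> 20.0
```

The sink computes the new frequency and broadcasts it to the sensors, closing the feedback loop each reporting interval.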
32.6.7 Sensor Placement and Organization for Coverage and Connectivity
Sensor networks are deployed to perform sensing, monitoring, surveillance, and detection tasks. The area
in which a sensor node can perform its tasks with reasonable accuracy (i.e., the sensor readings have at
least a threshold level of sensing/detection probabilities within that area) is also known as the coverage
area. The union of coverage areas of individual sensor nodes is the coverage area of a sensor network.
The coverage area can be modeled as a circular disk (similar to a sphere in 3D) surrounding a sensor
at the center. The coverage areas can be irregular and can be location dependent due to the obstructions
in the terrain, for example, sensor nodes deployed for indoor applications, in urban and hilly areas [52].
The coverage area may also depend on the target, for example, a bigger target can be detected at a longer
distance than a smaller target [53].
The degree of sensing coverage is a measure of the sensing quality provided by the sensor network in a designated area. The coverage requirement depends on the application. For some applications, covering every location with at least a single sensor node might be sufficient, while other applications might need a higher degree of coverage [54]; for example, to pinpoint the exact location of a target, it might be necessary that every location be monitored by multiple sensor nodes [55]. Covering every location with multiple sensors can provide robustness. Some applications may require preferential coverage of critical points; for example, sensitive areas in the sensor field may require more surveillance/monitoring and should be covered by more sensors than other areas [52]. The coverage requirements can also change with time due to changes in environmental conditions; for example, visibility can vary due to fog or smoke. A low degree of coverage might be sufficient in normal circumstances, but when a critical event is sensed, a high degree of coverage may be desired [54].
It is desirable to achieve the required degree of coverage and robustness with the minimum number
of active sensors so as to minimize the interference and the information redundancy [54, 56]. However,
due to the limited range of the wireless communication, the minimum number of sensors required for
the coverage may not guarantee the connectivity of the resulting sensor network. The network is said to
be connected if any sensor node can communicate with any other sensor node (possibly using other sensor
nodes as intermediate nodes). In some cases, the physical proximity of sensor nodes may neither guarantee
connectivity nor coverage due to the obstacles, such as buildings, walls, and trees. The connectivity of the
sensor nodes also depends on the physical-layer technology used for communication. Some technologies require the transmitter and the receiver to be in line-of-sight, for example, infrared and ultrasound [57].
Maintaining greater connectivity is desirable for good throughput and to avoid network partitioning due
to node failures [54].
The sensor nodes can be deployed randomly or deterministically in the sensor field. Next, we discuss the issues and proposed strategies for the placement and organization of sensor nodes.
32.6.7.1 Sensor Placement for Connectivity and Coverage
When the sensor nodes are deployed deterministically, a good placement strategy can minimize the cost and the energy consumption, thereby increasing the lifetimes of sensor nodes, while guaranteeing the desired level of coverage, connectivity, and robustness [55].
Chakrabarty et al. [55] and Ray et al. [57] have used a framework of identifying codes to determine sensor placements for target location detection. The identifying code problem, in an undirected graph, finds an optimal covering of vertices such that any vertex in the graph can be uniquely identified by the subset of vertices that cover it. If each location in the sensor field is covered by a unique subset of sensors, then the position of a target can be determined from the subset of sensors that observe the target. However, determining the minimum number of sensors that must be deployed for uniquely identifying each position of the target is equivalent to constructing an optimal identifying code, which is an NP-complete problem [57]. Ray et al. [57] have proposed a polynomial-time algorithm to compute irreducible identifying codes such that the resulting codes can tolerate up to a given number of errors in the received identifying code packets while still providing position information.
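The localization idea behind identifying codes can be sketched as a lookup from covering subsets to locations; the three-location layout below is hypothetical:

```python
# Sketch of localization with an identifying code: if each location is
# covered by a distinct subset of sensors, the set of sensors that report
# a target uniquely identifies its location. The layout is illustrative.
coverage = {                       # location -> sensors covering it
    "L1": frozenset({"s1"}),
    "L2": frozenset({"s1", "s2"}),
    "L3": frozenset({"s2", "s3"}),
}

# Valid identifying code: all covering subsets must be distinct.
assert len(set(coverage.values())) == len(coverage)

lookup = {sensors: loc for loc, sensors in coverage.items()}

def locate(reporting_sensors):
    """Map the set of sensors that observed the target to a location."""
    return lookup.get(frozenset(reporting_sensors))

print(locate({"s1", "s2"}))  # -> L2
```

The error tolerance in Reference 57 amounts to making these covering subsets far apart in Hamming distance, so that a few flipped sensor reports still map to the correct location.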
Zou and Chakrabarty [58] have proposed a virtual force algorithm to improve the coverage after an initial random deployment of sensor nodes. Initially, the sensors are deployed randomly in the sensor field. It is assumed that if two sensor nodes are very close to each other (closer than a predefined threshold), they exert (virtual) repulsive forces on each other. If two sensor nodes are very far apart (farther than a predefined threshold), they exert (virtual) attractive forces on each other. Obstacles exert repulsive forces, and areas of preferential coverage exert attractive forces, on a sensor node. The objective is to move sensor nodes from densely concentrated regions to sparsely concentrated regions so as to achieve uniform placement. The sensor nodes do not physically move during the execution of the virtual force algorithm; instead, a sequence of virtual motion paths is determined. After the new positions of the sensors are identified, a one-time movement is carried out to redeploy the sensors at their new positions.
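One iteration of the virtual force computation might be sketched as follows; the thresholds, unit force magnitudes, and step size are illustrative choices, not the parameters used in Reference 58, and obstacle/preferential-area forces are omitted:

```python
import math

# Hedged sketch of the virtual force idea: pairwise repulsive forces when
# nodes are closer than CLOSE, attractive forces when farther than FAR.
CLOSE, FAR, STEP = 1.0, 3.0, 0.1

def virtual_step(nodes):
    """One iteration: return new *virtual* positions (no physical movement)."""
    new = []
    for i, (xi, yi) in enumerate(nodes):
        fx = fy = 0.0
        for j, (xj, yj) in enumerate(nodes):
            if i == j:
                continue
            dx, dy = xi - xj, yi - yj
            d = math.hypot(dx, dy) or 1e-9
            if d < CLOSE:       # too close: unit repulsive force
                fx, fy = fx + dx / d, fy + dy / d
            elif d > FAR:       # too far: unit attractive force
                fx, fy = fx - dx / d, fy - dy / d
        new.append((xi + STEP * fx, yi + STEP * fy))
    return new

nodes = [(0.0, 0.0), (0.5, 0.0), (5.0, 0.0)]
nodes = virtual_step(nodes)  # the clustered pair spreads out and the
                             # distant node is pulled toward the others
```

Iterating `virtual_step` until the forces are small yields the virtual motion path; only then is the one-time physical redeployment performed.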
32.6.7.2 Sensor Organization for Connectivity and Coverage
Sensor networks deployed in enemy territories, inhospitable areas, or disaster-struck areas preclude deterministic placement of sensor nodes [53]. Dispersing a large number of sensor nodes over a sensor field from an airplane is one way to deploy sensor networks in those areas. Since the sensor nodes may be scattered arbitrarily, a very large number of sensor nodes are deployed compared with the number of sensor nodes that would have been deployed if deterministic placement were possible. Therefore, it is advantageous to operate the minimum number of sensor nodes required for sensing coverage and connectivity in the active mode and the remaining nodes in the passive (sleep) mode. The passive nodes can be made active as and when neighboring active nodes deplete their energy or fail, so as to increase the lifetime of the sensor network. When the sensor nodes are deployed randomly, the main challenge is to develop an efficient distributed localized strategy for sensor organization that maximizes the lifetime of the network while guaranteeing the coverage and connectivity of active nodes [54, 56].
Wang et al. [54] have proposed a Coverage Configuration Protocol (CCP), which minimizes the number of active nodes required for coverage and connectivity. CCP assumes that the sensing areas and the transmission areas are circular and obstacle-free. The authors have shown that the set of sensor nodes that covers a convex region is connected if the transmission radius is at least twice the sensing radius. In CCP, each node determines whether it is eligible to become active based on the coverage provided by its active neighbors. It is shown that a set of sensors in a convex region provides the required degree of coverage if (1) all the intersection points between any sensing circles have the required degree of coverage and (2) all the intersection points between any sensing circle and the region's boundary have the required degree of coverage. A sensor node discovers other active sensor nodes, and their locations, within a distance of twice the sensing radius through HELLO messages. It then finds the coverage degree of all the intersection points within its coverage area. A sensor node is not eligible to become active if all the intersection points within its coverage area have the required degree of coverage. If there are no intersection points within its coverage area, then it is ineligible if the required number of active sensors are located at the same position as itself. Each node periodically checks its eligibility, and only eligible
nodes remain active, sense the environment, and communicate with other active nodes. As active nodes deplete their energy, nonactive nodes become eligible and become active to maintain the required degree of coverage.
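The eligibility test can be sketched for the interior case (ignoring the region's boundary, which CCP also checks); the radius, coordinates, and tolerance below are illustrative:

```python
import math

R = 1.0  # sensing radius, assumed equal for all nodes (as in CCP)

def intersections(a, b):
    """Intersection points of the equal-radius sensing circles of a and b."""
    (x1, y1), (x2, y2) = a, b
    d = math.hypot(x2 - x1, y2 - y1)
    if d == 0 or d > 2 * R:
        return []
    h = math.sqrt(R * R - (d / 2) ** 2)      # half-chord length
    mx, my = (x1 + x2) / 2, (y1 + y2) / 2
    ux, uy = (y2 - y1) / d, -(x2 - x1) / d   # unit normal to the segment
    return [(mx + h * ux, my + h * uy), (mx - h * ux, my - h * uy)]

def coverage_degree(p, active):
    """Number of active sensors whose sensing disk contains point p."""
    return sum(math.hypot(p[0] - x, p[1] - y) <= R + 1e-9 for (x, y) in active)

def eligible(node, active, ks=1):
    """Node becomes active iff some intersection point inside its sensing
    area is covered by fewer than ks active sensors."""
    pts = [p for i, a in enumerate(active) for b in active[i + 1:]
           for p in intersections(a, b)
           if math.hypot(p[0] - node[0], p[1] - node[1]) <= R]
    return any(coverage_degree(p, active) < ks for p in pts)

# Two active nodes already 1-cover the area around (0.75, 0): ineligible
# for ks=1, but eligible when the application demands 3-coverage.
print(eligible((0.75, 0.0), [(0.0, 0.0), (1.5, 0.0)], ks=1))  # -> False
```

Raising the required degree `ks` at runtime is how CCP supports the application-dependent coverage requirements discussed above.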
Zhang and Hou [56] have proposed an Optimal Geographical Density Control (OGDC) algorithm, which maintains coverage and connectivity by keeping the minimum number of sensors in the active mode. The idea behind OGDC is similar to that of CCP, that is, if all the intersection points are covered by active sensor nodes, then the entire area is covered. OGDC minimizes the number of active nodes by selecting nodes such that the overlap of the sensing areas of active nodes is minimal. It is shown that to minimize the overlap, the intersection point of two circles should be covered by a third circle such that the centers of the three circles form an equilateral triangle with a side length of √3 r, where r is the sensing radius. To cover the intersection point of two circles, a third node is therefore selected whose position is closest to this optimal position. This process continues until the entire area is covered. All the selected nodes become active, and the nodes not selected go to sleep mode.
32.6.8 Topology Control
The topology of the sensor network is induced by the wireless links connecting the sensor nodes. The wireless connectivity of the nodes depends on many parameters, such as the physical layer technology, propagation conditions, terrain, noise, antenna characteristics, and the transmit power [59]. The topology of the network can be controlled by adjusting the tunable parameters, such as the power levels of the transmitters [59-63]. The topology of the network affects its performance in many ways. A sparse topology can increase the chances of network partitioning due to node failures and can increase the delay. On the other hand, a dense topology can limit the capacity due to limited spatial reuse and can increase the interference and the energy consumption [59]. A distributed localized topology control algorithm that adjusts the tunable parameters to achieve the desired level of performance while minimizing the energy consumption is highly desirable.
Wattenhofer et al. [62, 63] have proposed a two-phase distributed Cone-Based Topology Control (CBTC) algorithm. In the first phase, each node broadcasts a neighbor-discovery message with a small radius and records all the acknowledgments and the directions from which the acknowledgments came. The node continues its neighbor-discovery process by increasing its transmission power (radius) until either it finds at least one neighbor in every cone of angle α centered on that node or it reaches its maximum transmission power. The authors have proved that for α ≤ 5π/6, the algorithm guarantees that the resulting network topology is connected. In the second phase, the algorithm eliminates redundant edges without affecting the minimum-power routes of the network.
Li et al. [60] have proposed an MST (Minimum Spanning Tree)-based topology control algorithm, called Local Minimum Spanning Tree (LMST). In the information exchange phase, each node collects the node ids and positions of all the nodes within its maximum transmission range using HELLO messages. In the topology construction phase, each node independently constructs its local MST using Prim's algorithm. The transmission power needed to reach a node is taken as the cost of an edge to that node. The final topology of the network is derived from all the local MSTs by keeping as neighbors only on-tree nodes that are one hop away. To retain only bidirectional links, either all the unidirectional links are converted into bidirectional links or the unidirectional links are deleted. The authors have proved that the resulting topology preserves the network connectivity and that the node degree of any node is bounded by 6.
32.7 Conclusions
In this chapter, we presented an overview of wireless sensor networks and described some design issues. We discussed various solutions proposed to prolong the lifetime of sensor networks.
Proposed solutions to issues such as MAC-layer design, routing data from the sensor nodes to the base station, power management, location determination, and clock synchronization were discussed.
References
[1] Sohrabi, K. and Pottie, G.J. Performance of a novel self-organization protocol for wireless ad-hoc sensor networks. In Proceedings of the IEEE Vehicular Technology Conference, vol. 2, 1999, pp. 1222-1226.
[2] Min, R., Bhardwaj, M., Cho, Seong-Hwan, Shih, E., Sinha, A., Wang, A., and Chandrakasan, A. Low-power wireless sensor networks. In Proceedings of the 14th International Conference on VLSI Design, 2001, pp. 205-210.
[3] Estrin, D., Girod, L., Pottie, G., and Srivastava, M. Instrumenting the world with wireless sensor networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 2001, pp. 2033-2036.
[4] Pottie, G.J. Wireless sensor networks. In Information Theory Workshop, 1998, pp. 139-140.
[5] Agre, J. and Clare, L. An integrated architecture for cooperative sensing networks. Computer, 33, 106-108, 2000.
[6] Sohrabi, K., Gao, J., Ailawadhi, V., and Pottie, G.J. Protocols for self-organization of a wireless sensor network. IEEE Personal Communications, 7, 16-27, 2000.
[7] Gandham, Shashidhar Rao, Dawande, Milind, Prakash, Ravi, and Venkatesan, S. Energy efficient schemes for wireless sensor networks with multiple mobile stations. IEEE Globecom, 1, 377-381, 2003.
[8] Heinzelman, W., Kulik, J., and Balakrishnan, H. Adaptive protocols for information dissemination in wireless sensor networks. In Proceedings of the Fifth Annual ACM/IEEE International Conference on Mobile Computing and Networking, 1999, pp. 174-185.
[9] Ye, F., Chen, A., Liu, S., and Zhang, L. A scalable solution to minimum cost forwarding in large sensor networks. In Proceedings of the Tenth International Conference on Computer Communications and Networks, 2001, pp. 304-309.
[10] Lindsey, S. and Raghavendra, C.S. PEGASIS: power-efficient gathering in sensor information systems. In Proceedings of the International Conference on Communications, 2001.
[11] Manjeshwar, A. and Agrawal, D.P. TEEN: a routing protocol for enhanced efficiency in wireless sensor networks. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, 2001, pp. 2009-2015.
[12] Gao, J. Analysis of energy consumption for ad hoc wireless sensor networks using the watts-per-meter metric. IPN Progress Report, 42-150, 2002.
[13] Youssef, M.A., Younis, M.F., and Arisha, K.A. A constrained shortest-path energy-aware routing algorithm for wireless sensor networks. In Proceedings of the Wireless Communications and Networking Conference, vol. 2, 2002, pp. 794-799.
[14] Bhardwaj, M., Chandrakasan, A., and Garnett, T. Upper bounds on the lifetime of sensor networks. In Proceedings of the IEEE International Conference on Communications, 2001, pp. 785-790.
[15] Elson, J. and Estrin, D. Time synchronization for wireless sensor networks. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, 2001, pp. 1965-1970.
[16] Ye, Wei, Heidemann, John, and Estrin, Deborah. An energy-efficient MAC protocol for wireless sensor networks. In Proceedings of the IEEE INFOCOM, 2002.
[17] Bharghavan, V., Demers, A., Shenker, S., and Zhang, L. MACAW: a media access protocol for wireless LANs. In Proceedings of the ACM SIGCOMM Conference, 1994.
[18] Singh, S. and Raghavendra, C.S. PAMAS: power aware multi-access protocol with signalling for ad-hoc networks. ACM Computer Communication Review, 28, 5-26, 1998.
[19] Tanenbaum, Andrew S. Computer Networks, 3rd ed., Prentice-Hall Inc., New York, 1996.
[20] Bennett, Frazer, Clarke, David, Evans, Joseph B., Hopper, Andy, Jones, Alan, and Leask, David. Piconet: embedded mobile networking. IEEE Personal Communications, 4, 8-15, 1997.
[21] Estrin, D., Govindan, R., Heidemann, J., and Kumar, S. Next century challenges: scalable coordination in sensor networks. In Proceedings of the Fifth Annual ACM/IEEE International Conference on Mobile Computing and Networking, 1999, pp. 263-270.
[22] Heinzelman, W., Kulik, J., and Balakrishnan, H. Negotiation-based protocols for disseminating information in wireless sensor networks. In Proceedings of the Fifth Annual ACM/IEEE International Conference on Mobile Computing and Networking, 1999.
[23] Heinzelman, W.R., Chandrakasan, A., and Balakrishnan, H. Energy-efficient communication protocol for wireless micro sensor networks. In Proceedings of the 33rd Annual Hawaii International Conference on System Sciences, 2000, pp. 3005-3014.
[24] Menezes, Alfred J., van Oorschot, Paul C., and Vanstone, Scott A. Handbook of Applied Cryptography. CRC Press, Boca Raton, FL, October 1996.
[25] Perrig, Adrian, Szewczyk, Robert, Wen, Victor, Culler, David, and Tygar, J.D. SPINS: security protocols for sensor networks. Wireless Networks Journal, 8, 521-534, 2002.
[26] Undercoffer, Jeffery, Avancha, Sasikanth, Joshi, Anupam, and Pinkston, John. Security for sensor networks. In CADIP Research Symposium, 2002.
[27] Slijepcevic, Sasa, Potkonjak, Miodrag, Tsiatsis, Vlasios, Zimbeck, Scott, and Srivastava, Mani B. On communication security in wireless ad-hoc sensor networks. In Proceedings of the 11th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises, Pittsburgh, PA, June 10-12, 2002.
[28] Liu, Donggang and Ning, Peng. Efficient distribution of key chain commitments for broadcast authentication in distributed sensor networks. In Proceedings of the 10th Annual Network and Distributed System Security Symposium, San Diego, CA, February 2003.
[29] Lazos, Loukas and Poovendran, Radha. Secure broadcast in energy-aware wireless sensor networks. In Proceedings of the IEEE International Symposium on Advances in Wireless Communications, Victoria, BC, Canada, September 23-24, 2002.
[30] Karlof, Chris and Wagner, David. Secure routing in wireless sensor networks: attacks and countermeasures. In Proceedings of the First IEEE International Workshop on Sensor Network Protocols and Applications, May 2003.
[31] Eschenauer, Laurent and Gligor, Virgil D. A key-management scheme for distributed sensor networks. In Proceedings of the ACM Conference on Computer and Communications Security, Washington, DC, 2002, pp. 41-47.
[32] Chan, Haowen, Perrig, Adrian, and Song, Dawn. Random key predistribution schemes for sensor networks. In Proceedings of the 2003 IEEE Symposium on Research in Security and Privacy, 2003.
[33] Carman, D.W., Matt, B.J., and Cirincione, G.H. Energy-efficient and low-latency key management for sensor networks. In Proceedings of the 23rd Army Science Conference, Orlando, FL, December 2-5, 2002.
[34] Law, Yee Wei, Etalle, Sandro, and Hartel, Pieter H. Key management with group-wise pre-deployed keying and secret sharing pre-deployed keying. Centre for Telematics and Information Technology, University of Twente, The Netherlands, Technical report TR-CTIT-02-25, July 2002.
[35] Law, Yee Wei, Corin, Ricardo, Etalle, Sandro, and Hartel, Pieter H. A formally verified decentralized key management architecture for wireless sensor networks. In Proceedings of the 4th IFIP TC6/WG6.8 International Conference on Personal Wireless Communications (PWC), LNCS 2775, Venice, Italy, September 2003, pp. 27-39.
[36] Ray, Saikat, Ungrangsi, Rachanee, De Pellegrini, Francesco, Trachtenberg, Ari, and Starobinski, David. Robust location detection in emergency sensor networks. In Proceedings of the IEEE INFOCOM, 2003.
[37] Hofmann-Wellenhof, B., Lichtenegger, H., and Collins, J. Global Positioning System: Theory and Practice, 4th ed., Springer-Verlag, Heidelberg, 1997.
[38] Priyantha, Nissanka B., Chakraborty, Anit, and Balakrishnan, Hari. The cricket location-support system. In Proceedings of the ACM MOBICOM Conference, Boston, MA, 2000.
[39] Want, Roy, Hopper, Andy, Falcao, Veronica, and Gibbons, Jon. The active badge location system. ACM Transactions on Information Systems, 10, 91-102, 1992.
[40] Bahl, Paramvir and Padmanabhan, Venkata N. RADAR: an in-building RF-based user location and tracking system. In Proceedings of the IEEE INFOCOM Conference, Tel Aviv, Israel, 2000.
[41] Hightower, Jeffrey, Borriello, Gaetano, and Want, Roy. SpotON: an indoor 3D location sensing technology based on RF signal strength. Technical report 2000-020-02, University of Washington, February 2000.
[42] Savarese, C. and Rabaey, J. Locationing in distributed ad-hoc wireless sensor networks. In IEEE Proceedings on Acoustics, Speech, and Signal Processing, 2001, pp. 2037-2040.
[43] Coleri, Sinem, Ergen, Mustafa, and Koo, T. John. Lifetime analysis of a sensor network with hybrid automata modelling. In Proceedings of the ACM WSNA Conference, Atlanta, GA, September 2002.
[44] Sinha, A. and Chandrakasan, A. Dynamic power management in wireless sensor networks. IEEE Design and Test of Computers, 18, 62-74, 2001.
[45] Wang, A. and Chandrakasan, A. Energy efficient system partitioning for distributed wireless sensor networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 2, 2001, pp. 905-908.
[46] Im, C., Kim, Huiseok, and Ha, Soonhoi. Dynamic voltage scheduling technique for low-power multimedia applications using buffers. In Proceedings of the International Symposium on Low Power Electronics and Design, 2001, pp. 34-39.
[47] Shih, E., Calhoun, B.H., Cho, Seong Hwan, and Chandrakasan, A.P. Energy-efficient link layer for wireless micro sensor networks. In Proceedings of the IEEE Computer Society Workshop on VLSI, 2001, pp. 16-21.
[48] Salhieh, A., Weinmann, J., Kochhal, M., and Schwiebert, L. Power efficient topologies for wireless sensor networks. In Proceedings of the International Conference on Parallel Processing, 2001, pp. 156-163.
[49] Elson, J. and Estrin, D. Time synchronization for wireless sensor networks. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, 2001, pp. 1965-1970.
[50] Stann, Fred and Heidemann, John. RMST: reliable data transport in sensor networks. In Proceedings of the IEEE International Workshop on Sensor Net Protocols and Applications, May 2003.
[51] Sankarasubramaniam, Yogesh, Akan, Ozgur B., and Akyildiz, Ian F. ESRT: event-to-sink reliable transport in wireless sensor networks. In ACM MobiHoc, June 2003.
[52] Dhillon, Santpal Singh and Chakrabarty, Krishnendu. Sensor placement for effective coverage and surveillance in distributed sensor networks. In Proceedings of the IEEE Wireless Communications and Networking Conference, March 2003.
[53] Slijepcevic, Sasa and Potkonjak, Miodrag. Power efficient organization of wireless sensor networks. In Proceedings of the IEEE International Conference on Communications, June 2001.
[54] Wang, X., Xing, G., Zhang, Y., Lu, C., Pless, R., and Gill, C. Integrated coverage and connectivity configuration in wireless sensor networks. In Proceedings of the ACM SenSys 2003, November 2003.
[55] Chakrabarty, K., Iyengar, S.S., Qi, H., and Cho, E. Grid coverage for surveillance and target location in distributed sensor networks. IEEE Transactions on Computers, 51(12), 1448-1453, December 2002.
[56] Zhang, Honghai and Hou, Jennifer C. Maintaining sensing coverage and connectivity in large sensor networks. Technical report UIUCDCS-R-2003-2351, University of Illinois at Urbana-Champaign, June 2003.
[57] Ray, Saikat, Ungrangsi, Rachanee, De Pellegrini, Francesco, Trachtenberg, Ari, and Starobinski, David. Robust location detection in emergency sensor networks. In Proceedings of the IEEE INFOCOM, April 2003.
[58] Zou, Yi and Chakrabarty, Krishnendu. Sensor deployment and target localization based on virtual forces. In Proceedings of the IEEE INFOCOM, April 2003.
[59] Ramanathan, Ram and Rosales-Hain, Regina. Topology control of multihop wireless networks using transmit power adjustment. In Proceedings of the IEEE INFOCOM, March 2000.
[60] Li, Ning, Hou, Jennifer C., and Sha, Lui. Design and analysis of an MST-based topology control algorithm. In Proceedings of the IEEE INFOCOM, April 2003.
[61] Liu, Jilei and Li, Baochun. Distributed topology control in wireless sensor networks with asymmetric links. In Proceedings of the IEEE GLOBECOM, December 2003.
[62] Wattenhofer, Roger, Li, Li, Bahl, Paramvir, and Wang, Yi-Min. Distributed topology control for power efficient operation in multihop wireless ad hoc networks. In Proceedings of the IEEE INFOCOM, April 2001.
[63] Li, Li, Halpern, Joseph Y., Bahl, Paramvir, Wang, Yi-Min, and Wattenhofer, Roger. Analysis of a cone-based distributed topology control algorithm for wireless multi-hop networks. In Proceedings of the ACM Symposium on Principles of Distributed Computing, August 2001.
33
Architectures for
Wireless Sensor
Networks
S. Dulman,
S. Chatterjea,
T. Hoffmeijer,
P. Havinga,
and J. Hurink
University of Twente
33.1 Sensor Node Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-2
Mathematical Energy Consumption Model of a Node
33.2 Wireless Sensor Network Architectures . . . . . . . . . . . . . . . . 33-5
Protocol Stack Approach • EYES Project Approach
33.3 Data-Centric Architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-17
Motivation • Architecture Description
33.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-21
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33-21
The vision of ubiquitous computing requires the development of devices and technologies that can be
pervasive without being intrusive. The basic component of such a smart environment will be a small
node with sensing and wireless communications capabilities, able to organize itself flexibly into a network for data collection and delivery. Building such a sensor network presents many significant challenges,
especially at the architectural, protocol, and operating system level.
Although sensor nodes might be equipped with a power supply or energy scavenging means and an
embedded processor that makes them autonomous and self-aware, their functionality and capabilities
will be very limited. Therefore, collaboration between nodes is essential to deliver smart services in
a ubiquitous setting. New algorithms for networking and distributed collaboration need to be developed.
These algorithms will be the key for building self-organizing and collaborative sensor networks that show
emergent behavior and can operate in a challenging environment where nodes move, fail, and energy is
a scarce resource.
The question that arises is how to organize the internal software and hardware components in a manner that will allow them to work properly and be able to adapt dynamically to new environments, requirements, and applications. At the same time, the solution should be general enough to be suited to as many applications as possible. Architecture definition also includes, at the higher level, a global view of the whole network: the topology, the placement of base stations, beacons, etc. are also of interest.
In this chapter, we will present and analyze some of the characteristics of architectures for wireless sensor networks. Then, we will propose a new dataflow-based architecture that allows, as a new feature, the dynamic reconfiguration of the sensor node's software at runtime.
33.1 Sensor Node Architecture
Existing technology already allows the integration of functionality for information gathering, processing, and communication in a tight package or even in a single chip (e.g., Figure 33.1 presents the EYES sensor node [1]). The four basic blocks needed to construct a sensor node are (see Figure 33.2):
Sensor platform. The sensors are the interfaces to the real world. They collect the necessary information
and have to be monitored by the central processing unit (CPU). The platforms may be built in a modular
way such that a variety of sensors can be used in the same network. The utilization of a very wide range of
sensors (monitoring characteristics of the environment, such as light, temperature, air pollution, pressure,
etc.) is envisioned. The sensing unit can also be extended to contain one or more actuation units (e.g., to
give the node the possibility of repositioning itself).
Processing unit. This is the intelligence of the sensor node: it will not only collect the information detected by the sensors but will also communicate with the rest of the network. The level of intelligence in the sensor node will strongly depend on the type of information that is gathered by its sensors and on the way in which the network operates. The sensed information will be preprocessed to reduce the amount of data to be transmitted via the wireless interface. The processing unit will also have to execute networking protocols in order to forward the results of the sensing operation through the network to the requesting user.
Communication interface. This is the link of each node to the sensor network itself. The focus here is on
wireless communication links, in particular radio communication, although visible or infrared
light, ultrasound, and other means of communication have already been used [2]. The radio transceivers
used can usually function in simplex mode only, and can be turned off completely in order to save
energy.
Power source. Owing to the application areas of the sensor networks, autonomy is an important
issue. Sensor nodes are usually equipped with a power supply in the form of one or more batteries.
Current studies focus on reducing the energy consumption by using low-power hardware components
FIGURE 33.1 EYES sensor node. (From EYES. Eyes European project, http://eyes.eu.org. With permission.)
Architectures for Wireless Sensor Networks 33-3
FIGURE 33.2 Sensor node components: sensor platform, processing unit, communication interface, and power source.
and advanced networking and data management algorithms. The use of energy scavenging techniques
might even make it possible for sensor nodes to be self-powered. No matter which
form of power source is used, energy remains a scarce resource, and a series of trade-offs will be employed
during the design phase to minimize its usage.
Sensor networks will be heterogeneous from the point of view of the types of nodes deployed. Moreover,
whether or not any specific sensor node can be considered part of the network depends only on
its correct usage of and participation in the sensor network's suite of protocols, and not on the node's specific
way of implementing software or hardware. An intuitive description given in Reference 3 envisions a sea of
sensor nodes, some of them mobile and some static, occasionally containing tiny isles
of relatively resource-rich devices. Some nodes in the system may execute autonomously (e.g., forming
the backbone of the network by executing network and system services, controlling various information
retrieval and dissemination functions, etc.), while others will have less functionality (e.g., just gathering
data and relaying it to a more powerful node). Thus, from the sensor node architecture point of view,
we can distinguish between several kinds of sensor nodes. A simple approach, yet sufficient in the majority
of cases, is to have two kinds of nodes: high-end sensor nodes (nodes that have plenty of
resources or superior capabilities; the best candidate for such a node would probably be a fully equipped
PDA or even a laptop) and low-end nodes (nodes that have only the basic functionality of the
system and very limited processing capabilities).
The architecture of a sensor node involves two main steps: defining precisely which
functionalities are needed, and deciding how to join them into a coherent sensor node. In other words, sensor node
architecture means defining the exact way in which the selected hardware components connect to each
other, how they communicate, how they interact with the CPU, etc.
A large variety of sensor node architectures have been built up to this moment. As a general design
rule, all of them have targeted the following three objectives: energy efficiency, small size, and low cost.
Energy efficiency is by far the most important design constraint, because the lifetime of a sensor node
depends on its energy consumption. As the typical deployment scenario for sensor networks assumes that
the power supplies of the nodes are limited and not rechargeable, a series of trade-offs need to be made
to decrease the amount of energy consumed. The small size of the nodes makes it possible to deploy
many of them to study a certain phenomenon; the ideal size is suggested by the name of one of the first
research projects in the area: SmartDust [4]. Very cheap sensor nodes will lead to rapid deployment of
such networks and large-scale usage.
33.1.1 Mathematical Energy Consumption Model of a Node
In this section, we present a basic version of an energy model for a sensor node. The aim of the model is
to predict the current energy state of the battery of a sensor node, based on historical data on the use of
the sensor node and an earlier known energy state of the battery.
In general, a sensor node may consist of several components. The main components are: a radio,
a processor, a sensor, a battery, external memory, and periphery (e.g., a voltage regulator, debugging
equipment, or periphery to drive an actuator). In the presented model we consider only the first four
components. The external memory is neglected at this stage of the research, since its use of energy is rather
complex and needs an energy model of its own if the memory is a relevant part of the functional behavior of
the sensor node and not just used for storage. The periphery can be quite different from node to node and,
thus, cannot be integrated in an energy model of a sensor node in a uniform way.
For the battery we assume that the usage of energy by the other components is independent of the
current energy state of the battery. This implies that the reduction of the energy state of the battery
depends only on the actions of the different components. Furthermore, we do not consider a reactivation
of the battery by time or external circumstances. Based on these assumptions, it remains to give models
for the energy consumption of the three components radio, processor, and sensor.
The basis of the model for the energy consumption of a component is the definition of a set S of
possible states s_1, ..., s_k for the component. These states are defined such that the energy consumption
of the component is given by the sum of the energy consumptions within the states s_1, ..., s_k
plus the energy needed to switch between the different states. We assume that the energy consumption
within a state s_j can be measured using a simple index t_j (e.g., execution time or number of instructions)
and that the energy needed to switch between the different states can be calculated on the basis
of a state transition matrix st, where st_ij denotes the number of times the component has switched from
state s_i to state s_j. If now P_j denotes the power needed in state s_j and E_ij denotes the energy
consumption of switching once from state s_i to state s_j, the total energy consumption of the component is
given by

E_consumed = sum_{j=1..k} t_j P_j + sum_{i,j=1..k, i != j} st_ij E_ij    (33.1)
In the following, we describe the state sets S and the indices to measure the energy consumption within
the states for the radio, processor, and sensor:
Radio. For the energy consumption of a radio, four different states need to be distinguished: off, sleep,
receiving, and transmitting. For all four states the energy consumption depends on the time the
radio has been in the state. Thus, for the radio we need to memorize the times the radio has been in the
four states and the 4 × 4 state transition matrix representing the number of times the radio has switched
between the four states.
Processor. In general, for a processor four main states can be identified: off, sleep, idle, and active.
In sleep mode the CPU and most internal peripherals are turned off. It can be awakened only by an external
event (interrupt), upon which the idle state is entered. In idle mode the CPU is still inactive, but now
some peripherals are active, such as the internal clock or timers. Within the active state the CPU and
all peripherals are active; multiple sub-states might be identified here based on clock speeds and
voltages. We assume that the energy consumption depends on the time the processor has been in a certain
state.
Sensor. For a (simple) sensor we assume that only the two states on and off are given and that the energy
consumption within both states can be measured by time. However, if more powerful sensors are used,
it may be necessary to work with more states (similar to the processor or the radio).
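The bookkeeping behind Equation (33.1) can be sketched in a few lines of code. The state names and all numeric values below (times, powers, switching costs) are illustrative assumptions, not measurements from a real transceiver.

```python
# Sketch of Equation (33.1): energy within states plus switching energy.
# t[j]: index (e.g., time) spent in state j; P[j]: power drawn in state j;
# st[i][j]: number of i->j switches; E_switch[i][j]: energy per switch.

def component_energy(t, P, st, E_switch):
    k = len(t)
    within_states = sum(t[j] * P[j] for j in range(k))
    switching = sum(st[i][j] * E_switch[i][j]
                    for i in range(k) for j in range(k) if i != j)
    return within_states + switching

# Example: a radio with states (off, sleep, receive, transmit)
t = [50.0, 30.0, 15.0, 5.0]            # seconds spent in each state
P = [0.0, 2e-5, 1.2e-2, 1.5e-2]        # watts (assumed values)
st = [[0, 3, 0, 0],
      [3, 0, 8, 4],
      [0, 8, 0, 2],
      [0, 4, 2, 0]]                    # switch counts between states
E_switch = [[1e-5 if i != j else 0.0 for j in range(4)] for i in range(4)]
print(component_energy(t, P, st, E_switch))
```

The same function serves for the processor and the sensor; only the number of states and the indices change.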
The energy model for the complete sensor node now consists of the energy models for the three
components radio, processor, and sensor, plus two extra indicators for the battery:
For the battery, only the energy state E_old at a time t_old in the past is given.
For each component, the indices I_j characterizing the energy consumption in state s_j since time t_old,
and the state transition matrix st indicating the transitions since time t_old, are specified.
Based on this information, an estimate of the current energy state of the battery can be calculated by
subtracting from E_old the sum of the consumed energy for each component, estimated on the basis of
Equation (33.1).
Since the energy model gives only an estimate of the remaining energy of the battery, in practice it may
be a good approach to use the energy model only for limited time intervals. If the difference between the
current time and t_old gets larger than a certain threshold, the current energy state of the battery should be
estimated on the basis of measurements or other information available on the energy state, and E_old and
t_old should be replaced by this new estimate and the current time. Furthermore, the indices characterizing
the states and the state transition matrix are reset for all the components of the sensor node.
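The battery-state estimate and the staleness threshold described above can be sketched as follows. The threshold value and the measurement hook are illustrative assumptions.

```python
# Sketch of the battery-state estimate: subtract the per-component
# consumption (from Equation 33.1) from E_old, unless t_old is too far
# in the past, in which case fall back to a fresh measurement and
# restart the bookkeeping from a new reference point.

STALENESS_THRESHOLD = 3600.0   # seconds; illustrative value

def battery_estimate(E_old, t_old, component_energies, now, measure=None):
    """component_energies: E_consumed of each component since t_old."""
    if now - t_old > STALENESS_THRESHOLD and measure is not None:
        return measure(), now          # new reference point (E_old, t_old)
    return E_old - sum(component_energies), t_old

# within the validity window: plain subtraction
E, t0 = battery_estimate(10.0, 0.0, [0.5, 0.2, 0.1], now=100.0)
print(E, t0)
```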
33.2 Wireless Sensor Network Architectures
A sensor network is a very powerful tool when compared to a single sensing device. It consists of a large
number of nodes, equipped with a variety of sensors that are able to monitor different characteristics of
a phenomenon. A dense network of such small devices gives the researcher the opportunity to have
a spatial view over the phenomenon and, at the same time, produces results based on a combination
of various sorts of sensed data.
Each sensor node has two basic operation modes: an initialization phase and an operation phase. But
the network as a whole will function in a smooth way, with the majority of the nodes in the operation
mode and only a subset of nodes in the initialization phase. The two modes of operation for the sensor
nodes have the following characteristics:
Initialization mode. A node can be considered in initialization mode if it tries to integrate itself into the
network and is not performing its routine function. A node can be in initialization mode, for example, at
power-on or when it detects a change in the environment and needs to configure itself. During initialization,
the node can pass through different phases, such as detecting its neighbors and the network topology,
synchronizing with its neighbors, determining its own position, or even performing configuration operations
on its own hardware and software. At a higher abstraction level, a node can be considered in initialization
mode if it tries to determine which services are already present in the network and which services it needs to
provide or can use.
Operation mode. After the initialization phase the node enters a stable state, the regular operation state.
It functions based on the conditions determined in the initialization phase. The node can exit the
operation mode and pass through an initialization mode again if the physical conditions around it, or
the conditions related to the network or to itself, have changed. The operation mode is characterized by
small bursts of node activity (such as reading sensor values, performing computations, or participating
in networking protocols) and periods spent in an energy-saving low-power mode.
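The two-mode behavior described above can be sketched as a small state machine. The class, method names, and triggers are illustrative, not taken from any particular sensor platform.

```python
# Minimal sketch of the two node modes: a node starts in the
# initialization mode, settles into operation, and re-enters
# initialization when its environment changes.

class SensorNode:
    def __init__(self):
        self.mode = "initialization"
        self.neighbors = []

    def discover_neighbors(self):
        return []                  # placeholder: neighbor discovery protocol

    def initialize(self):
        # detect neighbors/topology, synchronize, determine position,
        # configure hardware and software, then enter regular operation
        self.neighbors = self.discover_neighbors()
        self.mode = "operation"

    def operate(self):
        pass                       # short activity burst, then low-power sleep

    def step(self, environment_changed=False):
        if environment_changed:
            self.mode = "initialization"
        if self.mode == "initialization":
            self.initialize()
        else:
            self.operate()

node = SensorNode()
node.step()
print(node.mode)                   # "operation"
```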
33.2.1 Protocol Stack Approach
A first approach to building a wireless sensor network is to use a layered protocol stack as a starting
point, as in the case of traditional computer networks. The main difference between the two kinds of
networks is that some of the blocks needed to build the sensor network span multiple
layers, while others depend on several protocol layers. This characteristic of sensor networks comes from
the fact that they have to provide functionalities that are not present in traditional networks. Figure 33.3
presents an approximate mapping of the main blocks onto the traditional OSI protocol layers.
The authors of Reference 5 propose an architecture based on the five OSI layers together with three
management planes that cut across the whole protocol stack (see Figure 33.4). A brief description
of the layers: (1) the physical layer addresses mainly the hardware details of the wireless
communication mechanism, such as the modulation type, the transmission and receiving techniques, etc.;
(2) the data-link layer is concerned with the Media Access Control (MAC) protocol that manages
communication over the noisy shared channel; (3) the network layer manages routing the data between the
nodes, while the transport layer helps to maintain the data flow; (4) finally, the application layer contains
(very often) only one single user application.
[Figure 33.3: the building blocks (aggregation, clustering, localization, lookup, timing, addressing, security, routing, and collaboration) mapped onto the physical, link, network, transport, and application layers.]
FIGURE 33.3 Relationship between building blocks and OSI layers.
[Figure 33.4: the five protocol layers (application, transport, network, data link, physical) crossed by the power management, mobility management, and task management planes.]
FIGURE 33.4 Protocol stack representation of the architecture. (From Akyildiz, I., Su, W., Sankarasubramaniam, Y., and Cayirci, E. IEEE Communication Magazine, 40, 102-114, 2002. With permission.)
In addition to the five network layers, the three management planes have the following functionality. The
power management plane coordinates the energy consumption inside the sensor node; it can, for example,
based on the available amount of energy, allow the node to take part in certain distributed algorithms
or control the amount of traffic it wants to forward. The mobility management plane manages all
the information regarding the physical neighbors and their movement patterns, as well as the node's own
movement pattern. The task management plane coordinates sensing in a certain region based on the number of nodes
and their placement (in very densely deployed sensor networks, energy might be saved by turning certain
sensors off to reduce the amount of redundant information sensed).
In the following, we give a description of the main building blocks needed to set up a sensor network.
The description follows the OSI model. This should not imply that this is the right structure for these
networks; it is taken only as a reference point:
Physical layer. The physical layer is responsible for the management of the wireless interface. For a given
communication task, it defines a series of characteristics such as: operating frequency, modulation type, data
coding, interface between hardware and software, etc.
The large majority of already-built sensor network prototypes and most of the envisioned application
scenarios assume the use of a radio transceiver as the means of communication. The unlicensed industrial,
scientific, and medical (ISM) band is preferred because it is a free band designed for short-range devices
using low-power radios and requiring low data-transmission rates. The modulation scheme used is another
important parameter to decide upon. Complex modulation schemes might not be preferred because they
require important resources (in the form of energy, memory, and computation power).
In the future, advancements in integrated circuit technology (e.g., ASIC, FPGA) will allow the
use of modulation techniques such as ultrawide band (UWB) or impulse radio (IR), while if the sensor
node is built using off-the-shelf components the choice comes down mainly to schemes such as amplitude
shift keying (ASK) or frequency shift keying (FSK). Based on the modulation type and on the hardware
used, a specific data encoding scheme will be chosen to assure both the synchronization required by
the hardware component and a first level of error correction. At the same time, the data frame will also
include some carefully chosen initial bytes needed for the conditioning of the receiver circuitry and clock
recovery.
It is worth mentioning that the minimum output power required to transmit a radio signal over
a certain distance is directly proportional to the distance raised to a power between two and four (the
coefficient depends on the type of antenna used and its placement relative to the ground, indoor/outdoor
deployment, etc.). Under these conditions, it is more efficient to transmit a signal using a multihop network
composed of short-range radios rather than using a (power-consuming) long-range link [5].
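The path-loss argument above can be checked numerically. With an assumed exponent of three, relaying over n hops of length d/n radiates n^2 times less total power than a single hop of length d; the constant and exponent below are illustrative.

```python
# Sketch of the path-loss argument: required transmit power grows
# roughly as distance**alpha, with alpha between 2 and 4 depending on
# the antenna and the environment.

def single_hop_power(d, alpha=3.0, c=1.0):
    """Minimum output power to cover distance d in one hop."""
    return c * d ** alpha

def multihop_power(d, n_hops, alpha=3.0, c=1.0):
    """Total output power when the same distance is split into n hops."""
    return n_hops * c * (d / n_hops) ** alpha

d = 100.0
print(single_hop_power(d))                         # one long hop
print(multihop_power(d, 4))                        # four short hops
print(single_hop_power(d) / multihop_power(d, 4))  # 16.0 for alpha=3
```

In general the saving factor is n_hops**(alpha - 1), which is why multihop forwarding pays off whenever alpha exceeds one.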
The communication subsystem usually needs a controller hierarchy to create the abstraction for the
other layers in the protocol stack (we are referring to the device hardware characteristics and the strict
timing requirements). If a simple transceiver is used, some of these capabilities will need to be provided
by the main processing unit of the sensor node (this can require a substantial amount of resources for
exact timing, execution synchronization, cross-layer distribution of the received data, etc.). The use of more
advanced specialized communication controllers is not preferred, as they hide important low-level
details of the information.
Data-link layer. The data-link layer is responsible for managing most of the communication tasks within
one hop (both point-to-point and multicasting communication patterns). The main research issues here
are the MAC protocols, the error control strategies, and the power consumption control.
The MAC protocols make communication between several devices over a shared channel possible
by coordinating the sending/receiving actions as a function of time or frequency. Several strategies have
already been studied and implemented for mobile telephony networks and for mobile ad hoc
networks but, unfortunately, none of them is directly applicable. Still, ideas can be borrowed from the
existing standards and applications and new MAC protocols can be derived; this is proven by the
large number of new schemes that target specifically the wireless sensor networks.
As the radio component is probably the main energy consumer in each sensor node, the MAC protocol
must be very efficient. To achieve this, the protocol must, first of all, make use of the power-down state
of the transceiver (turning the radio off) as much as possible, because the energy consumption is negligible
in this state. The most important problem comes from the scheduling of the sleep, receive, and transmit
states. The transitions among these states also need to be taken into account, as they consume energy
and sometimes take large time intervals. Message collisions, overhearing, and idle listening are direct
implications of the scheduling used inside the MAC protocol which, in addition, influences the bandwidth
lost due to the control packet overheads.
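The importance of the power-down state can be illustrated with a small average-power calculation. The power figures below are assumed, order-of-magnitude values for a short-range transceiver, not taken from a specific datasheet.

```python
# Sketch of why duty cycling the radio matters: the radio's average
# power is dominated by the fraction of time it stays out of the
# power-down state.

def average_power(duty_cycle, p_active, p_sleep):
    """Average power for a radio awake a given fraction of the time."""
    return duty_cycle * p_active + (1.0 - duty_cycle) * p_sleep

p_rx, p_off = 1.2e-2, 2e-6      # watts: receiving vs. powered down
always_on = average_power(1.0, p_rx, p_off)
duty_1pct = average_power(0.01, p_rx, p_off)
print(always_on / duty_1pct)    # roughly two orders of magnitude saved
```

The ratio shows why MAC schedules that keep the transceiver asleep most of the time dominate node lifetime, and why the (energy-costly) transitions between states must still be counted.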
A second function of the data-link layer is to perform error control of the received data packets. The
existing techniques include automatic repeat-request (ARQ) and forward error correction (FEC) codes.
The choice of a specific technique comes down to the trade-off between the energy consumed to transmit
redundant information over the channel and the energy and high computation power needed at both the
coder and decoder sides.
Additional functions of the data-link layer are: creating and maintaining a list of the neighbor nodes
(all nodes situated within the direct transmission range of the node in question); extracting and advertising
the source and destination as well as the data content of overheard packets; and supplying information
related to the amount of energy spent on transmitting, receiving, coding, and decoding the packets, the
number of errors detected, the status of the channel, etc.
Network layer. The network layer is responsible for routing the packets inside the sensor network.
It is one of the most studied topics in the area of wireless sensor networks and has received a lot of
attention lately. The main design constraint for this layer is, as in all the previous cases, energy
efficiency.
The main function of wireless sensor networks is to deliver sensed data (or data aggregates) to the base
stations requesting it. The concept of data-centric routing has been used to address this problem in an
energy-efficient manner, minimizing the amount of traffic in the network. In data-centric routing, each
node is assigned a specific task based on the interests of the base stations. In the second phase of the
algorithm, the collected data is sent back to the requesting nodes. Interest dissemination can be done in
two different ways, depending on the expected amount of traffic and the level of events in the sensor network:
the base stations can broadcast the interest to the whole network, or the sensor nodes themselves can
advertise their capabilities and the base stations subscribe to them.
Based on the previous considerations, the network layer needs to be optimized mainly for two operations:
spreading the user queries, generated at one or more base stations, across the whole network, and
then retrieving the sensed data to the requesting node. Individual addressing of each sensor node is not
important in the majority of the applications.
Due to the high density of the sensor nodes, a lot of redundant information will be available inside
the sensor network. Retrieving all this information to a certain base station might easily exceed the
available bandwidth, making the sensor network unusable. The solution to this problem is the data
aggregation technique, which requires each sensor node to inspect the content of the packets it has to
route and to aggregate the contained information, reducing the high redundancy of the multiple sensed
data. This technique has been proven to substantially reduce the overall traffic and to make the sensor
network behave as an instrument for analyzing data rather than just a transport infrastructure for raw
data [6].
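In-network aggregation as described above can be sketched as follows. Using the mean as the aggregate is an illustrative choice (min, max, or counts are equally plausible), and the field names are assumptions.

```python
# Sketch of in-network data aggregation: a relay node combines
# redundant readings from its children into one summary packet instead
# of forwarding each packet separately.

def aggregate(readings):
    """Combine several similar sensed values into a single packet."""
    return {
        "region": readings[0]["region"],
        "count": len(readings),
        "mean": sum(r["value"] for r in readings) / len(readings),
    }

child_packets = [
    {"region": "D", "value": 21.4},
    {"region": "D", "value": 21.6},
    {"region": "D", "value": 21.5},
]
print(aggregate(child_packets))   # one packet forwarded instead of three
```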
Transport layer. This layer arises from the need to connect the wireless sensor network to an external
network, such as the Internet, in order to disseminate its data readings to a larger community [7]. Usually
the protocols needed for such interconnections require significant resources, and they will not be present
in all the sensor nodes. The envisioned scenario is to allow a small subset of nodes to behave as gateways
between the sensor network and some external networks. These nodes will be equipped with superior
resources and computation capabilities, and will be able to run the protocols needed to interconnect the
networks.
Application layer. The application layer usually links the users' applications with the underlying layers
in the protocol stack. Sensor networks are designed to fulfill a single application scenario in each
particular case. The whole protocol stack is designed for a special application, and the whole network is
seen as an instrument. This makes the application layer distributed along the whole protocol stack
rather than appearing explicitly. Still, for the sake of classification, we can consider an explicit application layer
that could have one of the following functionalities [5]: sensor management protocol, task management
and data advertisement protocol, and sensor query and data dissemination protocol.
33.2.2 EYES Project Approach
The approach taken in the EYES project [1] consists of only two key system abstraction layers: the sensor
and networking layer and the distributed services layer (see Figure 33.5). Each layer provides services that
may be spontaneously specified and reconfigured:
1. The sensor and networking layer contains the sensor nodes (the physical sensor and wireless
transmission modules) and the network protocols. Ad hoc routing protocols allow messages to be
forwarded through multiple sensor nodes, taking into account the mobility of nodes and the
dynamic change of topology. Communication protocols must be energy efficient, since sensor
nodes have very limited energy supplies. To provide more efficient dissemination of data, some
sensors may process data streams, and provide replication and caching.
[Figure 33.5: applications built on top of the distributed services layer (information service, lookup service), which rests on the sensors and networking layer.]
FIGURE 33.5 EYES project architecture description.
2. The distributed services layer contains distributed services for supporting mobile sensor applications.
Distributed services coordinate with each other to perform decentralized services. These
distributed servers may be replicated for higher availability, efficiency, and robustness. We have
identified two major services. The lookup service supports mobility, instantiation, and reconfiguration.
The information service deals with aspects of collecting data. This service allows vast quantities
of data to be easily and reliably accessed, manipulated, disseminated, and used in a customized
fashion by applications.
On top of this architecture, applications can be built using the sensor network and distributed services.
Communication in a sensor network is data centric, since the identity of the numerous sensor nodes is not
important; only the sensed data, together with time and location information, counts. The three main
functions of the nodes within a sensor network are directly related to this:
Data discovery. Several classes of sensors will be present in the network: specialized sensors can
monitor climatic parameters (humidity, temperature, etc.), detect motion, provide vision, and so on.
A first step of data preprocessing can also be included in this task.
Data processing and aggregation. This task is directly related to performing distributed computations
on the sensed data and also to aggregating several observations into a single one. The goal of this operation
is the reduction of energy consumption. Data processing contributes to this because, in current
architectures, the transmission of one (raw sensed) data packet costs the equivalent of many thousands of
computation cycles. Data aggregation keeps the overall traffic low by inspecting the contents of the routed
packets and, in general, reducing the redundancy of the data in traffic by combining several similar
packets into a single one.
Data dissemination. This task includes the networking functionality comprising routing, multicasting,
broadcasting, addressing, etc.
The existing network scenarios contain both static and mobile nodes. In some cases, the static nodes can
be considered to form a backbone of the network and are more likely to be preferred in certain distributed
protocols. Both mobile and static nodes will have to perform data dissemination, so the protocols should
be designed to be invariant to node mobility. The particular hardware capabilities of each kind of sensor
node will determine how the previously described tasks will be mapped onto them (in principle, all the
nodes could provide all the previous functionalities). During the initialization phase of the network, the
functionality of every node will be decided based on both the hardware configurations and the particular
environmental conditions.
For a large sensor network to be able to function correctly, a tiered architecture is needed. This means
that nodes will have to organize themselves into clusters based on certain conditions. The nodes in each
cluster will elect a leader, the node best fitted to perform coordination inside the cluster (this can
be, e.g., the node with the highest amount of energy, the node with the most advanced hardware
architecture, or just a random node). The cluster leader will be responsible for scheduling the node
operations, managing the resources and the cluster structure, and maintaining communication with the
other clusters.
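The leader election described above, using the residual-energy criterion, can be sketched as follows. The tie-breaking rule on node IDs is an added assumption to make the election deterministic.

```python
# Sketch of cluster-leader election: pick the node with the highest
# residual energy; on a tie, the higher node id wins.

def elect_leader(cluster):
    """cluster: list of (node_id, residual_energy) pairs."""
    return max(cluster, key=lambda node: (node[1], node[0]))[0]

cluster = [(7, 0.42), (3, 0.91), (12, 0.91), (5, 0.10)]
print(elect_leader(cluster))   # 12: highest energy, tie broken by id
```

In a real network each node would run this comparison over values advertised by its neighbors rather than over a central list, but the criterion is the same.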
We can talk about several types of clusters that can coexist in a single network:
Geographical clustering. The basic mode of organizing the sensor network. The clusters are built based
on geographical proximity: neighboring nodes (nodes that are within transmission range of each other)
organize themselves into groups. This operation can be handled in a completely distributed manner
and is a necessity for the networking protocols to keep working as the network scales up.
Information clustering. The sensor nodes can be grouped into information clusters based on the services
they can provide. This clustering structure belongs to the distributed services layer and is built on top of
the geographical clustering. Nodes using this clustering scheme need not be direct neighbors from the
physical point of view.
Security clustering. An even higher hierarchy appears if security is taken into consideration. Nodes can
be grouped based on their trust levels, or based on the actions they are allowed to perform or the resources
they are allowed to use in the network.
Besides offering increased capabilities to the sensor network, clustering is considered one of the
principal building blocks for sensor networks also from the point of view of energy consumption. The
overhead of the energy spent on creating and organizing the cluster structure is easily recovered in
the long term due to the reduced traffic it leads to.
33.2.2.1 Distributed Services Layer Examples
This section focuses on the distributed services that are required to support applications for wireless sensor
networks. We discuss the requirements of the foundation necessary to run these distributed services and
describe how various research projects approach this problem area from a multitude of perspectives.
A comparison of the projects is also carried out.
One of the primary issues of concern in wireless sensor networks is to ensure that every node in
the network is able to utilize energy in a highly efficient manner, so as to extend the total network
lifetime to a maximum [5, 8, 9]. As such, researchers have been looking at ways to minimize energy
usage at every layer of the network stack, from the physical layer right up to the application
layer.
While there is a wide range of methods that can be employed to reduce energy consumption, architectures
designed for distributed services generally focus on one primary area: how to reduce the amount
of communication required and yet get the main job done without any significant negative impact, by
observing and manipulating the data that flows through the network [6, 10, 11]. This leads us to look at
the problem at hand from a data-centric perspective.
In conventional IP-style communication networks, such as the Internet, nodes are
identified by their end-points and internode communication is layered on an end-to-end delivery service
that is provided within the network. At the communication level, the main focus is to get connected
to a particular node within the network; thus, the addresses of the source and destination nodes are
of paramount importance [12]. The precise data that actually flows through the network is irrelevant
to IP.
Sensor networks, however, have a fundamental difference compared to the conventional communication
networks described above, as they are application-specific networks. Thus, instead of concentrating on
which particular node a certain data message originates from, a greater interest lies in the data message
itself: what is the data in the data message, and what can be done with it? This is where the concept of
a data-centric network architecture comes into play.
As sensor nodes are envisioned to be deployed by the hundreds and potentially even thousands [8],
specific sensor nodes are not usually of any interest (unless of course a particular sensor needs to have
its software patched or a failure needs to be corrected). This means that instead of a sensor network
application asking for the temperature of a particular node with ID 0315, it might pose a query asking,
What is the temperature in sector D of the forest?
Such a framework ensures that the acquired results are not just dependent on a single sensor. Thus
other nodes in sector D can respond to the query even if the node with ID 0315 dies. The outcome is
not only a more robust network but, due to the high density of nodes [13], the user of the network is
also able to obtain results of a higher fidelity (or resolution). Additionally, as nodes within the network
are able to comprehend the meaning of the data passing through them, it is possible for them to carry
out application-specic processing within the network thus resulting in the reduction of data that needs to
be transmitted [14]. In-network processing is particularly important as local computation is significantly
cheaper than radio communication [15].
33.2.2.1.1 Directed Diffusion
Directed Diffusion is one of the pioneering data-centric communication paradigms developed specifically
for wireless sensor networks [6]. Diffusion is based on a publish/subscribe API (application programming
interface), where the details of how published data is delivered to subscribers are hidden from the data
producers (sources) and consumers (sinks). The transmission and arrival of events (interest or data
messages) occur asynchronously. Interests describe tasks that are expressed using a list of attribute-value
pairs as shown below:
// detect location of seagull
type = seagull
// send back results every 20ms
interval = 20ms
// for the next 15 seconds
duration = 15s
// from sensors within rectangle
rect = [-100,100,200,400]
A node that receives a data message sends it to its Filter API, which subsequently performs a matching
operation according to a list of attributes and their corresponding values. If a match is established between
the received data message and the filter residing on the node, the diffusion substrate passes the event to the
appropriate application module. Thus the Filter API is able to influence the data which propagates through
the network from the source to the sink node, as an application module may perform some application-specific
processing on the received event; for example, it may decide to aggregate the data. For instance,
consider a scenario in an environmental monitoring project where the user needs to be notified when the
light intensity in a certain area goes beyond a specified threshold. As the density of deployed nodes may
be very high, it is likely that a large number of sensors would respond to an increase in light intensity
simultaneously. Instead of having every sensor relaying this notification to the user, intermediate nodes in
the region could aggregate the readings from their neighboring nodes and return only the Boolean result,
thus greatly reducing the number of radio transmissions.
Apart from aggregating data by simply suppressing duplicate messages, application-specific filters can
also take advantage of named data to decide how to relay data messages back toward the sink node and
what data to cache in order to route future interest messages in a more intelligent and energy-saving
manner. Filters also help save energy by ensuring that nodes react to incoming events only
if the attribute matching process succeeds.
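The attribute matching described above can be sketched in a few lines. This is an illustrative model, not the actual Filter API of the Directed Diffusion implementation; the attribute names follow the seagull interest shown earlier.

```python
# Sketch of attribute-value matching as a Diffusion-style filter might
# perform it: a message matches a filter if every attribute-value pair
# in the filter appears with the same value in the message.

def matches(filter_attrs: dict, message_attrs: dict) -> bool:
    return all(message_attrs.get(k) == v for k, v in filter_attrs.items())

seagull_filter = {"type": "seagull", "rect": (-100, 100, 200, 400)}
event = {"type": "seagull", "rect": (-100, 100, 200, 400), "intensity": 0.7}

if matches(seagull_filter, event):
    # here the diffusion substrate would hand the event to the
    # application module, which may aggregate or suppress it
    pass
```

Only when the match succeeds does the node spend any further energy processing the event; otherwise the message is simply relayed or dropped.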
Diffusion also supports a more complex form of in-network aggregation. Filters allow nested queries
such that one sensor is able to trigger other sensors in its vicinity if the attribute-value matching operation
is successful. It is not necessary for a user to directly query all the relevant sensors. Instead the user only
queries a certain sensor which in turn eventually queries the other relevant sensors around it if certain
conditions are met. In this case, energy savings are obtained from two aspects. First, since the user may be
geographically distant from the observed phenomenon, the energy spent transmitting data can be reduced
drastically using a triggering sensor. Second, if sampling the triggered (or secondary) sensor consumes
a lot more energy than the triggering (initial) sensor, then energy consumption can be reduced greatly by
reducing the duty cycle of the secondary sensor to only periods when certain conditions are met at the
initial sensor.
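The nested-query triggering described above can be sketched as follows. All names and values here are illustrative; they are not taken from the Diffusion codebase.

```python
# Sketch of a nested query: a cheap triggering sensor is sampled all the
# time, while the costly secondary sensor is powered up only when the
# trigger condition is met, reducing the secondary sensor's duty cycle.

def nested_query(trigger_sample, condition, secondary_sample):
    """Sample the cheap trigger; wake the expensive secondary sensor
    only if the condition holds. Returns the secondary reading, or
    None if the secondary sensor was left asleep."""
    if condition(trigger_sample()):
        return secondary_sample()
    return None

# e.g., a cheap light sensor triggering an expensive acoustic sensor
reading = nested_query(lambda: 850,              # light reading (lux)
                       lambda lux: lux > 500,    # trigger condition
                       lambda: "acoustic data")  # costly sensor, woken now
```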
33.2.2.1.2 COUGAR
Building on the same concept, that processing data within the network results in significant
energy savings, but deviating from the library-based, lower-level approach used by Directed
Diffusion, the COUGAR [10, 16] project envisions the sensor network as an extension of a conventional
database, thus viewing it as a device database system. It makes the usage of the network more user-friendly
by suggesting the use of a high-level declarative language similar to SQL. Using a declarative language
ensures that queries are formulated independently of the physical structure and organization of the sensor
network.
Conventional database systems use a warehousing approach [17] where every sensor that gathers data
from an environment subsequently relays that data back to a central site where this data is then logged for
future processing. While this framework is suitable for historical queries and snapshot queries, it cannot
be used to service long-running queries [17]. For instance, consider the following query:
Retrieve the rainfall level for all sensors in sector A every 30 sec if it is greater than 60 mm.
Using the warehousing approach, every sensor would relay its reading back to a central database every
30 sec regardless of whether it is in sector A or whether its rainfall level reading is greater than 60 mm. Upon
receiving all the readings from the sensors, the database would then carry out the required processing
to extract all the relevant data. The primary problem in this approach is that excessive resources are
consumed at each and every sensor node as large amounts of raw data need to be transmitted through the
network.
As the COUGAR approach is modeled around the concept of a database, the system generally proceeds
as follows. It accepts a query from the user, produces a query execution plan (which contains detailed
instructions of how exactly a query needs to be serviced), executes this plan against the device database
system, and produces the answer. The query optimizer generates a number of query execution plans and
selects the plan that minimizes a given cost function. The cost function is based on two metrics, namely
resource usage (expressed in Joules) and reaction time.
In this case, the COUGAR approach selects the most appropriate query execution plan, which pushes the
selection (rainfall level > 60 mm) onto the sensor nodes. Only the nodes that meet this condition send
their readings back to the central node. Thus just like in Directed Diffusion, the key idea here is to transfer
part of the processing to the nodes themselves, which in turn would reduce the amount of data that needs
to be transmitted.
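The effect of pushing the selection down to the node can be sketched as below. This is an illustrative model of the idea, not actual COUGAR code; the predicate is the rainfall query from the example above.

```python
# Sketch of a pushed-down selection: the node evaluates the predicate
# locally and transmits a reading only when the predicate holds, instead
# of relaying every raw reading to the central database (warehousing).

def node_report(sector: str, rainfall_mm: float):
    """Return the tuple to transmit, or None to stay silent and
    save a radio transmission."""
    if sector == "A" and rainfall_mm > 60:
        return {"sector": sector, "rainfall_mm": rainfall_mm}
    return None  # predicate failed locally: nothing is sent
```

Under the warehousing approach every call would return a tuple to transmit; here the radio is used only for the readings the query actually asks for.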
33.2.2.1.3 TinyDB
Following in the steps of Directed Diffusion and the COUGAR project, TinyDB [11] also advocates the
use of some form of in-network processing to increase the efficiency of the network and thus improve
network lifetime. However, while TinyDB views the sensor network from the database perspective just
like COUGAR, it goes a step further by pushing not only selection operations to the sensor nodes but
also basic aggregation operations that are common in databases, such as MIN, MAX, SUM, COUNT, and
AVERAGE.
Figure 33.6 illustrates the obvious advantage that performing such in-network aggregation operations
has over transmitting just raw data. Without aggregation, every node in the network needs to
transmit not only its own reading but also those of all its children. This not only causes a bottleneck close
to the root node but also results in unequal consumption of energy; that is, the closer a node is to the root
node, the larger the number of messages it needs to transmit, which naturally results in higher energy
consumption. Thus nodes closer to the root node die earlier. Losing nodes closer to the root node can
have disastrous consequences on the network due to network partitioning. Using in-network aggregation,
however, every intermediate node aggregates its own reading with that of its children and eventually
transmits only one combined result.

FIGURE 33.6 The effect of using in-network aggregation (data transmission without versus with in-network aggregation).
Additionally, TinyDB has numerous other features, such as communication scheduling, hypothesis
testing, and acquisitional query processing, which make it one of the most feature-rich distributed query
processing frameworks for wireless sensor networks at the moment.
TinyDB requires users to specify queries injected into the sensor network using an SQL-like language.
This language describes what data needs to be collected and how it should be processed upon collection as
it propagates through the network toward the sink node. The language used by TinyDB differs from
traditional SQL in the sense that its semantics supports queries that are continuous and periodic. For example,
a query could state: Return the temperature reading of all the sensors on Level 4 of the building every
5 min over a period of 10 h. The period of time between every successive sample is known as an epoch
(in this example it is 5 min).
Just like in SQL, TinyDB queries follow the SELECT - FROM - WHERE - GROUPBY - HAVING
format that supports selection, join, projection, aggregation, and grouping. Just like in COUGAR, sensor
data is viewed as a single virtual table with one column per sensor type. Tuples are appended to the table at
every epoch. Epochs also allow computation to be scheduled such that power is minimized. For example,
the following query specifies that each sensor should report its own identifier and temperature readings
once every 60 sec for a duration of 300 sec:
SELECT nodeid, temp
FROM sensors
SAMPLE PERIOD 60s FOR 300s
The virtual table sensors is conceptually an unbounded, continuous data stream of values that contains
one column for every attribute and one row for every possible instant in time. The table is not actually
stored in any device, that is, it is not materialized; sensor nodes only generate the attributes and rows
that are referenced in active queries. Apart from the standard query shown above, TinyDB also supports
event-based queries and lifetime queries [18]. Event-based queries reduce energy consumption by allowing
nodes to remain dormant until some triggering event is detected. Lifetime queries are useful when users
are not particularly interested in the specific rate of incoming readings but more in the required lifetime
of the network. The basic idea is to send out a query saying that sensor readings are required for, say,
60 days. The nodes then decide on the best possible rate at which readings can be sent given the specified
network lifetime.
Queries are disseminated into the network via a routing tree rooted at the base station that is formed
as nodes forward the received query to other nodes in the network. Every parent node can have multiple
child nodes but every child node can only have a single parent node. Every node also keeps track of its
distance from the root node in terms of the number of hops. This form of communication topology is
commonly known as tree-based routing.
Upon receiving a query, each node begins processing it. A special acquisition operator at each node
acquires readings from sensors corresponding to the fields or attributes referenced in the query. Similar to
the concept of nested queries in Directed Diffusion, where sensors with a low sampling cost are sampled
first, TinyDB orders sampling operations and predicates by cost. Consider the following query as an
example, where a user wishes to obtain readings from an accelerometer and a magnetometer provided
certain conditions are met:
SELECT accel, mag
FROM sensors
WHERE accel > c1
AND mag > c2
SAMPLE INTERVAL 1s FOR 60s
Depending on the cost of sampling the accelerometer and the magnetometer sensors, the optimizer
will first sample the cheaper sensor to see if its condition is met. It will only proceed to the more costly
second sensor if the first condition has been met.
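This cost-ordered acquisition can be sketched as follows. The costs, field names, and sample functions are hypothetical; TinyDB's actual optimizer works on query plans, not Python callables.

```python
# Sketch of cost-ordered acquisition: sample the cheaper sensor first
# and short-circuit if its predicate fails, so the expensive sensor is
# never powered up unnecessarily.

def acquire(sensors, predicates):
    """sensors maps a field name to (sampling_cost, sample_fn);
    predicates maps a field name to a boolean test. Returns the
    readings if every predicate passes, else None."""
    readings = {}
    for name in sorted(sensors, key=lambda n: sensors[n][0]):
        value = sensors[name][1]()          # sample, cheapest first
        if not predicates[name](value):
            return None                     # skip the costlier sensors
        readings[name] = value
    return readings

sensors = {"accel": (1.0, lambda: 5.2),     # cheap to sample
           "mag":   (9.0, lambda: 0.8)}     # expensive to sample
predicates = {"accel": lambda v: v > 1.0,   # accel > c1
              "mag":   lambda v: v > 0.5}   # mag > c2
```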
Next we describe how the sampled data is processed within the nodes and is subsequently propagated
up the network toward the root node. Consider the following query:
Report the average temperature of the fourth floor of the building every 30 sec.
To service the above query, the query plan has three operators: a data acquisition operator, a select
operator that checks if the value of floor equals 4, and the aggregate operator that computes the average
temperature from not only the current node but also its children located on the fourth floor. Each sensor
node applies the plan once per epoch, and the data stream produced at the root node is the answer to the
query. The partial computation of averages is represented as {sum, count} pairs, which are merged at each
intermediate node in the query plan to compute a running average as data flows up the tree.
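The {sum, count} representation described above can be sketched in a few lines (the tree shape below is invented for illustration):

```python
# Each intermediate node merges its children's partial state with its
# own reading; only the root divides to obtain the average.

def merge(a, b):
    """Combine two (sum, count) partial aggregates."""
    return (a[0] + b[0], a[1] + b[1])

def leaf(reading):
    return (reading, 1)

# a node with its own reading of 20.0 and two children: one leaf child
# (18.0) and one child that has already merged two readings into (41.0, 2)
partial = merge(merge(leaf(20.0), leaf(18.0)), (41.0, 2))
average_at_root = partial[0] / partial[1]   # divided once, at the root
```

The node transmits the single pair rather than the individual readings, which is what makes the aggregation cheap in radio terms.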
TinyDB uses a slotted scheduling protocol to collect data, in which parent and child nodes receive and send
(respectively) data in the tree-based communication protocol. Each node is assumed to produce exactly
one result per epoch, which must be forwarded all the way to the base station. Every epoch is divided into
a number of fixed-length intervals, the number of which depends on the depth of the tree. The intervals are numbered
in reverse order such that interval 1 is the last interval in the epoch. Every node in the network is assigned
to a specific interval that corresponds to its depth in the routing tree. Thus, for instance, if a particular
node is two hops away from the root node, it is assigned the second interval. During its own interval,
a node performs the necessary computation, transmits its result, and goes back to sleep. In the interval
preceding its own, a node sets its radio to listen mode, collecting results from its child nodes. Thus data
flows up the tree in a staggered manner, eventually reaching the root node during interval 1, as shown in
Figure 33.7.
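The interval assignment just described reduces to a one-line rule, sketched below (a simplified reading of the scheme, ignoring clock synchronization):

```python
# Intervals are numbered in reverse, so a node d hops from the root
# transmits during interval d and listens during interval d + 1 (the
# interval that precedes its own in time); it sleeps otherwise.

def node_schedule(depth_hops: int) -> dict:
    return {"transmit": depth_hops, "listen": depth_hops + 1}
```

For example, a node two hops from the root listens for its children during interval 3 and transmits its merged result during interval 2.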
33.2.2.1.4 Discussion
In this section we compare the various projects described above and highlight some of their
drawbacks. We also mention other work in the literature that has contributed further improvements
to some of these existing projects. Table 33.1 compares some of the key features of the various
projects.
As mentioned earlier, Directed Diffusion was a pioneering project in the sense that it introduced the
fundamental concept of improving network efficiency by processing data within the sensor network.
However, unlike COUGAR and TinyDB, it does not offer a particularly simple interface, flexible naming
system, or any generic aggregation and join operators. Such operators are considered application-specific
operators and must always be coded in a low-level language. A drawback of this approach is that
FIGURE 33.7 Communication scheduling in TinyDB using the slotted approach [19].
TABLE 33.1 Comparison of Data Management Strategies

Feature                  | Directed Diffusion             | COUGAR      | TinyDB
Type                     | Non-database                   | Database    | Database
Platform                 | iPAQ class (Mote class for micro-diffusion) | iPAQ class | Mote class
Query language           | Application specific, dependent on Filter API | SQL-based | SQL-based
Type of in-network aggregation | Suppression of identical data messages from different sources | Selection operators | Selection, aggregation operators and limited optimization
Crosslayer features      | Routing integrated with in-network aggregation | None | Routing integrated with in-network aggregation; communication scheduling also decreases burden on the MAC layer
Caching of data for future routing | Yes | No | Yes
Power saving mechanism while sampling sensors | Yes (nested queries) | None | Yes (acquisitional query processing)
Type of optimization     | None | Centralized | Mostly centralized (metadata is occasionally copied to catalogue)
query optimizers are unable to deal with such user-defined operators, as there are no fixed semantics.
This is because query optimizers are unable to make the necessary cost comparisons between various user-defined
operators. A direct consequence of this is that since the system is not able to handle optimization
tasks autonomously, the arduous responsibility of placement and ordering of operators is placed on the
user. This naturally would be a great hindrance to users of the system (e.g., environmentalists) who are
only concerned with injecting queries into the network and obtaining the results, not figuring out
the intricacies of energy-efficient mechanisms to extend network lifetime!
While the COUGAR project specifically claims to target wireless sensor networks [20, 21], apart from
the feature of pushing down selection operations into the device network, it does not demonstrate any
other novel design characteristics that would allow it to run on sensor networks. In fact, the COUGAR
project has simulations and implementations using Linux-based iPAQ class hardware, which has led its
designers to take certain design decisions that would be unsuitable for sensor networks. For instance, unlike Directed
Diffusion [14] and TinyDB [18], COUGAR does not take the cost incurred by sampling sensors into
consideration during the generation of query execution plans. It also does not take advantage of certain
inherent properties of radio communication, for example, snooping, and also fails to suggest any methods
that could link queries to communication scheduling. Additionally, the usage of XML to encode messages
and tuples makes it inappropriate for sensor networks given their limited bandwidth and high cost of
transmission per bit.
Among the various query processing systems currently described in the literature, TinyDB seems to be the
one that is the most feature-packed. The TinyDB software has been deployed using Mica2 motes in the
Berkeley Botanical Garden to monitor the microclimate in the garden's redwood grove [22]. However,
the initial deployment only relays raw readings and does not currently make use of any of the aggregation
techniques introduced in the TinyDB literature. While it may have approached the problem of improving
energy efficiency from several angles, it does have a number of inherent drawbacks, the most significant
being the lack of adaptability. First, the communication scheduling mentioned above is highly dependent
on the depth of the network, which is assumed to be fixed. This makes it unable to react on the fly to
changes in the network topology, which could easily happen if new nodes are added or certain nodes
die. Second, the communication scheduling is also directly dependent on the epoch that is specified in
every query injected into the network. With networks expected to span hundreds or even thousands
of nodes, it is unlikely that environmentalists using a particular network would only inject one query
into the network at any one time. Imagine if the Internet were designed in a way such that only one person
was allowed to use it at any instant! Thus methods need to be devised to enable multiple queries to run
simultaneously in a sensor network.
Although TinyDB greatly reduces the number of transmissions by carrying out in-network aggregation
for every long-running query, it keeps on transmitting data during the entire duration of the active query,
disregarding the temporal correlation in a sequence of sensor readings. Reference 23 takes advantage of
this property and ensures that nodes only transmit data when there is a significant enough change between
successive readings. In other words, sensors may refrain from transmitting data if the readings remain
constant.
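The suppression idea can be sketched as below. This is an illustrative model of the general technique, not the specific algorithm of Reference 23; the threshold value is arbitrary.

```python
# Exploit temporal correlation: transmit a reading only if it differs
# from the last transmitted value by more than a threshold, so constant
# readings generate no radio traffic.

class ChangeFilter:
    def __init__(self, threshold: float):
        self.threshold = threshold
        self.last_sent = None

    def should_transmit(self, reading: float) -> bool:
        if self.last_sent is None or abs(reading - self.last_sent) > self.threshold:
            self.last_sent = reading   # remember what the sink last saw
            return True
        return False
```

A sink that assumes the last received value when no update arrives reconstructs the series within the threshold error.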
Another area related to the lack of adaptability affecting both COUGAR and TinyDB has to do with
the generation of query execution plans. In both projects the systems assume a global view of the network
when it comes to query optimization. Thus network metadata is periodically copied from every node
within the network to the root node. This information is subsequently used to work out the best possible
query optimization plan. Obviously, the cost of extracting network metadata from every node is highly
prohibitive. Also query execution plans generated centrally may be outdated by the time they reach the
designated nodes as conditions in a sensor network can be highly volatile, for example, the node delegated
to carry out a certain task may have run out of power and died by the time instructions arrive from the
root node. In this regard, it is necessary to investigate methods where query optimizations are carried
out using only local information. While such plans may not be as optimal as those generated from global
network metadata, they will result in significant savings in terms of the number of radio transmissions.
Reference 24 looks into creating an adaptive and decentralized algorithm that places operators optimally
within a sensor network. However, the preliminary simulation results are questionable since the overhead
incurred during the neighbor exploration phase is not considered. Also, there is no mention of how fast the
algorithm responds to changes in network dynamics.
33.3 Data-Centric Architecture
As we previously stated, the layered protocol stack description of the system architecture for a sensing
node cannot cover all the aspects involved (such as crosslayer communication, dynamic update, etc.).
In this section we address the problem of describing the system architecture in a more suitable way and
its implications for the application design.
33.3.1 Motivation
Sensor networks are dynamic from many points of view. Continuously changing behaviors can be
noticed in several aspects of sensor networks, some of them being:
Sensing process. The natural environment is dynamic by all means (the basic purpose of sensor networks
is to detect, measure, and alert the user of changes in its parameters). The sensor modules themselves
can become less accurate, need calibration, or even break down.
Network topology. One of the features of sensor networks is their continuously changing topology.
There are many factors contributing to this, such as failures of nodes, the unreliable communication
channel, mobility of the nodes, variations of the transmission ranges, cluster reconfiguration, and
addition/removal of sensor nodes. Related to this aspect, the algorithms designed for sensor networks
need to have two main characteristics: they need to be independent of the network topology and need to
scale well with the network size.
Available services. Mobility of nodes, failures, or availability of certain kinds of nodes might trigger
reconfigurations inside the sensor network. The functionality of nodes may depend on services existing
at certain moments; when they are no longer available, the nodes will either reconfigure themselves or
try to provide those services themselves.
Network structure. New kinds of nodes may be added to the network. Their different and increased
capabilities will bring changes to the regular way in which the network functions. Software modules
might be improved or completely new software functionality might be implemented and deployed in the
sensor nodes.
Most wireless sensor network architectures currently use a fixed layered structure for the protocol stack in
each node. This approach has certain disadvantages for wireless sensor networks. Some of them are:
Dynamic environment. Sensor nodes address a dynamic environment where nodes have to reconfigure
themselves to adapt to the changes. Since resources are very limited, reconfiguration is also needed
in order to establish an efficient system (a totally new functionality might have to be used if energy
levels drop under certain values). The network can adapt its functionality to a new situation, in order
to lower the use of the scarce energy and memory resources, while maintaining the integrity of its
operation.
Error control. Error control normally resides in all protocol layers, so that the worst-case scenario is
covered at every layer. For a wireless sensor network this redundancy might be too expensive. Adopting a central view
of how error control is performed, together with crosslayer design, will reduce the resources spent on error control.
Power control. It is traditionally done only at the physical layer, but since energy consumption in sensor
nodes is a major design constraint, it is found in all layers (physical, data-link, network, transport, and
application layers).
Protocol place in the sensor node architecture. An issue arises when trying to place certain layers in
the protocol stack. Examples may include timing and synchronization, localization, and calibration.
These protocols might shift their place in the protocol stack as soon as their transient phase is over. The
data produced by some of these algorithms might make a different protocol stack more suited for the sensor
node (e.g., a localization algorithm for static sensor networks might enable a better routing algorithm that
uses information about the location of the routed data's destination).
Protocol availability. New protocols might become available after the network deployment. At certain
moments, in specific conditions, some of the sensor nodes might use a different protocol stack that better
suits their goal and the environment.
It is clear from these examples that dynamic reconfiguration of each protocol, as well as dynamic
reconfiguration of the active protocol stack, is needed.
33.3.2 Architecture Description
The system we are trying to model is an event-driven system, meaning that it reacts to and processes
incoming events and afterwards, in the absence of these stimuli, it spends its time in the sleep state (the
software components running inside the sensor node are not allowed to perform blocking waits).
Let us name a higher level of abstraction for the event class data. Data may encapsulate the information
provided by one or more events, have a unique name, and contain additional information such as deadlines,
identity of producer, etc. Data will be the means used by the internal mechanisms of the architecture to
exchange information between components.
In the following we will refer to any protocol or algorithm that can run inside a sensor node with the
term entity (see Figure 33.8). An entity is a software component that will be triggered by the availability
of one or more data types. While running, each entity is allowed to read available data types (but not wait
for additional data types to become available). As a result of its processing, each software component can
produce one or more types of data (usually on exit).
An entity is also characterized by some functionality, meaning the sort of operation it can perform
on the input data. Based on their functionality, the entities can be classified as being part of a certain
protocol layer as in the previous description. For one given functionality, several entities might exist inside
a sensor node; to discern among them, one should take into consideration their capabilities. By capability
FIGURE 33.8 Entity description.
we understand a high-level description containing the cost for a specific entity to perform its functionality
(in terms of energy, resources, time, etc.) and some characteristics indicating the estimated performance and
quality of the algorithm.
In order for a set of components to work together, the way in which they have to be interconnected
should be specified. The architectures existing up to this moment in the wireless sensor network field
assume a fixed way in which these components can be connected, which is defined at compile time
(except for the architectures that, e.g., allow execution of agents). To change the protocol stack in such an
architecture, the user should download the whole compiled code into the sensor node (via the wireless
interface) and then make use of some boot code to replace the old running code. In the proposed
architecture we allow this interconnection to be changed at runtime, thus making possible online updates
of the code, the selection of a more suitable entity to perform some functionality based on
changes in the environment, etc. (in one word, allowing the architecture to become dynamically
reconfigurable).
To make this mechanism work, a new entity needs to be implemented; let us call it the data manager.
The data manager will monitor the different kinds of data available and will coordinate the dataflow
inside the sensor node. At the same time it will select the most fitting entities to perform the work, and it
will even be allowed to change the whole functionality of the sensor node based on the available entities
and the external environment (see Figure 33.9).
The implementation of these concepts cannot ignore the small amount of resources
each sensor node has (in terms of energy, memory, computation power, etc.). Going down from the abstraction
level to the point where the device is actually working, a compulsory step is implementing the envisioned
architecture in a particular operating system (in this case a better term may be system software).
A large range of operating systems exist for embedded systems in general [25, 26]. Scaled-down versions
with simple schedulers and limited functionality have been developed especially for wireless sensor
networks [27].
Usually, the issues of system architecture and operating system are treated separately, both of them trying
to be as general as possible and to cover all the possible application cases. A simplistic view of a running
operating system is a scheduler that manages the available resources and coordinates the execution of a
set of tasks. This operation is centralized from the point of view of the scheduler, which is allowed to take
all the decisions. Our architecture can also be regarded as a centralized system, with the data manager
coordinating the dataflow of the other entities. To obtain the smallest overhead possible, there should
be a correlation between the function of the central nucleus of our architecture and the function of
the scheduler of the operating system. This is why we propose a close relationship between the two
concepts, extending the functionality of the scheduler with the functionality of the data manager.
The main challenges that arise are keeping the code size and the context-switching time low.
33.3.2.1 Requirements
As mentioned earlier, the general concept of data is used rather than that of events. For decisions
based on data to work, some additional requirements must be met.
First of all, every module needs to declare the name of the data that will trigger its action, the names of
the data it will read during its action (this can generically incorporate all the shared resources
in the system), and the names of the data it will produce. The scheduler needs all this information to
make its decisions.
From the point of view of the operating system, a new component that takes care of all the data
exchange needs to be implemented. This would in fact be an extended message-passing mechanism, with
the added feature of notifying the scheduler when new data types become available. In the architecture,
this module maps onto the constraint that protocols send and receive data through, for example,
a publish/subscribe mechanism to the central scheduler.
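As an illustration, the interplay between module data declarations, the extended message-passing component, and the data-centric scheduler might be sketched as follows. All class and field names here are hypothetical assumptions, not part of the concrete design discussed in the chapter:

```python
# Hypothetical sketch of a data-centric scheduler; the classes, fields,
# and publish() API are illustrative assumptions.

class Module:
    def __init__(self, name, triggers, reads, produces, action):
        self.name = name
        self.triggers = set(triggers)   # data names that activate the module
        self.reads = set(reads)         # data (shared resources) read during the action
        self.produces = set(produces)   # data names the module may output
        self.action = action            # callable(store) -> dict of produced data

class DataScheduler:
    """Scheduler extended with the data-manager role: it tracks which
    data types are available and dispatches the modules they trigger."""

    def __init__(self):
        self.modules = []
        self.store = {}   # data name -> latest published value
        self.ready = []   # modules whose trigger data has arrived

    def register(self, module):
        # Modules declare their trigger/read/produce data up front, so
        # the scheduler has the information it needs to take decisions.
        self.modules.append(module)

    def publish(self, name, value):
        # Extended message passing: store the data and notify the
        # scheduler that a new data item of this type is available.
        self.store[name] = value
        for m in self.modules:
            if name in m.triggers and m not in self.ready:
                self.ready.append(m)

    def run(self):
        # Dispatch triggered modules; anything they produce is published
        # and may in turn trigger further modules.
        log = []
        while self.ready:
            m = self.ready.pop(0)
            log.append(m.name)
            for out_name, out_value in m.action(self.store).items():
                self.publish(out_name, out_value)
        return log
```

For instance, a filter module triggered by raw sensor readings can publish filtered values that in turn trigger a reporting module; the scheduler, not the modules, decides the order of execution, which is what lets it also swap in a different entity for the same data type.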
An efficient naming system for the entities and the data is needed. Downloading new entities to a sensor
node involves issues similar to service discovery. Several entities with the same functionality but with
2006 by Taylor & Francis Group, LLC
FIGURE 33.9 Architecture transitions.
Architectures for Wireless Sensor Networks 33-21
different requirements and capabilities might coexist. The data-centric scheduler has to decide
which one is best.
33.3.2.2 Extension of the Architecture
The architecture presented earlier can be extended to groups of sensor nodes. Several data-centric
schedulers, together with a small, fixed number of protocols, can communicate with each other and form
a virtual backbone of the network.
Entities running inside sensor nodes can be activated by data types that become available at other
sensor nodes (e.g., imagine one node using its neighbor's routing entity because it needs the memory to
process some other data).
Of course, this approach raises new challenges. A naming system for the functionalities and data
types, and the reliability of the system (with respect to factors such as mobility, communication failures,
node failures, and security attacks) are just a few examples. Related work on these topics already exists
(e.g., References 28 and 29).
33.4 Conclusion
In this chapter, we have outlined the characteristics of wireless sensor networks from an architectural
point of view. As sensor networks are designed for specific applications, there is no precise architecture
that fits them all, but rather a common set of characteristics that can be taken as a starting
point.
The combination of the data-centric features of sensor networks and the need for a dynamically
reconfigurable structure has led to a new architecture that provides enhanced capabilities compared with
existing ones. The characteristics and implementation issues of the new architecture have been discussed,
laying the foundations for future work.
This area of research is currently in its infancy, and major steps are required in the fields of
communication protocols, data processing, and application support to make the vision of Mark Weiser
a reality.
References
[1] EYES. The EYES European project, http://eyes.eu.org.
[2] Chu, P., Lo, N.R., Berg, E., and Pister, K.S.J. Optical communication using micro corner cube
reflectors. In Proceedings of MEMS'97. IEEE, Nagoya, Japan, 1997, pp. 350–355.
[3] Havinga, P. et al. EYES deliverable 1.1: system architecture specification.
[4] SmartDust. http://robotics.eecs.berkeley.edu/~pister/SmartDust.
[5] Akyildiz, I., Su, W., Sankarasubramaniam, Y., and Cayirci, E. A survey on sensor networks. IEEE
Communications Magazine, 40(8), 102–114, 2002.
[6] Intanagonwiwat, C., Govindan, R., Estrin, D., Heidemann, J., and Silva, F. Directed diffusion for
wireless sensor networks. IEEE/ACM Transactions on Networking, 11(1), 2–16, 2003.
[7] Pottie, G.J. and Kaiser, W.J. Embedding the internet: wireless integrated network sensors.
Communications of the ACM, 43(5), 51–58, 2000.
[8] Ganesan, D., Cerpa, A., Ye, W., Yu, Y., Zhao, J., and Estrin, D. Networking issues in wireless sensor
networks. Journal of Parallel and Distributed Computing, Special Issue on Frontiers in Distributed
Sensor Networks, 64(7), 799–814, 2004.
[9] Estrin, D., Govindan, R., Heidemann, J.S., and Kumar, S. Next century challenges: scalable
coordination in sensor networks. Mobile Computing and Networking. IEEE, Seattle, Washington, USA,
1999, pp. 263–270.
[10] Bonnet, P., Gehrke, J., and Seshadri, P. Towards sensor database systems. In Proceedings of the
Second International Conference on Mobile Data Management. Springer-Verlag, Heidelberg, 2001,
pp. 3–14.
[11] Madden, S., Szewczyk, R., Franklin, M., and Culler, D. Supporting aggregate queries over ad-hoc
wireless sensor networks. In Proceedings of the Fourth IEEE Workshop on Mobile Computing
Systems and Applications. IEEE, 2002.
[12] Postel, J. Internet Protocol, RFC 791, 1981.
[13] Estrin, D., Girod, L., Pottie, G., and Srivastava, M. Instrumenting the world with wireless sensor
networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing.
IEEE, Salt Lake City, Utah, 2001.
[14] Heidemann, J.S., Silva, F., Intanagonwiwat, C., Govindan, R., Estrin, D., and Ganesan, D. Building
efficient wireless sensor networks with low-level naming. In Symposium on Operating Systems
Principles. ACM, 2001, pp. 146–159.
[15] Pottie, G.J. and Kaiser, W.J. Wireless integrated network sensors. Communications of the ACM,
43(5), 51–58, 2000.
[16] Bonnet, P. and Seshadri, P. Device database systems. In Proceedings of the International Conference
on Data Engineering. IEEE, San Diego, CA, 2000.
[17] Bonnet, P., Gehrke, J., and Seshadri, P. Querying the physical world. IEEE Personal Communications,
7, 10–15, 2000.
[18] Madden, S., Franklin, M.J., Hellerstein, J.M., and Hong, W. The design of an acquisitional query
processor for sensor networks. Proceedings of the 2003 ACM SIGMOD International Conference on
Management of Data. ACM Press, San Diego, CA, 2003, pp. 491–502.
[19] Madden, S. The design and evaluation of a query processing architecture for sensor networks.
PhD thesis, University of California, Berkeley, 2003.
[20] Yao, Y. and Gehrke, J. The Cougar approach to in-network query processing in sensor networks.
SIGMOD Record, 31(3), 2002.
[21] Yao, Y. and Gehrke, J. Query processing for sensor networks. In Proceedings of the Conference on
Innovative Data Systems Research. Asilomar, CA, 2003.
[22] Gehrke, J. and Madden, S. Query processing in sensor networks. Pervasive Computing. IEEE, 2004,
pp. 46–55.
[23] Beaver, J., Sharaf, M.A., Labrinidis, A., and Chrysanthis, P.K. Power aware in-network
query processing for sensor data. In Proceedings of the Second Hellenic Data Management
Symposium. Athens, Greece, 2003.
[24] Bonfils, B.J. and Bonnet, P. Adaptive and decentralized operator placement for in-network query
processing. In Proceedings of the Second International Workshop on Information Processing in
Sensor Networks (IPSN), Vol. 2634 of Lecture Notes in Computer Science. Springer-Verlag, Berlin,
Heidelberg, 2003, pp. 47–62.
[25] VxWorks. Wind River, http://www.windriver.com.
[26] Salvo. Pumpkin Incorporated, http://www.pumpkininc.com.
[27] Hill, J., Szewczyk, R., Woo, A., Hollar, S., Culler, D.E., and Pister, K.S.J. System architecture
directions for networked sensors. In Architectural Support for Programming Languages and Operating
Systems, 2000, pp. 93–104.
[28] Verissimo, P. and Casimiro, A. Event-driven support of real-time sentient objects. In Proceedings of
the Eighth IEEE International Workshop on Object-Oriented Real-Time Dependable Systems. IEEE,
Guadalajara, Mexico, 2003.
[29] Cheong, E., Liebman, J., Liu, J., and Zhao, F. TinyGALS: a programming model for event-driven
embedded systems. In Proceedings of the 2003 ACM Symposium on Applied Computing. ACM Press,
Melbourne, Florida, 2003, pp. 698–704.
34
Energy-Efficient Medium Access Control
Koen Langendoen and Gertjan Halkes
Delft University of Technology
34.1 Introduction .............................................. 34-1
      Contention-Based Medium Access • Schedule-Based Medium Access
34.2 Requirements for Sensor Networks .......................... 34-5
      Hardware Characteristics • Communication Patterns •
      Miscellaneous Services
34.3 Energy Efficiency ......................................... 34-7
      Sources of Overhead • Trade-Offs
34.4 Contention-Based Protocols ............................... 34-11
      IEEE 802.11 • LPL and Preamble Sampling • WiseMAC
34.5 Slotted Protocols ........................................ 34-12
      Sensor-MAC • Timeout-MAC • Data-Gathering MAC
34.6 TDMA-Based Protocols ..................................... 34-14
      Lightweight Medium Access
34.7 Comparison ............................................... 34-17
      Simulation Framework • Micro-Benchmarks • Homogeneous
      Unicast and Broadcast • Local Gossip • Convergecast • Discussion
34.8 Conclusions .............................................. 34-27
Acknowledgments .............................................. 34-27
References ................................................... 34-28
34.1 Introduction
Managing wireless communication will be the key to the effective deployment of large-scale sensor networks
that need to operate for years. On the one hand, wireless communication is essential (1) to foster
collaboration between neighboring sensor nodes to help overcome the inherent limitations of their cheap, and
hence inaccurate, sensors observing physical events, and (2) to report those events back to a sink node
connected to the wired world. On the other hand, wireless communication consumes a lot of energy, is
error prone, and has limited range, forcing many nodes to participate in relaying information, all of which
severely limits the lifetime of the (unattended) sensor network. In typical sensor nodes, such as the Mica2
mote, communicating one bit of information consumes as much energy as executing several hundred
FIGURE 34.1 Network protocol stack: the MAC protocol is part of the data link layer (layer 2), which sits between the network layer (layer 3) and the physical layer (layer 1).
instructions. Therefore, one should think twice before actually transmitting a message. Nevertheless,
whenever a message must be sent, the protocol stack should operate as efficiently as possible. In this chapter,
we will study the medium access layer, which is part of the data link layer (layer 2 of the OSI model) and
sits directly on top of the physical layer (layer 1) (see Figure 34.1). Since the medium access layer controls
the radio, it has a large impact on the overall energy consumption and, hence, the lifetime of a node.
A Medium Access Control (MAC) protocol decides when competing nodes may access the shared
medium, that is, the radio channel, and tries to ensure that no two nodes interfere with each other's
transmissions. In the unfortunate event of a collision, a MAC protocol may deal with it through some
contention-resolution algorithm, for example, by resending the message later at a randomly selected
time. Alternatively, the MAC protocol may simply discard the message and leave the retransmission,
if any, up to the higher layers in the protocol stack. MAC protocols for wireless networks have been
studied since the 1970s, but the successful introduction of wireless LANs (WLANs) in the late 1990s has
accelerated the pace of development; the recent survey by Jurdak et al. [1] reports an exponential growth
in new MAC protocols. We will now provide a brief historical perspective on the evolution of MAC, and
describe the two major approaches, contention-based and schedule-based, regularly used in wireless
communication systems. Readers familiar with medium access in wireless networks may proceed directly to
Section 34.2.
34.1.1 Contention-Based Medium Access
In the classic (pure) ALOHA protocol [2], developed for packet radio networks in the 1970s, a node simply
transmits a packet as soon as it is generated. If no other node is sending at the same time, the data transmission
succeeds and the receiver responds with an acknowledgment. In the case of a collision, no acknowledgment
will be generated, and the sender retries after a random period. The price to be paid for ALOHA's
simplicity is its poor use of the channel capacity; the maximum throughput of the ALOHA protocol is
only 18% [2]. However, a minor modification to ALOHA can increase the channel utilization considerably.
In slotted ALOHA, time is divided into slots, and nodes may only start transmitting at the beginning of a slot. This
organization halves the probability of a collision and raises the channel utilization to around 35% [3].
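The quoted utilization figures follow from the classic throughput analysis, where the throughput S is a function of the offered load G: S = G·e^(−2G) for pure ALOHA and S = G·e^(−G) for slotted ALOHA. A small sketch of this arithmetic:

```python
import math

def pure_aloha_throughput(g):
    # Pure ALOHA: a frame succeeds only if no other frame starts within
    # a two-frame vulnerability window, giving S = G * exp(-2G).
    return g * math.exp(-2 * g)

def slotted_aloha_throughput(g):
    # Slotted ALOHA: transmissions are aligned to slot boundaries, which
    # halves the vulnerability window, giving S = G * exp(-G).
    return g * math.exp(-g)

# Pure ALOHA peaks at G = 0.5 with S = 1/(2e), about 18%; slotted ALOHA
# peaks at G = 1.0 with S = 1/e, about 37% (the "around 35%" above).
peak_pure = pure_aloha_throughput(0.5)
peak_slotted = slotted_aloha_throughput(1.0)
```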
34.1.1.1 Carrier Sense Multiple Access
Instead of curing the effects (retransmissions) after the fact, it is often much better to remove the root of
the problem (collisions). The Carrier Sense Multiple Access (CSMA) protocol [4], originally introduced by
Kleinrock and Tobagi in 1975, tries to do just that. Before transmitting a packet, a node first listens to the
channel for a small period of time. If it does not sense any traffic, it assumes that the channel is clear and
starts transmitting the packet. Since it takes some time to switch the radio from receive mode to transmit
mode, the CSMA method is not bulletproof and collisions can still occur. In practice, however, CSMA-style
MAC protocols can achieve a maximum channel utilization on the order of 50 to 80%, depending on the
exact access policy [4].
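A minimal sketch of the listen-before-transmit rule, including the receive-to-transmit turnaround loophole just mentioned. The toy channel model and time units are assumptions for illustration only:

```python
# Toy model of carrier sensing; a real MAC driver would work against
# radio hardware, not this list of intervals.

class ToyChannel:
    def __init__(self):
        self.transmissions = []   # (start, end) intervals of frames on the air

    def busy(self, t):
        # Carrier sense: is any transmission on the air at time t?
        return any(start <= t < end for start, end in self.transmissions)

    def start_transmission(self, t, duration):
        self.transmissions.append((t, t + duration))

def csma_send(channel, t, duration, turnaround=0.0):
    """Sense the channel before sending.  Because switching the radio
    from receive to transmit takes `turnaround` time, another node may
    still start transmitting in that gap, so collisions remain possible."""
    if channel.busy(t):
        return False   # defer; a real MAC would retry after a random backoff
    channel.start_transmission(t + turnaround, duration)
    return True
```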
34.1.1.2 Carrier Sense Multiple Access with Collision Avoidance
When all nodes can sense each other's transmissions, CSMA performs just fine. It took until 1990 before
a significant new development in MAC was recorded. The Medium Access with Collision Avoidance
(MACA) protocol [5] addresses the so-called hidden terminal problem that occurs in ad hoc (sensor)
networks, where the radio range is not large enough to allow communication between arbitrary nodes, and
two (or more) nodes may share a common neighbor while being out of each other's reach. Consider the
situation in Figure 34.2, where nodes A and C both want to transmit a packet to their common neighbor B.
Both nodes sense an idle channel and start to transmit their packets, resulting in a collision at B. Note that
since node A is hidden from C, any packet sent by C will disrupt an ongoing transmission from A to B, so
this type of collision is quite common in ad hoc networks.
The MACA protocol introduces a three-way handshake to make hidden nodes aware of upcoming
transmissions, so collisions at common neighbors can be avoided. The sender (node A in Figure 34.2)
initiates the handshake by transmitting a short Request-To-Send (RTS) control packet announcing its
intended data transmission. The receiver (B) responds with a Clear-To-Send (CTS) packet, which informs
all neighbors of the receiver (including hidden nodes like C) of the upcoming transfer. The final DATA
transfer (from A to B) is now guaranteed to be collision-free. When two RTS packets collide, which is
technically still possible, the intended receiver does not respond with a CTS, and both senders back off for
some random time. To account for the unreliability of the radio channel, MACA Wireless (MACAW [6])
adds a fourth packet to the control sequence to guarantee delivery. When the data is received correctly,
an explicit ACKnowledgment is sent back to the sender. If the sender does not receive the ACK in due
time, it initiates a retransmission sequence to account for the corrupted or lost data.
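As a sketch, the four-way exchange and its silencing effect on the receiver's neighbors can be modeled as follows. The node names mirror Figure 34.2; the trace format and the helper function are illustrative assumptions (real protocols additionally carry duration fields and use timers and backoff):

```python
# Sketch of the MACAW four-way handshake (RTS/CTS/DATA/ACK).

def macaw_exchange(sender, receiver, neighbors_of_receiver, payload):
    """Returns the packet sequence of one exchange and the set of nodes
    silenced by overhearing the receiver's CTS."""
    trace = [("RTS", sender, receiver)]
    # Every neighbor of the receiver hears the CTS, including nodes
    # hidden from the sender; they must stay silent for the transfer.
    silenced = {n for n in neighbors_of_receiver if n != sender}
    trace.append(("CTS", receiver, sender))
    trace.append(("DATA", sender, receiver, payload))
    trace.append(("ACK", receiver, sender))   # MACAW's added fourth packet
    return trace, silenced
```

With sender A, receiver B, and hidden node C as in Figure 34.2, C ends up in the silenced set, which is exactly how the handshake avoids the hidden-terminal collision.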
The collision avoidance protocol in MACA (and derivatives) is widely used and is generally known as
CSMA/CA (Carrier Sense Multiple Access with Collision Avoidance). It has proved to be very effective
in eliminating collisions. In fact, CSMA/CA is too good at it and also silences nodes whose transmissions
would not interfere with the data transfer between the sender-receiver pair. This so-called exposed terminal
problem is illustrated in Figure 34.3. In principle, the data transmissions B→A and C→D can take place
concurrently, since the signals from B cannot disturb the reception at D, and similarly C's signals cannot
collide at A. However, since B must be able to receive the CTS from A, all nodes that can hear B's RTS packet
must remain silent, even if they are outside the reach of the receiver (A). Node C is thus exposed to B's
transmission (and vice versa). Since exposed nodes are prohibited from sending, aggregate throughput
may be reduced.
FIGURE 34.2 (a) The hidden terminal problem, resolved through (b) Request-To-Send/Clear-To-Send signaling.
FIGURE 34.3 The exposed terminal problem: (a) concurrent transfers are (b) synchronized.
34.1.1.3 IEEE 802.11
In 1999, the IEEE Computer Society published the 802.11 WLAN standard [7], specifying the PHYsical
and MAC layers. IEEE 802.11-compliant equipment, usually PC cards operating in the 2.4 or 5 GHz band,
can operate in infrastructure mode as well as in ad hoc mode. In both cases, 802.11 implements carrier
sense and collision avoidance to reduce collisions (see Section 34.4.1 for details). To preserve the energy of
mobile nodes, the 802.11 standard includes a power-saving mechanism that allows nodes to go into sleep
mode (i.e., disable their radios) for long periods of time. This mode of operation requires the presence of
an access point that records the status of each node and buffers any data addressed to a sleeping node. The
access point regularly broadcasts beacon packets indicating for which nodes it has buffered packets. These
nodes may then send a poll request to the access point to retrieve the buffered data (or switch back from
sleep to active mode). Krashinsky and Balakrishnan report up to 90% energy savings for web browsing
applications, but at the expense of considerable delays [8]. Currently, power saving in 802.11's ad hoc
mode is only supported when all nodes are within each other's reach, so a simple, distributed scheme can
be used to coordinate actions; the standard does not include a provision for power saving in multihop
networks.
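The buffer/beacon/poll interaction can be sketched with a toy model. The class and method names are simplified assumptions; the real standard's Traffic Indication Map encoding and timing rules are considerably more involved:

```python
# Toy model of 802.11-style power saving at the access point.

class AccessPoint:
    def __init__(self):
        self.buffered = {}   # sleeping node -> queued frames

    def deliver(self, node, frame):
        # Frames addressed to sleeping nodes are buffered at the AP.
        self.buffered.setdefault(node, []).append(frame)

    def beacon(self):
        # The periodic beacon advertises which nodes have buffered
        # frames (the Traffic Indication Map in the real standard).
        return set(self.buffered)

    def poll(self, node):
        # A node that finds itself in the beacon sends a poll request,
        # retrieves its frames, and can then return to sleep.
        return self.buffered.pop(node, [])
```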
34.1.2 Schedule-Based Medium Access
The MAC protocols discussed so far are based on autonomous nodes contending for the channel.
A completely different approach is to have a central authority (access point) regulate access to the medium by
broadcasting a schedule that specifies when, and for how long, each controlled node may transmit over the
shared channel. The lack of contention overhead guarantees that this approach does not collapse under
high loads. Furthermore, with the proper scheduling policy, nodes get deterministic access to the medium
and can provide delay-bounded services such as voice and multimedia streaming. Schedule-based medium
access is, therefore, the preferred choice for cellular phone systems (e.g., GSM) and wireless networks
supporting a mix of data and real-time traffic (e.g., Bluetooth).
34.1.2.1 Time-Division Multiple Access
Time-Division Multiple Access (TDMA) is an important schedule-based approach that controls the access
to a single channel (techniques for handling multiple channels will be discussed in Section 34.3.2.1).
In TDMA systems, the channel is divided into slots, which are grouped into frames (see Figure 34.4). The
access point decides (schedules) which slot is to be used by which node. This decision can be made on a
per-frame basis, or it can span several frames, in which case the schedule is repeated.
In typical WLAN setups, most traffic is exchanged between the access point and the individual nodes.
In particular, communication between nodes rarely occurs. By limiting communication to up- and downlinks
only, the scheduling problem is greatly simplified. Figure 34.4 shows a typical frame layout. The first
slot in the frame is used by the access point to broadcast traffic control information to all nodes in its cell.
This information includes a schedule that specifies when each node must be ready to receive a
packet (in the downlink section) and when it may send a packet (in the uplink section). The frame ends
with a contention period in which new nodes can register themselves with the access point, so they can be
included in future schedules.
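The frame layout just described can be sketched as a simple schedule builder. The slot accounting and field names are illustrative assumptions:

```python
# Toy schedule builder for a TDMA frame: slot 0 carries the
# traffic-control broadcast; downlink slots, uplink slots, and a
# contention period follow.

def build_frame(downlink_targets, uplink_grants, contention_slots=2):
    schedule = {"traffic_control": 0, "downlink": {}, "uplink": {}}
    slot = 1
    for dest in downlink_targets:   # nodes that must wake up to receive
        schedule["downlink"][dest] = slot
        slot += 1
    for src in uplink_grants:       # nodes granted a slot to send
        schedule["uplink"][src] = slot
        slot += 1
    # Remaining slots let unregistered nodes contend for registration.
    schedule["contention"] = list(range(slot, slot + contention_slots))
    return schedule
```

A node only needs to listen during slot 0, wake up for the slots the schedule assigns to it, and may sleep through everything else; that property is what makes the energy savings of the next paragraph possible.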
FIGURE 34.4 TDMA frame structure: traffic control, downlink, uplink, and contention period, repeated every frame.
TDMA systems provide a natural way to conserve energy. A node can turn off its radio during all
slots in a frame in which it is not engaged in communication to/from the access point. It does require,
however, accurate time synchronization between the access point and the individual nodes to ensure that
a node can wake up exactly at the start of its slots. In a sensor network, where activity is usually low,
a node is then on average only awake for one slot each frame to receive the traffic control information.
Enlarging the frame size reduces the energy consumption, but also increases the latency, since a node has
to wait longer before its slot turns up. This fundamental energy/latency trade-off is further explored in
Section 34.3.2.
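A back-of-the-envelope sketch of this trade-off, assuming a node is awake for a single (traffic-control) slot per frame and a queued message waits at most one full frame for its slot; the slot counts and slot time are illustrative numbers:

```python
# Energy/latency trade-off in a TDMA frame: the duty cycle shrinks with
# the frame size, while the worst-case latency grows with it.

def tdma_tradeoff(slots_per_frame, slot_time_s, awake_slots=1):
    duty_cycle = awake_slots / slots_per_frame        # fraction of time awake
    worst_case_latency = slots_per_frame * slot_time_s
    return duty_cycle, worst_case_latency

# Doubling the frame size halves the duty cycle but doubles the latency.
dc32, lat32 = tdma_tradeoff(32, 0.01)
dc64, lat64 = tdma_tradeoff(64, 0.01)
```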
34.2 Requirements for Sensor Networks
The vast majority of MAC protocols described in the literature so far were designed, and optimized,
for scenarios involving satellite links (early work) and WLANs (recent developments). The deployment
scenarios for wireless sensor networks differ considerably, leading to a different set of requirements.
In particular, the unattended operation of sensor networks stresses the importance of energy efficiency and
reduces the significance of performance considerations such as low latency, high throughput, and fairness.
Nevertheless, there are lessons to be learned from MAC protocols developed for wireless communication
systems, especially those targeting ad hoc networks of mobile nodes. The interested reader is referred to
a number of recent surveys in this area [1, 9, 10].
The task of the MAC layer in the context of sensor networks is to use the radio, with its limited
resources, as efficiently as possible to send and receive data generated by the upper layers in the protocol
stack. It should take into account that data is often routed across multiple hops, and it should be able to handle
large-scale networks with hundreds, or even thousands, of (mobile) nodes. To understand the design
trade-offs involved, we will discuss the hardware characteristics of prototype sensor nodes in use today, as
well as common traffic patterns that have emerged in preliminary experience with applications.
34.2.1 Hardware Characteristics
The current generation of sensor nodes, some of which are commercially available, is made up of
off-the-shelf components mounted on a small printed circuit board. In the future, we expect single-chip
solutions with some of the protocol layers implemented in hardware. At the moment, however, MAC
protocols run on the main processor, which drives a separate chip that takes care of converting
(modulating) bits to/from radio waves. The interface between the processor and the radio chip is at the
level of exchanging individual bits or bytes. The advantage of this low-level interface is that the MAC
designer has absolute control, which contrasts sharply with 802.11 WLAN equipment, where the MAC
is usually included as part of the chipset on the PC card.
Popular processors include the 8-bit Atmel ATmega128L CPU used on the Mica motes, the 16-bit Texas
Instruments MSP430 used on the EYES nodes, and the PIC-16 from Microchip. The exact specifications
vary, but the processors typically run at a frequency in the 1–10 MHz range and are equipped with 2–4 KB
of RAM. The processing capabilities provide ample headroom to drive the radio, but the limited amount
of storage space for local data puts a strong constraint on the memory footprint of the MAC protocol.
Since the focus of sensor node development is on energy consumption and form factor, we do anticipate
that future generations will still be quite limited in their processing and memory resources.
Table 34.1 provides details on the characteristics of two low-power radios employed in various state-
of-the-art sensor nodes. For reference, the specifications of a typical 802.11 PC card are included. Several
important observations can be made. First, the energy consumed when sending or receiving data is two
to three orders of magnitude more than that for keeping the radio in a low-power standby state. Thus, the key to
effective energy management lies in switching the radio off and on. Second, the time needed to switch
from standby to active mode is considerable (518 µsec to 2.0 msec), and the time needed to switch the
radio between transmit and receive modes is also nonnegligible. Therefore, the number of mode switches
should be kept to a minimum. Finally, the WaveLAN card (including the MAC) outperforms the other
TABLE 34.1 Characteristics of Typical Radios in State-of-the-Art Sensor Nodes

                          RFM TR1001 [11]    CC1000 [12]        Lucent WaveLAN PC Silver card [13]
Operating frequency       868 MHz            868 MHz (a)        2.4 GHz
Modulation scheme         ASK                FSK                DSSS
Bit rate                  115.2 kbps         76.8 kbps          11 Mbps
Energy consumption
  Transmit                12 mA (1.5 dBm)    8.6 mA (−20 dBm)   284 mA
                                             25.4 mA (5 dBm)
  Receive                 3.8 mA             11.8 mA            190 mA
  Standby                 0.7 µA             30 µA              10 mA
Switch times
  Standby-to-transmit     16 µsec            2.0 msec
  Receive-to-transmit     12 µsec            270 µsec
  Standby-to-receive      518 µsec (b)       2.0 msec
  Transmit-to-receive     12 µsec            250 µsec
  Transmit-to-standby     10 µsec
  Receive-to-standby      10 µsec

(a) The CC1000 radio supports any frequency in the 300 to 1000 MHz range; the quoted numbers are
for 868 MHz.
(b) Time needed to fully initialize the receive circuitry; a simple carrier sense can be performed in 30 µsec.
radios in terms of energy per bit (77 versus 312 nJ/bit); future nodes should include radios with higher
frequencies and more complex modulation schemes.
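The energy-per-bit comparison can be reproduced from the transmit currents and bit rates in Table 34.1. The supply voltage is not listed in the table; 3 V is assumed here:

```python
# Energy per transmitted bit from Table 34.1, assuming a 3 V supply
# (an assumption; the table lists currents, not voltages).

def energy_per_bit_nj(current_ma, supply_v, bitrate_bps):
    # E/bit = P / R = (I * V) / bitrate, converted to nanojoules.
    return current_ma * 1e-3 * supply_v / bitrate_bps * 1e9

tr1001_tx = energy_per_bit_nj(12, 3.0, 115_200)        # roughly 312 nJ/bit
wavelan_tx = energy_per_bit_nj(284, 3.0, 11_000_000)   # roughly 77 nJ/bit
```

The comparison favors the WaveLAN card because its much higher bit rate amortizes the (far larger) transmit power over many more bits.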
34.2.2 Communication Patterns
In the rapidly emerging field of wireless sensor networks there is little experience with realistic, long-
running applications. This is unfortunate, since a good characterization of the workload (in terms
of network traffic) is mandatory for designing a robust and efficient MAC protocol, or any other part
of the network stack for that matter. It is, however, clear that the nature of the traffic in sensor networks
has a few remarkable characteristics that set it apart from average WLAN traffic. From the various
proposed deployment scenarios, usually in the area of remote monitoring, and the limited data from
preliminary studies, such as the Great Duck Island [14] and vehicle tracking [15] systems, it becomes clear
that data rates are very low: typically on the order of 1–200 bytes per second, with message payload sizes
around 20–25 bytes. Furthermore, two distinct communication patterns (named convergecast and local
gossip in Reference 16) appear to be responsible for generating the majority of network traffic:
Convergecast. In many monitoring applications, information needs to be transmitted periodically to
a sink node, so it can be processed at a central location or simply stored in a database for future use. Since
these individual reports are often quite small and need to travel across the whole network, the overhead
is quite large. Aggregating messages along the spanning tree to the sink node therefore pays off. At the very
least, two (or more) packets can be coalesced to share a common header. At the very best, two (or more)
messages can be combined into one, for example, when reporting the maximum room temperature.
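In-network aggregation along the spanning tree can be sketched as follows. The tree encoding and the max() aggregate (mirroring the room-temperature example) are illustrative assumptions:

```python
# Sketch of convergecast with in-network aggregation: each node merges
# its own reading with its children's partial aggregates, so a single
# small message per link travels toward the sink instead of one message
# per source node.

def aggregate(node, children, readings):
    partials = [aggregate(child, children, readings)
                for child in children.get(node, [])]
    return max([readings[node]] + partials)
```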
Local gossip. When a sensor node observes a physical event, so do its neighbors, since the node density
in a sensor network is expected to be high. This allows a node to check with the nodes in its vicinity whether
they observed the same event, and if they did not, to conclude that its own sensor is probably malfunctioning.
If its neighbors do observe the same event (e.g., a moving target), they can collaborate to obtain a better
estimate of the event (location and speed) and report that back to the sink. Besides improving the quality
of the reported information, the collaboration also avoids n duplicate messages traveling all the way back
to the sink. Depending on the situation, neighbors may be addressed individually (unicast) or collectively
(broadcast). In any case, by sharing (gossiping) their sensor readings (rumors), nodes can reduce the
likelihood of false positives and efficiently report significant events.
The important implication of these two communication patterns is that traffic is not distributed evenly
over the network. The amount of data varies both in space and in time. Nodes in the vicinity of the sink
relay much more traffic than nodes at the edges of the network, due to the convergecast pattern. The
fluctuation in time is caused by physical events triggering outbursts of local gossip. In the extreme case
of a forest fire detection system, nodes may be dormant for years before finally reporting an event. MAC
protocols should be able to handle these kinds of fluctuations.
34.2.3 Miscellaneous Services
Often the MAC layer is expected to provide some network-related services not directly associated with
data transfer. Localization and time-synchronization algorithms often need precise information about
the moment of the physical transmission of a packet to factor out any time spent by the MAC layer
in contention resolution. The routing layer needs to be informed of any local changes in network topology;
for example, it needs to know when mobile nodes move in and out of radio range. Since the MAC layer
sits directly on top of the radio, it can perform these services at no extra cost. Neighborhood discovery,
for example, must be carried out to ensure the proper operation of TDMA-based MAC protocols. We will
not consider these miscellaneous requirements in the remainder of this chapter, but concentrate on the
MAC protocol's ability to transfer data as efficiently as possible.
34.3 Energy Efficiency
The biggest challenge for designers of sensor networks is to develop systems that will run unattended for
years. This calls for robust hardware and software, but most of all for careful energy management, since that
is and will continue to be a limited resource. The current generation of sensor nodes is battery powered, so
lifetime is a major constraint; future generations powered by ambient energy sources (sunlight, vibrations,
etc.) will provide very low currents, so energy consumption is heavily constrained.
It is important to realize that the failure of individual nodes may not harm the overall functioning of
a sensor network, since neighboring nodes can take over provided that the node density is high enough
(which can be guaranteed at roll out). Therefore, the key parameter to optimize is network lifetime,
that is, the time until the network gets partitioned. The MAC layer operates on a local scale (all nodes
within reach) and lacks the global information to optimize for network lifetime. This is therefore best
accomplished at the upper layers of the protocol stack, in particular the routing and transport (data
aggregation) layers, which do have a global overview. This works most effectively when the MAC layer
ensures that the energy it spends is directly related to the amount of traffic that it handles. Thus, the MAC
layer should optimize for energy efficiency.
In contrast to typical WLAN protocols, MAC protocols designed for sensor networks usually trade off
performance (latency, throughput, fairness) for cost (energy efficiency, reduced algorithmic complexity).
It is, however, not clear-cut what the best trade-off is, and various designs differ significantly, as will
become apparent in Section 34.3.2, where we review the basic design choices made by 20 WSN-specific
MAC protocols. Before that, we will consider the major sources of overhead that render WLAN-style
(contention-based) MAC protocols ineffective in the context of sensor networks.
34.3.1 Sources of Overhead
When running a contention-based MAC protocol on an ad hoc network with little traffic, much energy
is wasted due to the following sources of overhead:
Idle listening. Since a node does not know when it will be the receiver of a message from one of its
neighbors, it must keep its radio in receive mode at all times. This is the major source of overhead, since
typical radios consume two orders of magnitude more energy in receive mode (even when no data is
arriving) than in standby mode (cf. Table 34.1).
34-8 Embedded Systems Handbook
TABLE 34.2 Impact of Overhead on Contention-Based Protocols (C) and
Schedule-Based Protocols (S)

Source                  Performance (latency,     Cost (energy
                        throughput, fairness)     efficiency)
Collisions              C                         C
Protocol overhead       C, S                      C, S
Idle listening                                    C
Overhearing                                       C
Traffic fluctuations    C, S                      C, S
Scalability/mobility    S                         S
Collisions. If two nodes transmit at the same time and interfere with each other's transmission, packets
are corrupted. Hence, the energy used during transmission and reception is wasted. The RTS/CTS
handshake effectively resolves the collisions for unicast messages, but at the expense of protocol overhead.
Overhearing. Since the radio channel is a shared medium, a node may receive packets that are not
destined for it; it would have been more efficient to have turned off its radio.
Protocol overhead. The MAC headers and control packets used for signaling (ACK/RTS/CTS) do not
contain application data and are therefore considered overhead; these overheads can be significant since
many applications only send a few bytes of data per message.
Traffic fluctuations. A sudden peak in activity raises the probability of a collision; hence, much time
and energy are spent waiting in the random backoff procedure. When the load approaches the channel
capacity, the performance can collapse with little or no traffic being delivered while the radio, sensing for
a clear channel, is consuming a lot of energy.
Switching to a schedule-based protocol (i.e., TDMA) has the great advantage of avoiding all energy waste
due to collisions, idle listening, and overhearing, since TDMA is inherently collision free and the schedule
notifies each node when it should be active and, more importantly, when not. The price to be paid is in
fixed costs (i.e., broadcasting traffic schedules) and reduced flexibility to handle traffic fluctuations and
mobile nodes. The usual solution is to resort to some form of overprovisioning, choosing a frame size
that is large enough to handle peak loads. Dynamically adapting the frame size is another approach, but
this largely increases the complexity of the protocol and, hence, is considered to be an unattractive option
for resource-limited sensor nodes. Table 34.2 compares the impact of the various sources of overhead on
the performance and cost (energy efficiency) of contention-based and schedule-based MAC protocols.
34.3.2 Trade-Offs
Different MAC protocols make different choices regarding the performance-energy trade-off, and also
between sources of overhead (e.g., signaling versus collisions). A survey of 20 medium access protocols
specially designed for sensor networks, and hence optimized for energy efficiency, revealed that they can
be classified according to three important design decisions:
1. The number (and nature) of the physical channels used.
2. The degree of organization (or independence) between nodes.
3. The way in which a node is notied of an incoming message.
Table 34.3 provides a comprehensive protocol classification based on these three issues. Given that the
protocols are listed chronologically based on their publication date, we observe that there is no clear trend
indicating that medium access for wireless sensor networks is converging toward a unique, best solution.
On the contrary, new combinations are still being invented, showing that additional information (from
simulations and practical experience) is needed to decide on the best approach. Section 34.7 provides a
simulation-based head-to-head comparison of four protocols representing very distinctive choices in the
design space. We will not discuss all individual MAC protocols listed in Table 34.3 in detail, but rather
TABLE 34.3 Protocol Classification

Protocol                   Published  Channels        Organization       Notification
SMACS [17]                 2000       FDMA            Frames             Schedule
PACT [18]                  2001       Single          Frames             Schedule
PicoRadio [19]             2001       CDMA + tone     Random             Wake-up
STEM [20]                  2002       Data + control  Random             Wake-up
Preamble sampling [21]     2002       Single          Random             Listening
Arisha [22]                2002       Single          Frames             Schedule
S-MAC [23]                 2002       Single          Slots              Listening
PCM [24]                   2002       Single          Random             Listening
Low Power Listening [25]   2002       Single          Random             Listening
Sift [26]                  2003       Single          Random             Listening
EMACs [27]                 2003       Single          Frames             Schedule (per node)
T-MAC [28]                 2003       Single          Slots              Listening
TRAMA [29]                 2003       Single          Frames             Schedule (per node)
WiseMAC [30]               2003       Single          Random             Listening
B-MAC [31]                 2003       Single          Random             Listening
BMA [32]                   2004       Single          Frames             Schedule
Miller [33]                2004       Data + tone     Random             Wake-up + listening
DMAC [34]                  2004       Single          Slots (per level)  Listening
SS-TDMA [16]               2004       Single          Frames             Schedule
LMAC [35]                  2004       Single          Frames             Listening
review three fundamental design choices that MAC designers will encounter while crafting a protocol best
matching their envisioned deployment scenario.
34.3.2.1 Use Multiple Channels, or Not?
(Design choice 1, channels: single; multiple (FDMA, CDMA); double (data + control, data + tone).)
The first design choice that we discuss is whether or not the radio should be capable of dividing the avail-
able bandwidth into multiple channels. Two common techniques for doing so are Frequency-Division
Multiple Access (FDMA) and Code-Division Multiple Access (CDMA). FDMA partitions the total band-
width of the channel into a number of small frequency bands, called subcarriers, on which multiple nodes
can transmit simultaneously without collision. CDMA, on the other hand, uses a single carrier in com-
bination with a set of orthogonal codes. Data packets are XOR-ed with a specific code by the sender
before transmission, and then XOR-ed again by the receiver with the same code to retrieve the original
data. Receivers using another code perceive the transmission as (pseudo) random noise. This allows the
simultaneous and collision-free transmission of multiple messages.
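The XOR spreading and despreading described above can be sketched in a few lines (a toy illustration with 4-chip Walsh codes; the code values, lengths, and function names are ours, not taken from any protocol discussed in this chapter):

```python
# Toy CDMA sketch: each data bit is XOR-ed with every chip of a spreading
# code by the sender; the receiver XOR-s with the same code and majority-
# votes each chip group to recover the bit. A receiver using a different
# (orthogonal) code sees a balanced mix of 0s and 1s, i.e., noise.

# Two orthogonal Walsh codes of length 4
CODE_A = [0, 1, 0, 1]
CODE_B = [0, 0, 1, 1]

def spread(bits, code):
    """XOR every data bit with each chip of the spreading code."""
    return [b ^ c for b in bits for c in code]

def despread(chips, code):
    """XOR with the same code, then majority-vote each chip group."""
    n = len(code)
    bits = []
    for i in range(0, len(chips), n):
        group = [chips[i + j] ^ code[j] for j in range(n)]
        bits.append(1 if sum(group) > n // 2 else 0)
    return bits

data = [1, 0, 1, 1]
tx = spread(data, CODE_A)
assert despread(tx, CODE_A) == data  # intended receiver recovers the data
```

With CODE_B, every despread chip group contains as many ones as zeros, which is why a receiver on another code perceives the transmission as pseudo-random noise.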
The absence of collisions in a multiple-channel system is attractive, hence its popularity in early pro-
posals, such as SMACS (FDMA) and PicoRadio (CDMA). It requires, however, a rather complicated
radio consuming considerable amounts of energy, so most MAC protocols are designed for a simple radio
providing just a single channel. An interesting alternative is to use a second, extremely low-power radio
for signaling an intended receiver to wake up and turn on its primary radio to receive a
data packet. In the simplest, most energy-efficient case, the second radio is only capable of emitting
a fixed tone, waking up all neighboring nodes (including the intended receiver). Miller and Vaidya [33]
discuss several policies to minimize the number of false wake-ups by overhearing nodes. STEM uses a
full-blown second radio to control exactly which node responds on the primary channel.
34.3.2.2 Get Organized, or Not?
(Design choice 2, organization: random; slots; frames.)
The second design choice that we discuss is if, and how much, the nodes in the network should be organized
to act together at the MAC layer. The CSMA and TDMA protocols discussed before represent the two
extremes in the degree of organization: from completely random to frame-based access. The advantages of
contention-based protocols (random access) are the low implementation complexity, the ad hoc nature,
and the flexibility to accommodate mobile nodes and traffic fluctuations. The major advantage of frame-
based TDMA protocols is the inherent energy efficiency due to the lack of collisions, overhearing, and
idle-listening overheads.
Since the advantages of random access are the drawbacks of frame-based access, and vice versa, some
MAC protocols have chosen to strike a middle ground between these two extremes and organize the sensor
nodes in a slotted system (much like slotted ALOHA). The Sensor-MAC (S-MAC) protocol was the first to
propose that nodes agree on a common slot structure, allowing them to implement an efficient duty-cycle
regime; nodes are awake in the first part of each slot and go to sleep in the second part, which significantly
reduces the energy waste due to idle listening.
The protocol classification in Table 34.3 shows that the research community is divided on what degree
of organization to apply: we find nine contention-based, three slotted, and eight TDMA-based protocols.
Since we view the organizational design decision as the most critical, we will detail the main protocols
from each class in Sections 34.4 to 34.6.
34.3.2.3 Get Notified, or Not?
(Design choice 3, notification: listening; schedule; wake-up.)
The third and final design issue is how the intended receiver of a message transfer will get
notified. In schedule-based protocols, the actual data transfers are scheduled ahead of time, so receiving
nodes know exactly when to turn on the radio. Such knowledge is not available in contention-based
protocols, so receiving nodes must be prepared to handle an incoming transfer at any moment. Without
further assistance from the sender, the receiver has no other option than to listen continuously. To eliminate
the resulting idle-listening overhead completely, senders may actively send a wake-up signal (tone) over
a second, very low-power radio. Although the wake-up model matches well with the low packet rates
of sensor network applications, all contention-based protocols except PicoRadio, STEM, and Miller's
proposal are designed for nodes with a single radio. The general approach to reduce the inherent idle
listening in these nodes is to enforce some kind of duty cycle by periodically switching the radio on for
a short time. This can be arranged individually per node (Low-Power Listening [LPL] and preamble
sampling, Section 34.4.2) or collectively per slot (S-MAC, Section 34.5.1). An alternative is to circumvent
the idle-listening problem, as the Sift protocol does, by restricting the network to a cellular topology where
access points collect data from nearby sensor nodes.
We would like to point out that the choice of a particular notification policy is largely dependent on the
available hardware channels and the organizational model discussed before. Schedule-based notification
matches with TDMA frames; wake-up is only possible on dual-channel nodes. The Lightweight Medium
ACcess (LMAC) protocol (Section 34.6.1), however, is the exception to the rule and combines TDMA
frames with listening, striking a different balance between flexibility and energy efficiency.
34.4 Contention-Based Protocols
We now proceed with describing in detail some of the medium access protocols developed for sensor
networks according to their particular choice of organizational model (see Table 34.3). In this section, we
review contention-based protocols in which nodes can start a transmission at any random moment and
must contend for the channel. The main challenge with contention-based protocols is to reduce the energy
consumption caused by collisions, overhearing, and idle listening. CSMA/CA protocols effectively deal
with collisions and can be easily adapted to avoid a lot of overhearing overhead (i.e., switch off the radio
for the duration of another transmission's sequence). We also discuss the familiar IEEE 802.11 protocol,
even though it was not developed specifically for sensor networks. It does, however, form the basis of the
energy-efficient derivatives discussed in this section (LPL and WiseMAC), as well as the slotted protocols
(S-MAC and Timeout-MAC [T-MAC]) discussed in the next section.
34.4.1 IEEE 802.11
The MAC in the IEEE 802.11 standard [7] is based on carrier sensing (CSMA) and collision detection
(through acknowledgments). A node wanting to transmit a packet must first test the radio channel to
check if it is free for a specified time called the Distributed Inter Frame Space (DIFS). If so, a DATA packet¹
is transmitted, and the receiver waits a Short Inter Frame Space (SIFS) before acknowledging the reception
of the data by sending an ACK packet. Since the SIFS interval is set shorter than the DIFS interval, the
receiver takes precedence over any other node attempting to send a packet. If the sender does not receive
the acknowledgment, it assumes that the data was lost due to a collision at the receiver and enters a binary
exponential backoff procedure. At each retransmission attempt, the length of the contention window (CW)
is doubled. Since contending nodes randomly select a time from their CW, the probability of a subsequent
collision is reduced by half. To bound access latency somewhat, the CW is not doubled once a certain
maximum (CWmax) has been reached.
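The backoff rule just described can be sketched as follows (a minimal illustration; the window values are typical 802.11 parameters, and the helper names are ours):

```python
# Sketch of 802.11-style binary exponential backoff: the contention window
# doubles after every failed attempt, capped at CW_MAX.
import random

CW_MIN = 15    # initial contention window, in slots (typical 802.11 value)
CW_MAX = 1023  # cap: the window is no longer doubled past this

def contention_window(retry):
    """CW after `retry` failed transmission attempts, capped at CW_MAX."""
    return min((CW_MIN + 1) * (2 ** retry) - 1, CW_MAX)

def backoff_slots(retry):
    """A contender picks a uniformly random slot from its current window."""
    return random.randint(0, contention_window(retry))

# Doubling the window halves the chance that two contenders pick the same
# slot again after a collision.
print([contention_window(r) for r in range(7)])
# [15, 31, 63, 127, 255, 511, 1023]
```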
To account for the hidden terminal problem in ad hoc networks, the 802.11 standard defines a virtual
carrier sense mechanism based on the collision avoidance handshake of the MACA protocol. The RTS/CTS
control packets include a time field in their header that specifies the duration of the upcoming DATA/ACK
sequence. This allows neighboring nodes overhearing the control packets to set their network allocation
vector (NAV) and defer transmission until it expires (see Figure 34.5). To save energy, the radio can be
switched off for the duration of the NAV. Thus CSMA/CA effectively eliminates collisions and overhearing
overhead for unicast packets. Broadcast and multicast packets are always transmitted without an RTS/CTS
reservation sequence (and without an ACK), so they are susceptible to collisions.
34.4.2 LPL and Preamble Sampling
The major disadvantage of CSMA/CA is the energy wasted by idle-listening. Both Hill and Culler [25],
and El-Hoiydi [21] independently developed a low-level carrier sense technique that effectively duty cycles
FIGURE 34.5 IEEE 802.11 access control. (Timing diagram: the sender transmits RTS and DATA, the
receiver answers with CTS and ACK after SIFS intervals; other nodes overhearing RTS or CTS set their
NAV(RTS)/NAV(CTS) and defer, contending in the CW after a DIFS.)
¹The 802.11 standard defines the transmission protocol in terms of frames, but we use the term packet instead to
avoid confusion with the framing structure of TDMA protocols.
FIGURE 34.6 LPL: a long preamble allows periodic sampling at the receiver.
the radio, that is, turns it off repeatedly, without losing any incoming data. This technique operates at the
physical layer and concerns the layout of the PHY header prepended to each radio packet. This header
starts off with a preamble that is used to notify receivers of the upcoming transfer and allows them to
adjust (train) their circuitry to the current channel conditions; next follows the start byte, signaling the
true beginning of the data transfer. The basic idea behind the efficient carrier-sense technique is to shift
the cost from the receiver (the frequent case) to the transmitter (the rarer case) by increasing the length
of the preamble. This allows the receiver to periodically turn on the radio to sample for incoming data,
and detect if a preamble is present or not. If it detects a preamble, it will continue listening until the start
symbol arrives and the message can be properly received (see Figure 34.6). If no preamble is detected, the
radio is turned off again until the next sample.
This efficient carrier-sense method can be applied to any contention-based MAC protocol. El-Hoiydi
combined it with ALOHA and named it preamble sampling [21]. Hill and Culler combined it with CSMA
and named it Low-Power Listening [25]. Neither implementation includes collision avoidance, to save on
protocol overhead. The energy savings depend on the duty cycle, which in turn depends on the switching
times of the radio. LPL, for example, was implemented as part of TinyOS running on Mica motes equipped
with an RFM 1000 radio capable of performing a carrier sense in just 30 μsec (cf. Table 34.1). The carrier
is sensed every 300 μsec, yielding a duty cycle of 10%, effectively reducing the idle-listening overhead by a
factor of ten. The energy savings come at a slight increase in latency (the length of the preamble is doubled
to 647 μsec) and a minor reduction in throughput. In the recently proposed B-MAC implementation (part
of TinyOS 1.1.3) the preamble length is provided as a parameter to the upper layers, so they can select the
optimal trade-off between energy savings and performance [31].
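The quoted figures combine into a simple back-of-the-envelope calculation (a sketch; only the sample time and sampling period come from the text, the variable names are ours):

```python
# LPL duty-cycle arithmetic for the Mica/RFM 1000 numbers quoted above.
t_sample = 30   # usec: one carrier-sense sample
t_period = 300  # usec: interval between successive samples

duty_cycle = t_sample / t_period
print(f"duty cycle: {duty_cycle:.0%}")  # 10%

# Idle-listening energy scales linearly with the duty cycle, so sampling
# instead of listening continuously cuts that overhead by 1/duty_cycle.
print(f"idle-listening reduction: {1 / duty_cycle:.0f}x")  # 10x
```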
34.4.3 WiseMAC
El-Hoiydi has refined his preamble sampling one step further, by realizing that long preambles are not
necessary when the sender knows the sampling schedule of the intended receiver. The sender can then
simply wait until the moment the receiver is about to sample the channel, and send a packet with an
ordinary preamble. This not only saves energy at the sender, who waits instead of emitting an extended
preamble, but also at the receiver, since the time until the start symbol occurs is reduced
considerably. In WiseMAC [30] nodes maintain the schedule offsets of their neighbors through information
piggybacked on the ACKnowledgments of the underlying CSMA protocol. Whenever a node needs
to send a message to a specific neighbor n, it uses n's offset to determine when to start transmitting the
preamble; to account for any clock drift, the preamble is extended with a time proportional to the length of
the interval since the last message exchange. The overall effect of these measures is that WiseMAC adapts
automatically to traffic fluctuations. Under low load, WiseMAC uses long preambles and consumes low
power (receiver costs dominate); under high load, WiseMAC uses short preambles and operates energy
efficiently (overheads are minimized). Finally, note that WiseMAC's preamble length optimization is not
very effective for broadcast messages, since the preamble must span the sampling points of all neighbors
and account for drift, so it is quite often stretched to full length.
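The drift-compensated preamble sizing can be sketched as a small function (a sketch following the min(4θL, T_W) rule reported in the WiseMAC paper; treat the exact form, the parameter values, and the names as assumptions):

```python
# Sketch of WiseMAC-style preamble sizing: the preamble grows with the
# clock drift accumulated since the last exchange, capped at a full
# sampling period (the long-preamble fallback).

def preamble_length(theta, since_last, t_sample_period):
    """
    theta            : clock drift rate (e.g., 30 ppm = 30e-6)
    since_last       : time since the last exchange with this neighbor (sec)
    t_sample_period  : sampling period = full-length preamble fallback (sec)
    """
    # The two clocks may drift apart by up to theta*since_last in either
    # direction, so the preamble must cover that uncertainty window on
    # both sides of the predicted sampling point.
    return min(4 * theta * since_last, t_sample_period)

# Fresh schedule info -> short preamble; stale info -> full-length preamble.
print(f"{preamble_length(30e-6, 1.0, 0.1) * 1e3:.2f} msec")     # short
print(f"{preamble_length(30e-6, 3600.0, 0.1) * 1e3:.2f} msec")  # full length
```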
34.5 Slotted Protocols
The three slotted protocols (S-MAC, T-MAC, and Data-gathering MAC [DMAC]) listed in Table 34.3 are
all derived from classical contention-based protocols. They address the inherent idle-listening overhead
FIGURE 34.7 Slot structure of S-MAC with built-in duty cycle. (Each slot consists of a SYNC phase, an
active period, and a sleep period.)
by synchronizing the nodes and implementing a duty cycle within each slot. At the beginning of a slot,
all nodes wake up, and any node wishing to transmit a message must contend for the channel. This
synchronized behavior increases the probability of collision in comparison to the random organization of
the energy-efficient CSMA protocols discussed in the previous section. To mitigate the increased collision
overheads, S-MAC and T-MAC include an RTS/CTS handshake, but DMAC does without to save on
protocol overhead. The three slotted protocols also differ in their way of deciding when and how to switch
back from active to sleep mode, as will become apparent in the following discussions.
34.5.1 Sensor-MAC
The S-MAC protocol developed by Ye et al. [23] introduces a technique called virtual clustering to allow
nodes to synchronize on a common slot² structure (Figure 34.7). To this end, nodes regularly broadcast
SYNC packets at the beginning of a slot, so other nodes receiving these packets can adjust their clocks
to compensate for drift. The SYNC packets also allow new (mobile) nodes to join the ad hoc network.
In principle, the whole network runs the same schedule, but due to mobility and bootstrapping a network
may comprise several virtual clusters. For the details of the synchronization procedure that resolves the
rare occasion of two clusters meeting each other, please refer to Reference 36.
An S-MAC slot starts off with a small synchronization phase, followed by a fixed-length active period,
and ends with a sleep period in which nodes turn off their radio. Slots are rather large, typically on the
order of 500 msec to 1 sec. The energy savings of S-MAC's built-in duty cycle are under control of the
application: the active part is fixed³ at 300 msec, while the slot length can be set to any value. Besides
addressing the idle-listening overhead, S-MAC includes collision avoidance (RTS/CTS handshake) and
overhearing avoidance. Finally, S-MAC includes message-passing support to reduce protocol overhead
when streaming a sequence of message fragments.
The application's explicit control over the idle-listening overhead is a mixed blessing. On the one hand,
the application is in control of the energy-performance trade-off, which is good. On the other hand, the
duty cycle must be decided upon before starting S-MAC, which is bad since the optimal setting depends
on many factors, including the expected occurrence rate of events observed after the deployment of the
nodes, and may even change over time.
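The knob the application controls can be made concrete with a small sketch (the 300 msec active part is from the text; the slot lengths are illustrative):

```python
# Sketch of S-MAC's energy-performance knob: the active part is fixed and
# the application chooses the slot length, which sets the duty cycle.

ACTIVE = 0.3  # sec: fixed active period of an S-MAC slot

def duty_cycle(slot_length):
    """Fraction of time the radio is on; chosen once, before starting S-MAC."""
    return ACTIVE / slot_length

# Longer slots save energy but increase latency: a message arriving during
# the sleep period must wait for the next active part.
for slot in (0.5, 1.0, 3.0):
    print(f"slot {slot:.1f} sec -> duty cycle {duty_cycle(slot):.0%}")
```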
34.5.2 Timeout-MAC
The T-MAC protocol by van Dam and Langendoen [28] introduces an adaptive duty cycle to improve on
S-MAC on two accounts. First, T-MAC frees the application from the burden of selecting an appropriate
duty cycle. Second, T-MAC automatically adapts to the traffic fluctuations inherent to the local gossip and
convergecast patterns, while S-MAC's slot length must be chosen conservatively to handle worst-case
traffic.
T-MAC borrows the virtual clustering method of S-MAC to synchronize nodes. In contrast to S-MAC,
it operates with fixed-length slots (615 msec) and uses a timeout mechanism to dynamically determine
the end of the active period. The timeout value (15 msec) is set to span a small contention period and an
RTS/CTS exchange. If a node does not detect any activity (an incoming message or a collision) within the
²The S-MAC protocol is defined in terms of frames, but we use the term slot instead to avoid confusion with the
framing structure of TDMA protocols.
³A recent enhancement of S-MAC, which is called adaptive listening, includes a variable-length active part to reduce
multihop latency [36]. Since the timeout policy of the T-MAC protocol behaves similarly and was designed to handle
traffic fluctuations as well, we do not discuss adaptive listening further.
FIGURE 34.8 Convergecast tree with matching, staggered DMAC slots. (Each node cycles through receive,
send, and sleep slots; the schedules are staggered per tree level so that a node's send slot coincides with its
parent's receive slot, up to the sink.)
timeout interval, it can safely assume that no neighbor wants to communicate with it and goes to sleep.
On the other hand, if the node engages in or overhears a communication, it simply starts a new timeout
after that communication finishes. To save energy, a node turns off its radio while waiting for other
communications to finish (overhearing avoidance).
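The timeout rule can be sketched as a small function (a sketch; the 15 msec value is from the text, while the event times and names are hypothetical):

```python
# Sketch of T-MAC's adaptive active period: any observed activity restarts
# the timeout; once TA elapses with no activity, the node goes to sleep.

TA = 15  # msec: timeout spanning a contention period and an RTS/CTS exchange

def sleep_time(activity_times, slot_start=0):
    """Time (msec) at which the node powers down, given observed activity."""
    deadline = slot_start + TA
    for t in sorted(activity_times):
        if t > deadline:
            break              # TA of silence already passed: node is asleep
        deadline = t + TA      # activity restarts the timeout
    return deadline

print(sleep_time([10, 20]))  # 35: bursts keep extending the active period
print(sleep_time([]))        # 15: silent slot, sleep after one timeout
```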
The adaptive duty cycle allows T-MAC to automatically adjust to fluctuations in network traffic. The
downside of T-MAC's rather aggressive power-down policy, however, is that nodes often go to sleep too
early: when a node s wants to send a message to r, but loses contention to a third node n that is not
a common neighbor, s must remain silent and r goes to sleep. After n's transmission finishes, s will send
out an RTS to the sleeping r and receive no matching CTS; hence, s must wait until the next frame to try
again. T-MAC includes two measures to alleviate this so-called early-sleeping problem (for details refer
to Reference 28), but the results in Section 34.7 show that it strongly favors energy savings over performance
(latency/throughput).
34.5.3 Data-Gathering MAC
The DMAC protocol by Lu et al. [34] is the third slotted protocol that we discuss. For energy efficiency and
ease of use, DMAC includes an adaptive duty cycle like T-MAC. In addition, it provides low node-to-sink
latency, which is achieved by supporting one communication paradigm only: convergecast.
DMAC divides time into rather short slots (around 10 msec) and runs CSMA (with acknowledgments)
within each slot to send or receive at most one message. Each node repeatedly executes a basic sequence
of one receive, one send, and n sleep slots. At setup, DMAC ensures that the sequences are staggered to match
the structure of the convergecast tree rooted at the sink node (see Figure 34.8). This arrangement allows
a single message from a node at depth d in the tree to arrive at the sink with a latency of just d slot times,
which is typically on the order of tens of milliseconds. DMAC includes an overflow mechanism to handle
multiple messages in the tree. In essence, a node will stay awake for one more slot after relaying a message,
so in the case of two children contending for their parent's receive slot, the one losing will get a second
chance. To account for interference, the overflow slot is not scheduled back to back with the send slot;
instead, receive slots are scheduled five slots apart. The overflow policy automatically takes care of
adapting to the traffic load, much like T-MAC's extension of the active period.
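The staggered relay can be sketched as follows (a sketch of the latency argument only; the 10 msec slot length is from the text, the function name is ours):

```python
# Sketch of DMAC's staggered relay: in every slot a message moves one level
# up the convergecast tree, so a packet from depth d reaches the sink after
# d slot times.

SLOT_MSEC = 10  # approximate DMAC slot length

def relay_to_sink(depth):
    """Trace the (time, tree level) hops of one message from `depth`."""
    return [(hop * SLOT_MSEC, depth - hop) for hop in range(depth + 1)]

trace = relay_to_sink(4)
print(trace)  # [(0, 4), (10, 3), (20, 2), (30, 1), (40, 0)]
print(f"{trace[-1][0]} msec to reach the sink")  # 40 msec for depth 4
```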
The results reported in Reference 34 show that DMAC outperforms S-MAC in terms of latency
(due to the staggered schedules), throughput, and energy efficiency (due to the adaptivity).
It remains to be seen if DMAC can be enhanced to support communication patterns other than convergecast
equally well.
34.6 TDMA-Based Protocols
The major attractions of a schedule-based MAC protocol are that it is inherently collision free and that idle
listening can be ruled out, since nodes know beforehand when to expect incoming data. The challenge is to
adapt TDMA-based protocols to operate efficiently in ad hoc sensor networks without any infrastructure
(i.e., access points). We will now briefly discuss the different approaches taken by the frame-based protocols
listed in Table 34.3:
Sink-based scheduling. The approach taken by Arisha et al. [22] is to partition the network into large
clusters, in which multihop traffic is possible. The traffic within each cluster is scheduled by a sink
node that is connected to the wired backbone network, and hence equipped with increased resources.
The goal is to optimize network lifetime, and the sink therefore takes the energy levels of each node into
account when deciding (scheduling) which nodes will sense, which nodes will relay, and which nodes
may sleep. The TDMA schedule is periodically refreshed to adapt to changes. It is required that all nodes
can directly communicate with the sink node (at maximum transmit power), which clearly limits the
scalability. Furthermore, the TDMA frame is of fixed length, so the maximum number of nodes must be
known before deployment.
Static scheduling. The Self-Stabilizing TDMA (SS-TDMA) protocol by Kulkarni and Arumugam [16] uses a
fixed schedule throughout the lifetime of the network, which completely removes the need for a centralized
(or distributed) scheduler. SS-TDMA operates on regular topologies, such as square and hexagonal grids,
and synchronizes traffic network-wide in rounds: all even rows transmit a north-bound message, all odd
rows transmit a south-bound message, and so on. The authors show that such static schedules can result in
acceptable performance for typical communication patterns (broadcast, convergecast, and local gossip),
but their constraints on the location of the nodes render the protocol impractical in many deployment scenarios.
Rotating duties. When the node density is high, the costs of serving as an access point may be amortized
over multiple nodes by rotating duties among them. The PACT [18] protocol uses passive clustering to
organize the network into a number of clusters connected by gateway nodes; the rotation of the cluster
heads and gateways is based on information piggybacked on the control messages exchanged during the
traffic control phase of the TDMA schedule. The BMA protocol [32] uses the LEACH approach [37]
to manage cluster formation and rotation. At the start of a TDMA frame, each node broadcasts one
bit of information to its cluster head, stating whether or not the node has data to send. Based on this
information, the cluster head determines the number of data slots needed, computes the slot assignment,
and broadcasts that to all nodes under its control. Note that the bit-level traffic announcements require
very tight time synchronization between the nodes in the cluster.
Partitioned scheduling. In the EMACs protocol by van Hoesel et al. [27], the scheduling duties are
partitioned according to slot number. Each slot serves as a mini-TDMA frame and consists of a contention
phase, a traffic control section, and a data section. An active node that owns a slot always transmits in
its own slot. Therefore, a node n must listen to the traffic control sections of all its neighbors, since n
may be the intended receiver of any of them. The contention phase is included to serve passive nodes that
do not own a slot, the idea being that only some nodes need to be active to form a backbone network
ready to be used by passive nodes when they detect an event. In many scenarios, events occur rarely, so
the energy spent in listening for requests forms a major source of overhead. The LMAC protocol by the
same authors therefore simply does without a contention interval. This improved protocol is discussed in
detail below. In comparison to other TDMA-based protocols, both EMACs and LMAC have the advantage
of supporting node mobility, which significantly increases their scope of deployment. The results in
Section 34.7 show that, performance-wise, partitioned scheduling is also an attractive option.
Replicated scheduling. The approach taken by Rajendran et al. [29] in the TRAMA protocol is to replicate
the scheduling process over all nodes within the network. Nodes regularly broadcast information about
(long-running) traffic flows routed through them and the identities of their one-hop neighbors. This
results in each node being informed about the demands of its one-hop neighbors and the identity of
its two-hop neighbors. This information is sufficient to determine a collision-free slot assignment by
means of a distributed hash function that computes the winner (i.e., sender) of each slot based on the
node identities and slot number. During execution the schedule may be adapted to match actual traffic
conditions; nodes with little traffic may release their slot for the remainder of the frame for use by other
(overloaded) nodes. Although TRAMA achieves high channel utilization, it does so at the expense of
considerable latency and high algorithmic complexity.
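The winner election at the heart of this scheme can be sketched as follows. TRAMA defines its own pseudo-random priority function; SHA-256 here is merely a stand-in, and the function names are ours:

```python
import hashlib

def priority(node_id, slot):
    # Pseudo-random priority derived from node identity and slot number;
    # every node evaluates the same function, so no messages are needed.
    digest = hashlib.sha256(f"{node_id}:{slot}".encode()).digest()
    return int.from_bytes(digest[:4], "big")

def slot_winner(two_hop_ids, slot):
    # The contender with the highest priority is the sender for this slot
    # (ties broken by node id); all nodes reach the same verdict locally.
    return max(two_hop_ids, key=lambda n: (priority(n, slot), n))
```

Because each node knows its two-hop neighborhood, every node computes the same winner for every slot without further communication, and the winner rotates pseudo-randomly across slots.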
2006 by Taylor & Francis Group, LLC
34-16 Embedded Systems Handbook
From the discussion above, it becomes apparent that distributing TDMA out into ad hoc networks is rather
complicated and requires major compromises on deployment scenario (SS-TDMA and Arisha's protocol),
algorithmic complexity (TRAMA), flexibility/adaptivity (EMACs and LMAC), and latency (all protocols).
Although TDMA is inherently free of collision and idle-listening overheads, PACT and BMA rely on the
higher layers to amortize the overheads of the TDMA scheduler over rotating cluster heads.
Note that the partitioned and replicated scheduling approaches are most similar to contention-based
and slotted protocols in the sense that nodes operate autonomously, making them easy to install and
operate, and robust to node failures. The algorithmic complexity of the TRAMA protocol (replication)
is beyond the scope of this chapter, so we will only detail the LMAC protocol (partitioning).
34.6.1 Lightweight Medium Access
With the LMAC protocol [35], nodes organize time into slots, grouped into fixed-length frames. A slot
consists of a traffic control section (12 bytes) and a fixed-length data section. The scheduling discipline is
extremely simple: each active node is in control of a slot. When a node wants to send a packet, it waits until
its time-slot comes around, broadcasts a message header in the control section detailing the destination
and length, and then immediately proceeds with transmitting the data. Nodes listening to the control
header turn off their radio during the data part if they are not an intended receiver of the broadcast
or unicast message. In contrast to all other MAC protocols, the receiver of a unicast message does not
acknowledge the correct reception of the data; LMAC leaves the issue of reliability to the upper layers.
The LMAC protocol ensures collision-free transmission by having nodes select a slot number that is not
in use within a two-hop neighborhood (much like frequency reuse in cellular communication networks).
To this end, the information broadcast in the control section includes a bit set detailing which slots
are occupied by the one-hop neighbors of the sending node (i.e., the slot owner). New nodes joining the
network listen for a complete frame to all traffic control sections. By OR-ing the occupancy bit sets, they
can determine which slots are still free (Figure 34.9). The new node randomly selects a slot and claims
it by transmitting control information in that slot. Collisions in slot-selection result in garbled control
sections. A node observing such a collision broadcasts the involved slot number in its control section,
which will be overheard by the unfortunate new nodes, who will then back off and repeat the selection
process.
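The slot-selection step can be sketched in a few lines; the bit-set width and function names below are illustrative, not taken from the LMAC specification:

```python
import random

def free_slots(overheard_bitsets, num_slots=32):
    # OR together the occupancy bit sets overheard during one full frame;
    # a zero bit then marks a slot unused within two hops.
    occupied = 0
    for bits in overheard_bitsets:
        occupied |= bits
    return [s for s in range(num_slots) if not occupied & (1 << s)]

def pick_slot(overheard_bitsets, num_slots=32):
    # A joining node claims a random free slot; simultaneous claims are
    # resolved by the back-off-and-retry procedure described above.
    candidates = free_slots(overheard_bitsets, num_slots)
    return random.choice(candidates) if candidates else None
```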
The drawback of LMAC's contention-based slot-selection mechanism is that nodes must always listen
to the control sections of all slots in a frame, even the unused ones, since other nodes may join the
network at arbitrary moments. The resulting idle-listening overhead is minimized by taking one sample
of the carrier in an unused slot to sense any activity (cf. preamble sampling in Section 34.4.2). If there was
activity, the slot is included in the occupancy bit set and listened to completely in the next frame. The end
result is that LMAC combines a frame-based organization with notification by listening.
FIGURE 34.9 Slot-selection by LMAC. Nodes are marked with slot number and occupancy bit set.
Energy-Efficient Medium Access Control 34-17
34.7 Comparison
In the previous sections we reviewed 20 energy-efficient MAC protocols especially developed for sensor
networks. We discussed the qualitative merits of the different organizations: contention-based, slotted,
and TDMA-based protocols. When available, we reported quantitative results published by the designers
of the protocol at hand. Unfortunately, results from different publications are difficult to compare due to
the lack of a standard benchmark, making it hard to draw any final conclusions. This section addresses
the need for a quantitative comparison by presenting the results from a study into the performance
and energy efficiency of four MAC protocols (LPL, S-MAC, T-MAC, and LMAC) on top of a common
simulation platform. For reference we also report on the classic IEEE 802.11 protocol (in ad hoc mode).
The workload used to evaluate the protocols ranges from standard micro-benchmarks (latency and
throughput tests) to communication patterns specific to sensor networks (local gossip and convergecast).
34.7.1 Simulation Framework
The discrete-event simulator developed at Delft University of Technology includes a detailed model of the
popular RFM TR1001 low-power radio (discussed in Section 34.2.1) taking turnaround and wake-up
times (12 and 518 µsec, respectively) into account. Energy consumption is based on the amount of
energy the radio uses; we do not take protocol processing costs on a CPU driving the radio into account.
The simulator records the amount of time spent in various states (standby, transmit, and receive/idle);
transitions between states are modeled as time spent in the most energy-consuming state. At the end of
a run the simulator computes the average energy consumed for each node in the network using the current
drawn by the radio in each state (Table 34.1) and an input voltage of 3 V.
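This energy accounting reduces to a small computation, which can be sketched as follows. The current values in the example are illustrative placeholders only, since Table 34.1 is not reproduced here:

```python
def average_power_mw(state_times_s, state_currents_ma, voltage_v=3.0):
    # Energy per state (mJ) = time (s) * current (mA) * voltage (V);
    # dividing the total energy by the run length yields average power (mW).
    total_time = sum(state_times_s.values())
    energy_mj = sum(state_times_s[s] * state_currents_ma[s] * voltage_v
                    for s in state_times_s)
    return energy_mj / total_time

# Illustrative currents (mA), not the actual Table 34.1 values.
currents = {"standby": 0.0007, "receive": 3.8, "transmit": 12.0}
times = {"standby": 9.0, "receive": 0.9, "transmit": 0.1}  # seconds
print(f"{average_power_mw(times, currents):.2f} mW")  # prints 1.39 mW
```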
The five MAC protocols under study are implemented as a class hierarchy on top of the physical
layer, which is a thin layer encapsulating the RFM radio model. The physical layer takes care of low-level
synchronization (preambles, start/stop bits) and proper channel coding. We now briefly discuss the
implementation details of the five MAC protocols:
802.11. The IEEE 802.11 (CSMA/CA) protocol was implemented using an 8 byte header encoding the
message type (RTS/CTS/DATA/ACK), source and destination ID (2 bytes each), sequence number, data
length, and CRC. The payload of the DATA packet can be up to 250 bytes. The sequence number serves
to detect duplicate packets; retransmissions are triggered upon detection of a missing CTS or ACK
packet.
LPL. The LPL protocol (CSMA with acknowledgments) was implemented with the DATA and ACK
packets from the 802.11 implementation. LPL was set to sample the radio with a 10% duty cycle: 30 µsec
carrier sense, 300 µsec sample period. The preamble was stretched with one sample period to 647 µsec.
Since hidden nodes make CSMA susceptible to collisions, LPL's initial contend time is set somewhat larger
than for 802.11 (9.15 versus 3.05 msec).
S-MAC. The implementation of the S-MAC protocol extends the 802.11 model with SYNC packets
(8 byte header + 2 byte timestamp) to divide time into slots of 610 msec (20,000 ticks of a 32 kHz crystal).
Like LPL, S-MAC is set to operate with a 10% duty cycle; hence, the active period is set to 61 msec. This
is different from the original implementation to account for the different radio bitrate in our simulator
and to bring the frame length in line with T-MAC. Since traffic is grouped into bursts in the active
period, S-MAC deviates from the 802.11 backoff scheme and uses a fixed contend time of 9.15 msec.
To reduce idle-listening overhead we chose to remove the synchronization section from the original
S-MAC protocol; SYNC packets are transmitted in the active period of a slot. To reduce interference with
other packets, a node transmits a SYNC packet only once every 90 sec on average. In our grid topology
with eight neighbors within radio range, that amounts to receiving a SYNC message every 11 sec.
T-MAC. The implementation of the T-MAC protocol enhances the S-MAC model with a variable-length
active period controlled by a 15 msec timeout value, which is set to span the contention period
(9.15 msec), an RTS (1.83 msec), the radio turnaround period (12 µsec), and the start of a CTS. This timeout
value causes T-MAC to operate with a 2.5% duty cycle in an empty network. In a loaded network the duty
TABLE 34.4 Implementation Details of the Simulator

PHYsical layer
  Channel coding      8-to-16 bit coding
  Effective bit rate  46 kbps
  Prelude             433 µsec (347 µsec preamble + startbyte)
  Carrier sense       30 µsec
802.11 [extends PHY]
  Control packets     8 bytes
  DATA packets        8 byte header and 0–250 byte payload
  Contend time        3.05–305 msec
LPL [extends 802.11]
  Sample period       300 µsec
  Contend time        9.15–305 msec
S-MAC [extends 802.11]
  SYNC packets        10 bytes
  Slot time           610 msec
  Active period       61 msec
  Contend time        9.15 msec
T-MAC [extends S-MAC]
  Activity timeout    15 msec
LMAC [extends PHY]
  Slot time           14.3 msec (76 bytes)
  Frame time          456 msec (32 slots)
cycle will increase as the active period is adaptively extended. All options for mitigating the early-sleeping
problem are included; see Reference 28 for details.
LMAC. The LMAC protocol was implemented from scratch on top of the physical layer. It was set to
operate with the maximum of 32 slots per frame to ensure that all nodes within a two-hop neighborhood
can own a slot for typical node densities (up to ten neighbors). The slot size was set to 76 bytes (12 byte
header + 63 byte data section + 1 byte CRC) to support a reasonable range of application-dependent
message sizes. We short-circuited LMAC's collision-based registration procedure by randomly selecting
a slot number for each node at the start of a simulation run. A node listens to the 12 byte control
sections of all slots owned by its one-hop neighbors; it polls the other slots in the frame with the short,
30 µsec carrier sense function to detect new nodes joining the network (which never happens during the
experiments).
For convenience, Table 34.4 lists the key parameters of the MAC protocols used in our comparison. Note
that the LMAC implementation includes a certain overprovisioning, since the experiments involve just
24 two-hop neighbors (<32 slots) and messages with a 25 byte payload (<63 bytes). This is the price to
be paid for LMAC's simplicity; other protocols, however, pay in terms of overhead (RTS/CTS signaling).
Another important characteristic of LMAC is that it does not try to correct any transmission errors, while
the others automatically do so through their retransmission policy for handling collisions. This difference
also shows up in the estimated memory footprint (i.e., RAM usage) and code complexity of the MAC
protocols listed in Table 34.5. All protocols except LMAC maintain information about the last sequence
number seen from each neighbor to filter out duplicates.
Our experiments use a static network with a grid topology. The radio range was set so that the nonedge
nodes all have eight neighbors. Concurrent transmissions are modeled to cause collisions if the radio
ranges (circles) of the senders intersect; nodes in the intersection receive a garbled packet with a failing
CRC check.
The application is modeled by a traffic generator at every node. The generator is parameterized to
send messages with a 25 byte payload either to direct neighbors (i.e., nodes within the radio range of the
sender), or to the sink node, which is located in the bottom-left corner of the grid. To route the latter
TABLE 34.5 Code Complexity and Memory Usage

                          802.11   LPL   S-MAC   T-MAC   LMAC^a
Code complexity (lines)      400   325     625     825     250
RAM usage (bytes)             51    49      78      80      15

^a The LMAC protocol leaves acknowledgments and retransmissions to the
higher layers, adding about 75 lines of code and 40 bytes of RAM to those
layers.
TABLE 34.6 Base Performance with an Empty Network

                           802.11    LPL   S-MAC   T-MAC   LMAC
Energy consumption (mW)      11.4   1.14    1.21    0.37   0.75
Effective duty cycle (%)      100     10      11     3.2    6.6
messages to the sink, we use a randomized shortest-path routing method; for each message, the possible
next hops are enumerated. Next hops are eligible if they have a shorter path to the final destination than
the sending node. From these next hops, a random one is chosen. Thus messages flow in the correct
direction, but do not use the same path every time. No control messages are exchanged for this routing
scheme: nodes automatically determine the next hop. By varying the message interarrival times, we can
study how the protocols perform under different loads.
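The next-hop selection can be sketched as follows, assuming (x, y) grid coordinates and the eight-neighbor (Chebyshev) hop metric implied by the topology; the function name is ours:

```python
import random

def next_hop(node, sink, neighbors):
    # With eight neighbors in radio range, the hop count to the sink is
    # the Chebyshev distance on the grid. Any neighbor strictly closer to
    # the sink is eligible; one is picked at random, so successive messages
    # flow toward the sink over varying paths.
    def hops(p):
        return max(abs(p[0] - sink[0]), abs(p[1] - sink[1]))
    eligible = [n for n in neighbors if hops(n) < hops(node)]
    return random.choice(eligible) if eligible else None
```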
34.7.2 Micro-Benchmarks
To determine the organizational overhead associated with each protocol we ran the simulator with an
empty workload. The resulting energy consumption is shown in Table 34.6. This table also shows the
effective duty cycle relative to the performance of the 802.11 protocol, which keeps all nodes listening all
the time.
The contention-based LPL protocol wastes no energy on organizing nodes, and achieves its target duty
cycle of 10%. The slotted protocols (S-MAC and T-MAC) spend some energy on sending and receiving
SYNC packets, but the impact is limited as the effective duty cycles only marginally exceed the built-in
active/sleep ratios (10 and 2.5%). Finally, note that the overhead of the TDMA-based LMAC protocol is
remarkably low (6.6%), which is largely due to the efficient carrier sense at the physical layer. If the nodes
were to listen to all traffic control sections completely, the overhead would grow to about 16% (12 control
bytes per 76 byte slot).
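The 16% figure follows directly from the slot layout:

```python
# Worst case: every 12-byte control section of every 76-byte slot is
# received in full.
control_bytes, slot_bytes = 12, 76
print(f"{control_bytes / slot_bytes:.0%}")  # prints 16%
```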
Our second experiment measured the multihop latency in an empty network, which we expect to be
significant for slotted and schedule-based protocols. The results in Figure 34.10 confirm this: S-MAC,
T-MAC, and LMAC show end-to-end latencies that are much higher than those obtained by 802.11 and
LPL. In the case of LMAC a node that wants to send or relay a packet must wait until its slot turns up.
On average, this means that packets are delayed by half the length of a frame, or 236 msec, which is an
order of magnitude more than the one-hop latency under 802.11 (13.2 msec). With T-MAC and S-MAC
the source node must wait for the next active period to show up before it can transfer the message with
an RTS/CTS/DATA/ACK sequence. This accounts for the initial offset of 263 msec. Then, in the case of
T-MAC, the second node may immediately relay the message since the third node is again awake due
to overhearing the first CTS packet. The fourth node, however, did not receive that same CTS and for
lack of activity went to sleep. Therefore, the third node's attempt to relay will fail, and it has to wait
until the start of the next slot. This accounts for T-MAC's staircase pattern in Figure 34.10. S-MAC is
less aggressive in putting nodes to sleep, and messages can travel about 3 to 4 hops during one active
period. The exact number depends on the random numbers selected from the contention interval prior
to each RTS, and may be different for each data packet and active period. These numbers get averaged
over multiple messages, which explains the erosion of the staircase pattern when traveling more hops.
FIGURE 34.10 Multihop latency in an empty network. [Line plot: latency (msec, 0–4500) versus number of hops (0–16) for 802.11, LPL, S-MAC, T-MAC, and LMAC.]
FIGURE 34.11 Throughput in a 3 × 3 grid. [Line plot: packets received/sec (0–70) versus generated packets/sec (0–100) for 802.11, LPL, S-MAC, T-MAC, and LMAC.]
Finally, observe that LPL outperforms 802.11 because it does not include an RTS/CTS handshake, but
sends the DATA immediately.
The third experiment that we carried out measured the maximum throughput that a single node can
handle (channel utilization). We selected a 3 × 3 section of the network grid, and arranged the 8 edge
nodes to repeatedly send a message (25 byte payload) to the central node. By increasing the sending
rate we were able to determine the maximum throughput each MAC protocol can handle,
and whether or not it collapses under high loads. Figure 34.11 shows the results of this stress test. LPL
performs very poorly because of the collisions generated by hidden nodes; in the 3 × 3 configuration each
sending node senses only the communications by its direct neighbors on the edge, but the other nodes
are hidden from it. The repeated retransmissions issued to resolve the collisions cause the internal queues
to overflow, and packets to be dropped. The RTS/CTS handshake eliminates most collisions and 802.11
achieves a maximum throughput of around 70 packets per second, which is about 30% of the effective
bitrate (46 kbps, or 230 packets/sec) offered by the physical layer. The signaling overhead (33 bytes MAC
control + physical layer headers + radio turnaround times) already reduces this capacity to 85 packets/sec;
the remaining loss is caused by the contention period prior to each RTS. S-MAC runs at a 10% duty cycle,
and its throughput is therefore reduced by a factor of 10. T-MAC, on the other hand, adapts its duty cycle
and is able to follow the 802.11 curve at much higher loads than the other protocols. It flattens off abruptly
(around 45 packets/sec) due to its fixed contention window (9.15 msec), which is much shorter than the
maximum length of 802.11's binary backoff period (305 msec). The throughput of LMAC is limited by
two factors: (1) only 8 out of 32 slots in each frame are used, and (2) only 25 bytes out of each 76 byte slot
are used. Consequently, LMAC's throughput is maximized at 8% of the channel capacity.
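That ceiling is simply the product of the two factors:

```python
slot_share = 8 / 32       # 8 of the 32 slots per frame carry traffic
payload_share = 25 / 76   # 25 payload bytes per 76-byte slot
print(f"{slot_share * payload_share:.1%}")  # prints 8.2%
```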
34.7.3 Homogeneous Unicast and Broadcast
The micro-benchmarks discussed in the previous section studied the behavior of the MAC protocols
in isolation. In this section, we report on experiments involving all nodes in the network. The results
in this section provide a stepping stone for understanding the performance of the complex local gossip
and convergecast patterns common to sensor network applications.
In our first network-wide experiment, we had all 100 nodes in a 10 × 10 grid repeatedly send a message
(25 byte payload) to a randomly selected neighbor. The intensity of this homogeneous load on the network
was controlled by adjusting the sending rate of the nodes. The topmost graph in Figure 34.12 shows the
delivery ratio with increasing load. It reveals that S-MAC, T-MAC, and LPL collapse at some point, while
the performance of the LMAC and 802.11 protocols degrades gracefully. When comparing the order in
which the protocols break down (S-MAC, T-MAC, LPL, LMAC, 802.11) with that of the corresponding
throughput benchmark (LPL, S-MAC, LMAC, T-MAC, 802.11) we see some striking differences. First,
LPL does much better, because nodes are throttled back by eight neighbors instead of just a few, reducing
the probability of a collision with a hidden node. Second, T-MAC does much worse, because the RTS/CTS
signaling in combination with T-MAC's power-down policy silences nodes too early. Third, the gap
between LMAC and 802.11 for high loads has shrunk considerably, which is mainly caused by 802.11 now
suffering from exposed nodes not present in the micro-benchmark.
The middle graph in Figure 34.12 plots the energy consumption of each MAC protocol when intensifying
the homogeneous load. Again we observe a few remarkable facts. First, the energy consumption of
the 802.11 protocol decreases for higher loads. This is caused by the overhearing avoidance mechanism
that shuts down the radio during communications in which a node is not directly involved. Second, the
energy consumption of T-MAC and LPL initially increases linearly, then jumps to 11 mW. The jumps
correspond with the breakdowns of the message delivery rates, showing that the most energy is spent
on retransmissions due to collisions. The difference in gradient is caused by T-MAC spending additional
energy on the RTS/CTS handshake and the early-sleeping problem. Third, the energy consumption of
LMAC and S-MAC cross at about 50 bytes/node/sec, but while LMAC still delivers more than 97% of
the messages, S-MAC's delivery rate is down to just 10%. This significant difference in price/performance
ratio is shown in the bottom graph of Figure 34.12, which plots the energy spent per data bit delivered.
These energy-efficiency curves clearly show the collapse of the (slotted) contention-based protocols.
In our second network-wide experiment we had all 100 nodes repeatedly send a broadcast message
(25 byte payload) to their neighbors. Figure 34.13 shows the delivery rates, energy consumption, and
energy-efficiency metrics. When comparing these results with those for unicast (Figure 34.12) some
interesting differences and similarities emerge. Consider the LMAC protocol first. For broadcast it achieves
the same delivery rate as for unicast, which is no surprise given that LMAC guarantees collision-free
communications. The energy consumption to handle broadcast traffic, on the other hand, is about
twice the amount needed for unicast under high loads. This is a consequence of each node processing
more incoming data; instead of one neighbor with its radio set to listen for unicast, all neighbors have to
listen for a broadcast packet. This effect also explains why the energy per received bit of information is reduced
by a factor of about six for light loads: all neighbors (6.84 on average) receive useful data at little extra
cost, and the energy is calculated per received bit.
FIGURE 34.12 Performance under homogeneous unicast traffic: delivery rate (top), energy consumption (middle), and energy efficiency (bottom). [Three line plots versus payload (bytes/node/sec, 0–120) for 802.11, LPL, S-MAC, T-MAC, and LMAC: delivery ratio (0–1), energy consumed (avg. mW/node, 0–12), and energy per bit (mJ, 0–140).]
FIGURE 34.13 Performance under homogeneous broadcast traffic: (a) delivery rate, (b) energy consumption, and (c) energy efficiency. [Three line plots versus payload (bytes/node/sec, 0–120) for 802.11, LPL, S-MAC, T-MAC, and LMAC: delivery ratio (0–1), energy consumption (avg. mW/node, 0–14), and energy per bit (mJ, 0–30).]
When considering the other protocols we find that the delivery rates degrade for light loads with
respect to the unicast experiments, but improve dramatically for high loads. In particular, we find no
breakdown points as with unicast traffic. The reason is twofold: (1) there are no retransmissions that
clog up the network, and (2) even when collisions occur, some of the neighbors still receive data. Note
that the delivery ratio should be interpreted as the fraction of neighbors receiving a broadcast message,
not as the probability that the message is received by all neighbors. The slotted protocols (S-MAC and
T-MAC) perform considerably worse than the contention-based protocols (802.11 and LPL). The reason
for this is that by grouping all traffic into a rather short active period, the probability of a collision is
increased considerably. The reason that 802.11 outperforms LPL is that the latter uses a longer preamble,
and although this increases the length of the DATA only by about 5%, the probability of a collision is
raised enough to make a difference in delivery rate.
The energy-efficiency curves show that all protocols except S-MAC spend less energy per bit when the
intensity of the broadcast traffic increases. In particular, the contention-based protocols do not suffer from
a collapse as with unicast. The reason that the energy spent per bit increases with S-MAC is threefold:
(1) it suffers from considerably more collisions in its small active period, (2) the fraction of time spent
in transmitting steadily increases, especially since no time is spent waiting during a handshake as for
unicast, and (3) overhearing avoidance is no longer applicable, forcing the radio to be on all the time
during S-MAC's active period. The latter reason also explains why 802.11's energy consumption does not
go down with increasing load as it did for unicast traffic.
34.7.4 Local Gossip
The first communication pattern specific to sensor network applications that we studied was local gossip.
We designated a 5 × 5 area in the middle of the grid as the event region in which nodes would repeatedly
send a message (25 byte payload) to a randomly selected neighbor. In essence local gossip is a mixture of
75% empty workload (Table 34.6) and 25% homogeneous workload (Figure 34.12). The delivery rates
associated with local gossip, as shown in Figure 34.14, are completely determined by the homogeneous
unicast component of the workload, and therefore resemble the curves in Figure 34.12 to a large extent.
The LMAC curve is identical; the others are shifted to the right because collisions occur less frequently
due to a relatively large number of edge nodes with inactive neighbors (16/25 versus 36/100). The energy
consumption numbers, which are averages over the whole network, are diluted by the empty workload
component (cf. Figure 34.12 and Figure 34.14). In contrast, the energy-efficiency numbers, not shown
for brevity, are raised since the energy spent by passive nodes (idle listening) is amortized over the limited
traffic in the 5 × 5 region.
34.7.5 Convergecast
In our final experiment we studied the convergecast communication pattern. All 100 nodes in the network
periodically send a message (25 byte payload) to the sink in the bottom-left corner of the grid. To maximize
the load on the MAC protocols, messages are not aggregated at intermediate nodes. Figure 34.15 shows
the delivery rates and energy efficiencies for the convergecast pattern. The shapes of these curves show
a large similarity with the homogeneous unicast pattern. Note that the generated load that can be handled
is much lower than with homogeneous unicast, since each injected message needs to travel 6.15 hops
on average. The performance results, however, do not simply scale with the path-length factor. The
breakdown points on the delivery curves for convergecast are shifted far more to the left than a factor
of six, and the order in which the protocols break down also changes significantly. In particular, the
LMAC protocol cannot handle the heavy loads around the sink since each node can only use the capacity
of one slot, as demonstrated by the throughput micro-benchmark. T-MAC and LPL handle the high
loads around the sink much better than LMAC, with LPL being slightly more efficient. Both suffer from
a collapse, however, when the load is increased, causing the energy consumed per bit to suddenly rocket
upwards. Furthermore, note that energy efficiency degrades by more than a factor of six compared with that
FIGURE 34.14 Performance under local gossip: delivery rate (top) and energy consumption (bottom). [Two line plots versus payload (bytes/node/sec, 0–120) for 802.11, LPL, S-MAC, T-MAC, and LMAC: delivery ratio (0–1) and energy consumed (avg. mW/node, 0–12).]
for unicast under comparable load. Apparently, even the adaptive T-MAC protocol finds it impossible to
select the right duty cycle for each node.
34.7.6 Discussion
When reviewing the simulation results we find that no MAC protocol outperforms the others in all
experiments. Each protocol has its strong and weak points, which reflects the particular choice of how to
trade off performance (latency, throughput) for cost (energy consumption). Some general observations,
however, can be made:
Communication grouping considered harmful. The slotted protocols (S-MAC and T-MAC) organize
nodes to communicate during small periods of activity. The advantage is that very low duty cycles can
be obtained, but at the expense of high latency and a collapse under high loads. T-MAC's automatic
adaptation of the duty cycle allows it to handle higher loads; S-MAC's fixed duty cycle bounds the energy
consumption under a collapse.
2006 by Taylor & Francis Group, LLC
34-26 Embedded Systems Handbook
FIGURE 34.15 Performance under convergecast: delivery rate (top) and energy efficiency (bottom). [Two line plots versus payload (bytes/node/sec, 0–20) for 802.11, LPL, S-MAC, T-MAC, and LMAC: delivery ratio (0–1) and energy per bit (mJ, 0–900).]
The TDMA-based LMAC protocol also limits the moments at which nodes may communicate and
therefore incurs high latencies in general, and reduced throughput under high load. In contrast to T-MAC,
its energy consumption does not deteriorate; LMAC is rather robust and performance degrades gracefully
under higher loads.
The LPL protocol is most flexible since it puts only minor restrictions on when nodes can communicate
(i.e., once every 300 µsec). Its sampling approach, however, critically depends on the radio's ability to
switch on quickly. This is the case for the RFM radio at hand, but preliminary experiments with the
Chipcon radio show that LPL's advantage weakens when operating with a corresponding 2 out of 20 msec
duty cycle.
Collision avoidance considered prohibitive. On the one hand, the RTS/CTS handshake prevents collisions
due to hidden nodes, which is good. On the other hand, the RTS/CTS handshake reduces the effective
channel capacity since a communication takes more time (11.68 versus 8.31 msec), which lowers the
packet transfer rate at which the network collapses. Given that typical messages in sensor
networks are small, the overheads associated with collision avoidance prove to be prohibitive, especially
in combination with communication grouping.
Adaptivity considered essential. The results for the local gossip and convergecast communication patterns
show that MAC protocols must be able to adapt to local traffic demands. Static protocols either consume
too much energy under low loads (e.g., S-MAC), or throttle throughput too much under high loads
(e.g., LMAC). The current generation of adaptive protocols (e.g., T-MAC and LPL), however, is not the
final answer since they suffer from contention collapse, forcing applications to be aware of that and take
precautions.
34.8 Conclusions
Medium access protocols for wireless sensor networks trade off performance (latency, throughput, and
fairness) for cost (energy consumption). They do so by turning off the radio for significant amounts
of time, reducing the energy wasted by idle listening, which dominates the cost of typical WLAN-based
MAC protocols. Other sources of overhead include collisions, overhearing, protocol overhead, and traffic
fluctuations. Different protocols take different approaches to reduce (some of) these overheads. They can
be classified according to three important design decisions: (1) the number of channels used (single, double,
or multiple), (2) the way in which nodes are organized (random, slotted, frames), and (3) the notification
method used (listening, wake-up, schedule). Given that the current generation of sensor nodes is equipped
with one radio, most protocols use a single channel. The organizational choice, however, is not so easily
decided on, since it reflects the fundamental trade-off between flexibility and energy efficiency.
Contention-based protocols like CSMA are extremely flexible regarding the time, location, and amount
of data transferred by individual nodes. This gives them the advantage of handling the traffic fluctuations
present in typical monitoring applications running on wireless sensor networks. Contention-based protocols
can be made energy efficient by implementing a duty cycle at the physical level, provided that the radio
can be switched on and off rapidly. The idea is to stretch the preamble, which allows potential receivers to
sample the carrier at a low rate.
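The preamble-stretching arithmetic can be sketched directly: the preamble must span at least one full carrier-sampling interval so that every receiver's periodic sample lands inside it, and the receiver's cost is simply the sampled fraction of time. A minimal sketch (the function names are illustrative; the example reuses the 2 out of 20 msec duty cycle mentioned above, not figures from a real radio driver):

```python
def min_preamble_s(check_interval_s: float) -> float:
    """Shortest preamble that guarantees every receiver, sampling the
    carrier once per check_interval_s, overlaps it at least once."""
    return check_interval_s

def rx_duty_cycle(sample_time_s: float, check_interval_s: float) -> float:
    """Fraction of time a receiver keeps its radio on just to sample."""
    return sample_time_s / check_interval_s

# 2 msec of carrier sampling out of every 20 msec, as for the Chipcon case:
dc = rx_duty_cycle(2e-3, 20e-3)      # a 10% receive duty cycle
preamble = min_preamble_s(20e-3)     # the sender must stretch its preamble to 20 msec
```

The trade-off is visible in the two numbers: a longer check interval lowers the receiver's duty cycle but forces every sender to transmit a proportionally longer preamble.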
Slotted protocols organize nodes to synchronize on a common slot structure. They reduce idle listening
by implementing a duty cycle within each slot. This duty cycle need not be fixed, and can be adapted
automatically to match demands.
TDMA-based protocols have the advantage of being inherently free of idle listening, since nodes are
informed up front, by means of a schedule, when to expect incoming traffic. To control the overheads
associated with computing the schedule and its distribution through the network, TDMA-based protocols
must either limit the deployment scenario (e.g., single hop) or hard-code some parameters (e.g., maximum
number of two-hop neighbors), compromising on flexibility.
A head-to-head comparison of sample protocols from each class revealed that there is no single, best
MAC protocol that outperforms all others. What did become apparent, however, is that adaptivity is
mandatory to handle the generic local gossip and convergecast communication patterns displaying traffic
fluctuations both in time and space. Considering the speed at which protocols have been developed so far,
we expect a number of new protocols to appear that will strike yet another balance between flexibility and
energy efficiency. Other future developments may include cross-layer optimizations with routing and data
aggregation protocols, and an increased level of robustness to handle practical issues, such as asymmetric
links and node failures.
Acknowledgments
We thank Tijs van Dam for his initial efforts in designing the T-MAC protocol and putting the issue of
energy-efficient MAC protocols on the Delft research agenda. We thank Ivaylo Haratcherev, Tom Parker,
and Niels Reijers for proofreading this chapter, correcting numerous mistakes, filtering out jargon, and
rearranging material, all of which greatly enhanced the readability of the text.
References
[1] R. Jurdak, C. Lopes, and P. Baldi. A survey, classification and comparative analysis of medium access control protocols for ad hoc networks. IEEE Communications Surveys and Tutorials, 6, 2–16, 2004.
[2] N. Abramson. The ALOHA system: another alternative for computer communications. In Proceedings of the Fall Joint Computer Conference, Vol. 37. Montvale, NJ, 1970, pp. 281–285.
[3] L. Roberts. ALOHA packet system with and without slots and capture. ACM SIGCOMM Computer Communications Review, 5, 28–42, 1975.
[4] L. Kleinrock and F. Tobagi. Packet switching in radio channels: part I. Carrier sense multiple-access modes and their throughput-delay characteristics. IEEE Transactions on Communications, 23, 1400–1416, 1975.
[5] P. Karn. MACA: a new channel access method for packet radio. In Proceedings of the 9th ARRL Computer Networking Conference, September 1990, pp. 134–140.
[6] V. Bharghavan, A. Demers, S. Shenker, and L. Zhang. MACAW: a media access protocol for wireless LANs. In Proceedings of the Conference on Communications Architectures, Protocols and Applications. London, August 1994, pp. 212–225.
[7] IEEE standard 802.11. Wireless LAN Medium Access Control (MAC) and Physical Layer (PHY) Specifications, 1999.
[8] R. Krashinsky and H. Balakrishnan. Minimizing energy for wireless web access with bounded slowdown. In Proceedings of the 8th ACM International Conference on Mobile Computing and Networking (MobiCom'02). Atlanta, GA, September 2002, pp. 119–130.
[9] A. Gummalla and J. Limb. Wireless medium access control protocols. IEEE Communications Surveys and Tutorials, 3, 2–15, 2000.
[10] F. Liu, K. Xing, X. Cheng, and S. Rotenstrech. Energy-efficient MAC layer protocols in ad hoc networks. In Resource Management in Wireless Networking, M. Cardei, I. Cardei, and D.-Z. Du, Eds. Kluwer Academic Publishers, Dordrecht, 2004.
[11] RF Monolithics. TR1001 868.35 MHz Hybrid Transceiver.
[12] Chipcon Corporation. CC1000 Low Power FSK Transceiver.
[13] L. Feeney and M. Nilsson. Investigating the energy consumption of a wireless network interface in an ad hoc networking environment. In Proceedings of the IEEE INFOCOM, IEEE, Anchorage, Alaska, 2001, pp. 1548–1557.
[14] R. Szewczyk, J. Polastre, A. Mainwaring, and D. Culler. Lessons from a sensor network expedition. In Proceedings of the 1st European Workshop on Wireless Sensor Networks (EWSN '04). Berlin, Germany, January 2004.
[15] T. He, S. Krishnamurthy, J. Stankovic, T. Abdelzaher, L. Luo, R. Stoleru, T. Yan, L. Gu, J. Hui, and B. Krogh. An energy-efficient surveillance system using wireless sensor networks. In Proceedings of the 2nd International Conference on Mobile Systems, Applications, and Services (MobiSys'04). Boston, MA, June 2004, pp. 270–283.
[16] S. Kulkarni and M. Arumugam. TDMA service for sensor networks. In Proceedings of the 24th International Conference on Distributed Computing Systems (ICDCS'04), ADSN Workshop. Tokyo, Japan, March 2004, pp. 604–609.
[17] K. Sohrabi, J. Gao, V. Ailawadhi, and G. Pottie. Protocols for self-organization of a wireless sensor network. IEEE Personal Communications, 7, 16–27, 2000.
[18] G. Pei and C. Chien. Low power TDMA in large wireless sensor networks. In Proceedings of the Military Communications Conference (MILCOM 2001), Vol. 1. Vienna, VA, October 2001, pp. 347–351.
[19] C. Guo, L. Zhong, and J. Rabaey. Low power distributed MAC for ad hoc sensor networks. In Proceedings of the IEEE GlobeCom. San Antonio, TX, November 2001.
[20] C. Schurgers, V. Tsiatsis, S. Ganeriwal, and M. Srivastava. Optimizing sensor networks in the energy-latency-density design space. IEEE Transactions on Mobile Computing, 1, 70–80, 2002.
[21] A. El-Hoiydi. ALOHA with preamble sampling for sporadic traffic in ad hoc wireless sensor networks. In Proceedings of the IEEE International Conference on Communications (ICC). New York, April 2002.
[22] K. Arisha, M. Youssef, and M. Younis. Energy-aware TDMA-based MAC for sensor networks. In Proceedings of the IEEE Workshop on Integrated Management of Power Aware Communications, Computing and NeTworking (IMPACCT 2002). New York City, NY, May 2002.
[23] W. Ye, J. Heidemann, and D. Estrin. An energy-efficient MAC protocol for wireless sensor networks. In Proceedings of the 21st Conference of the IEEE Computer and Communications Societies (INFOCOM), Vol. 3. June 2002, pp. 1567–1576.
[24] E.-S. Jung and N. Vaidya. A power control MAC protocol for ad hoc networks. In Proceedings of the 8th ACM International Conference on Mobile Computing and Networking (MobiCom'02). Atlanta, GA, September 2002, pp. 36–47.
[25] J. Hill and D. Culler. Mica: a wireless platform for deeply embedded networks. IEEE Micro, 22, 12–24, 2002.
[26] K. Jamieson, H. Balakrishnan, and Y. Tay. Sift: a MAC protocol for event-driven wireless sensor networks. Technical report LCS-TR-894, MIT, May 2003.
[27] L. van Hoesel, T. Nieberg, H. Kip, and P. Havinga. Advantages of a TDMA based, energy-efficient, self-organizing MAC protocol for WSNs. In Proceedings of the IEEE VTC 2004 Spring. Milan, Italy, May 2004.
[28] T. van Dam and K. Langendoen. An adaptive energy-efficient MAC protocol for wireless sensor networks. In Proceedings of the 1st ACM Conference on Embedded Networked Sensor Systems (SenSys 2003). Los Angeles, CA, November 2003, pp. 171–180.
[29] V. Rajendran, K. Obraczka, and J. Garcia-Luna-Aceves. Energy-efficient, collision-free medium access control for wireless sensor networks. In Proceedings of the 1st ACM Conference on Embedded Networked Sensor Systems (SenSys 2003). Los Angeles, CA, November 2003, pp. 181–192.
[30] A. El-Hoiydi, J.-D. Decotignie, C. Enz, and E. Le Roux. Poster abstract: WiseMAC, an ultra low power MAC protocol for the WiseNET wireless sensor network. In Proceedings of the 1st ACM Conference on Embedded Networked Sensor Systems (SenSys 2003). Los Angeles, CA, November 2003.
[31] J. Polastre and D. Culler. B-MAC: an adaptive CSMA layer for low-power operation. Technical report cs294-f03/bmac, UC Berkeley, December 2003.
[32] J. Li and G. Lazarou. A bit-map-assisted energy-efficient MAC scheme for wireless sensor networks. In Proceedings of the 3rd International Symposium on Information Processing in Sensor Networks (IPSN'04). Berkeley, CA, April 2004, pp. 55–60.
[33] M. Miller and N. Vaidya. Minimizing energy consumption in sensor networks using a wakeup radio. In Proceedings of the IEEE Wireless Communications and Networking Conference (WCNC'04). Atlanta, GA, March 2004.
[34] G. Lu, B. Krishnamachari, and C. Raghavendra. An adaptive energy-efficient and low-latency MAC for data gathering in sensor networks. In Proceedings of the International Workshop on Algorithms for Wireless, Mobile, Ad Hoc and Sensor Networks (WMAN). Santa Fe, NM, April 2004.
[35] L. van Hoesel and P. Havinga. A lightweight medium access protocol (LMAC) for wireless sensor networks. In Proceedings of the 1st International Workshop on Networked Sensing Systems (INSS 2004). Tokyo, Japan, June 2004.
[36] W. Ye, J. Heidemann, and D. Estrin. Medium access control with coordinated, adaptive sleeping for wireless sensor networks. Technical report ISI-TR-567, USC/Information Sciences Institute, January 2003 (accepted for publication in IEEE/ACM Transactions on Networking).
[37] W. Heinzelman, A. Chandrakasan, and H. Balakrishnan. Energy-efficient communication protocol for wireless microsensor networks. In Proceedings of the 33rd Hawaii International Conference on System Sciences. January 2000.
35
Overview of Time Synchronization Issues in Sensor Networks
Weilian Su
Naval Postgraduate School
35.1 Introduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35-1
35.2 Design Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35-2
35.3 Factors Influencing Time Synchronization . . . . . . . . . . . . 35-3
35.4 Basics of Time Synchronization . . . . . . . . . . . . . . . . . . . . . . . . 35-3
35.5 Time Synchronization Protocols for
Sensor Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35-6
35.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35-9
Acknowledgment. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35-9
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35-9
35.1 Introduction
In the near future, small intelligent devices will be deployed in homes, plantations, oceans, rivers, streets,
and highways to monitor the environment [1]. Events, such as target tracking, speed estimating, and
ocean current monitoring, require the knowledge of time between sensor nodes that detect the events.
In addition, sensor nodes may have to time-stamp data packets for security reasons. With a common view
of time, voice and video data from different sensor nodes can be fused and displayed in a meaningful
way at the sink. Also, medium access schemes such as Time Division Multiple Access (TDMA) require the
nodes to be synchronized, so the nodes can be turned off to save energy.
The purpose of any time synchronization technique is to maintain a similar time within a certain
tolerance throughout the lifetime of the network, or among a specific set of nodes in the network.
Combined with the criteria that sensor nodes have to be energy efficient, low-cost, and small in a
multihop environment, this requirement makes time synchronization a challenging problem to solve. In addition, the sensor
nodes may be left unattended for a long period of time, for example, in deep space or on an ocean floor.
When messages are exchanged using short-distance multihop broadcast, the software and medium access
time and the variation of the access time may contribute the most to time fluctuations and differences in
the path delays. Also, the time difference between sensor nodes may become significant over time due to the
drifting effect of the local clocks.
In this chapter, the background of time synchronization is provided to enable new developments or
enhancements of timing techniques for sensor networks. The design challenges and factors influencing
time synchronization are described in Sections 35.2 and 35.3, respectively. In addition, the basics of time
synchronization for sensor networks are explained in Section 35.4. Afterwards, different types of timing
techniques are discussed in Section 35.5. Last, the chapter is concluded in Section 35.6.
35.2 Design Challenges
In the future, many low-end sensor nodes will be deployed to minimize the cost of sensor networks.
These nodes may work collaboratively to provide time synchronization for the whole sensor
network. The precision of the synchronized clocks depends on the needs of the applications. For example,
a sensor network requiring TDMA service may require microsecond-level differences among neighbor
nodes, while a data-gathering application for sensor networks requires only milliseconds of precision.
As sensor networks are application driven, the design challenges of a time synchronization protocol
are also dictated by the application. These challenges provide an overall guideline and requirement
when considering the features of a time synchronization protocol for sensor networks; they are robustness,
energy awareness, server-less operation, light weight, and tunable service:
Robust. Sensor nodes may fail, and the failures should not have a significant effect on the time synchronization
error. If sensor nodes depend on a specific master to synchronize their clocks, a failure or anomaly
of the master's clock may create a cascade effect in which nodes in the network become unsynchronized.
So, a time synchronization protocol has to handle the unexpected or periodic failures of the sensor nodes.
If failures do occur, the errors caused by these failures should not be propagated throughout the network.
Energy aware. Since each node has limited battery power, the use of resources should be evenly spread and
controlled. A time synchronization protocol should use the minimum number of messages to synchronize
the nodes in the shortest time. In addition, the load for time synchronization should be shared, so some
nodes in the network do not fail earlier than others. If some parts of the network fail earlier than others,
the partitioned networks may drift apart from each other and become unsynchronized.
Server-less. A precise time server may not be available. In addition, time servers may fail when placed
in the sensor field. As a result, sensor nodes should be able to synchronize to a common time without
precise time servers. When precise time servers are available, the quality of the synchronized clocks
as well as the time to synchronize the clocks of the network should be much better. This server-less feature
also helps to address the robustness challenge stated earlier.
Light-weight. The complexity of the time synchronization protocol has to be low in order to be
programmed into the sensor nodes. Besides being energy limited, the sensor nodes are memory limited
as well. The synchronization protocol may be programmed into a field-programmable gate array (FPGA)
or designed into an ASIC. By having the time synchronization protocol tightly integrated with the
hardware, the delay and variation of the processing may be smaller. With the increase of precision,
however, the cost of a sensor node is higher.
Tunable service. Some services, such as medium access, may require time synchronization to be always
on, while others only need it when there is an event. Since time synchronization can consume a lot of
energy, a tunable time synchronization service is applicable for some applications. Nevertheless, there are
needs for both types of synchronization protocols.
The above challenges provide a guideline for developing the various types of time synchronization protocols
that are applicable to sensor networks. A time synchronization protocol may have a mixture of
these design features. In addition, some applications in the sensor networks may not require the time
synchronization protocol to meet all these requirements. For example, a data-gathering application may
require the tunable service and light-weight features more than the server-less capability. The tunable
service and light-weight features allow the application to gather precise data when the users require it.
In addition, the nodes that are not part of this data-gathering process may not have to be synchronized.
Also, the precision of the time does not need to be high, because the users may only need millisecond
precision to satisfy their needs.
As these design challenges are important for guiding the development of a time synchronization
protocol, the influencing factors that affect the quality of the synchronized clocks have to be discussed.
Although the influencing factors are similar to those in existing distributed computer systems, they are at different
extreme levels. These influencing factors are discussed in Section 35.3.
35.3 Factors Influencing Time Synchronization
Regardless of the design challenges that a time synchronization protocol wants to address, the protocol
still needs to address the inherent problems of time synchronization. In addition, small and low-end
sensor nodes may exhibit device behaviors that may be much worse than those of large systems such as personal
computers (PCs). As a result, time synchronization with these nodes presents a different set of problems.
Some of the factors influencing time synchronization in large systems also apply to sensor networks [2];
they are temperature, phase noise, frequency noise, asymmetric delays, and clock glitches:
Temperature. Since sensor nodes are deployed in various places, the temperature variation throughout
the day may cause the clock to speed up or slow down. For a typical PC, the clock drifts a few parts per
million (ppm) during the day [3]. For low-end sensor nodes, the drifting may be even worse.
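To see what a few ppm means in practice, the worst-case error accumulated over a day is a one-line computation. A minimal sketch (the 50 ppm figure below is an illustrative value for a cheap crystal, not a number from the text):

```python
def drift_per_day_s(drift_ppm: float) -> float:
    """Worst-case clock error accumulated over 24 hours at a given drift rate."""
    return drift_ppm * 1e-6 * 24 * 3600

# A hypothetical 50 ppm crystal may gain or lose on the order of
# 4.3 seconds per day if never corrected.
error = drift_per_day_s(50)
```

This is why unsynchronized low-end nodes drift apart quickly, and why even a data-gathering application that tolerates millisecond precision must resynchronize periodically.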
Phase noise. Some of the causes of phase noise are access fluctuation at the hardware interface,
response variation of the operating system to interrupts, and jitter in the network delay. The jitter in the
network delay may be due to medium access and queueing delays.
Frequency noise. The frequency noise is due to the instability of the clock crystal [4]. A low-end
crystal may experience large frequency fluctuation, because the frequency spectrum of the crystal has
large sidebands on adjacent frequencies. The drift rate values for quartz oscillators are between 10^-4
and 10^-6 [5].
Asymmetric delay. Since sensor nodes communicate with each other through the wireless medium,
the delay of the path from one node to another may be different from that of the return path. As a result, an
asymmetric delay may cause an offset to the clock that cannot be detected by a variance-type method [2].
If the asymmetric delay is static, the time offset between any two nodes is also static. The asymmetric delay
is bounded by one-half of the round-trip time between the two nodes [2].
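The half-round-trip bound is easy to state in code. A small sketch (the delay values are hypothetical):

```python
def asymmetry_offset_s(forward_delay_s: float, return_delay_s: float) -> float:
    """Clock-offset error introduced when a protocol assumes symmetric
    delays but the forward and return paths actually differ."""
    return (forward_delay_s - return_delay_s) / 2.0

def offset_bound_s(round_trip_s: float) -> float:
    """The asymmetry-induced offset is bounded by half the round-trip time [2]."""
    return round_trip_s / 2.0

fwd, ret = 0.007, 0.003                  # 7 msec out, 3 msec back (hypothetical)
err = asymmetry_offset_s(fwd, ret)       # a 2 msec offset, invisible to the protocol
assert abs(err) <= offset_bound_s(fwd + ret)
```

Because the protocol only observes the sum of the two one-way delays, any split between them that preserves that sum is indistinguishable, which is exactly why the offset error cannot exceed half the round trip.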
Clock glitches. Clock glitches are sudden jumps in time. They may be caused by hardware or software
anomalies such as frequency and time steps.
Since sensor nodes are randomly deployed and their broadcast ranges are small, the influencing factors
may shape the design of the time synchronization protocol. In addition, the links between the sensor
nodes may not be reliable. As a result, the influencing factors may have to be addressed differently. In the
following section, the basics of time synchronization for sensor networks are discussed.
35.4 Basics of Time Synchronization
As the factors described in Section 35.3 influence the error budget of the synchronized clocks, the purpose
of a time synchronization protocol is to minimize the effects of these factors. Before developing a solution
to address these factors, some basics of time synchronization for sensor networks need to be discussed.
These basics provide the fundamentals for designing a time synchronization protocol.
If a better clock crystal is used, the drift rate may be much smaller. Usually, the hardware clock time
H(t) at real-time t is within a linear envelope of the real-time, as illustrated in Figure 35.1. Since the clock
drifts away from real-time, the time difference between two events measured with the same hardware clock
may have a maximum error of ρ(b − a) [5], where a and b are the times of occurrence of the first and second
events, respectively. For modern computers, the clock granularity may be negligible, but it may contribute
a significant portion to the error budget if the clock of a sensor node is really coarse, running at kHz range
instead of MHz. In certain applications, a sensor node may have a volume on the order of cubic centimeters [6], so a fast oscillator
may not be possible or suitable for such a size.

FIGURE 35.1 Drifting of hardware clock time. [Figure: hardware clock time H(t) plotted against real-time t; H(t) stays within the linear envelope between lines of slope 1 − ρ and 1 + ρ around the ideal time line of slope 1.]
Regardless of the clock granularity, the hardware clock time H(t) is usually translated into a virtual
clock time by adding an adjustment constant to it. Normally, it is the virtual clock time that we read from
a computer. Hence, a time synchronization protocol may adjust the virtual clock time or discipline the
hardware clock to compensate for the time difference between the clocks of the nodes. Either approach
has to deal with the factors influencing time synchronization as described earlier.
When an application issues a request to obtain the time, the time is returned after a certain delay. This
software access delay may fluctuate according to the loading of the system. This type of fluctuation is
nondeterministic and may be lessened if a real-time operating system and hardware architecture are used.
For low-end sensor nodes, the software access time may be on the order of a few hundred microseconds.
For example, a Mica mote running at 4 MHz [7] has a clock granularity of 0.25 μsec. If the node
is 80% loaded and it takes 100 cycles to obtain the time, the software access time is around 125 μsec.
In addition to the software access time, the medium access time also contributes to the nondeterministic
delay that a message experiences. If carrier-sense multiple access (CSMA) is used, the back-off window
size as well as the traffic load affect the medium access time [8–10]. Once the sensor node obtains the
channel, the transmission and propagation times are fairly deterministic, and they can be estimated from
the packet size, transmission rate, and speed of light.
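The 125 μsec figure for the Mica example follows from scaling the 100 cycles by the fraction of CPU time actually available. A sketch of that arithmetic (the function is purely illustrative, not part of any mote API):

```python
def software_access_time_s(cpu_hz: float, cycles: int, load: float) -> float:
    """Time to execute `cycles` instructions when the CPU is already
    `load` busy, so only (1 - load) of its throughput is available."""
    return cycles / (cpu_hz * (1.0 - load))

# Mica mote: 4 MHz clock, 80% loaded, 100 cycles to read the time.
t = software_access_time_s(4e6, 100, 0.80)   # about 125e-6 seconds
```

At 4 MHz a cycle takes 0.25 μsec, but with only 20% of the CPU available the 100 cycles effectively stretch fivefold, giving the 125 μsec in the text.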
In summary, the delays experienced when sending a message at real-time t1 and receiving an acknowledgment
(ACK) at real-time t4 are shown in Figure 35.2. The message from node A incurs the software
access, medium access, transmission, and propagation times. These times are represented by S1, M1, T1,
and P12. Once the message is received by node B at t2, it incurs extra delays through receiving and
processing. After the message is processed, an ACK is sent to node A at t3.

FIGURE 35.2 Round-trip time. [Figure: node A sends a message at t1 (delays S1, M1, T1, P12); node B receives it at t2 (delays R2, S2) and sends an ACK at t3 (delays S3, M3, T3, P34); node A receives the ACK at t4 (delays R4, S4). S = software access time, M = medium access time, T = transmission time, P = propagation time, R = reception time.]

The total delay at node B is the
summation of R2, S2, (1 − ρB)(t3 − t2), S3, M3, and T3, where ρB is the drift rate at node B and the difference
(t3 − t2) accounts for the waiting time before an ACK is sent to node A by node B. After node B sends
the ACK, the ACK propagates through the wireless medium and arrives at node A. Afterwards, node A
processes the ACK. The path delays for sending and receiving the ACK from node B to A are P34, R4,
and S4. The round-trip time in real-time t for sending a message and receiving an ACK is calculated by

t4 − t1 = S1 + M1 + T1 + P12 + R2 + S2 + (1 − ρB)(t3 − t2) + S3 + M3 + T3 + P34 + R4 + S4    (35.1)

where S, M, T, P, and R are the software access, medium access, transmission, propagation, and reception
times, respectively.
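Equation (35.1) is a straight sum of per-stage delays, which makes it easy to transcribe. A sketch with hypothetical values (every delay below is made up for illustration; only the structure of the sum comes from the text):

```python
def round_trip_s(S1, M1, T1, P12, R2, S2, rho_B, wait_s,
                 S3, M3, T3, P34, R4, S4):
    """Round-trip time t4 - t1 of Equation (35.1): the forward-path
    delays, node B's drift-adjusted waiting time wait_s = t3 - t2,
    and the return-path delays."""
    return (S1 + M1 + T1 + P12 + R2 + S2
            + (1.0 - rho_B) * wait_s
            + S3 + M3 + T3 + P34 + R4 + S4)

# Hypothetical delays in seconds; 50 ppm drift rate at node B,
# 3 msec processing wait at node B before the ACK.
rtt = round_trip_s(125e-6, 2e-3, 1e-3, 1e-6, 1e-3, 125e-6,
                   50e-6, 3e-3,
                   125e-6, 2e-3, 1e-3, 1e-6, 1e-3, 125e-6)
```

Note how small the drift correction is relative to the medium-access terms: for a 3 msec wait, 50 ppm changes the sum by only 0.15 μsec, while the nondeterministic S and M terms dominate the variation.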
If the round-trip time is measured using the hardware clock of node A, it has to be adjusted by the
drift rate ρA of node A. If the granularity ε of the hardware clock is coarse, the error contributed by the
granularity should be accounted for. As a result, the round-trip time measured with the hardware clock is
bounded by an error associated with the clock drift and granularity as determined by

(1 − ρA)(t4 − t1) − ε ≤ H(t4) − H(t1) < (1 + ρA)(t4 − t1) + ε    (35.2)
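Equation (35.2) can be sanity-checked numerically. A sketch using a hypothetical 10 msec round trip, a 50 ppm drift rate, and the 0.25 μsec granularity of the Mica example (all three are illustrative inputs, not values prescribed by the text):

```python
def hw_rtt_bounds_s(true_rtt_s: float, rho_A: float, granularity_s: float):
    """Bounds of Equation (35.2) on the round trip as read from node A's
    hardware clock, i.e., on H(t4) - H(t1)."""
    lo = (1.0 - rho_A) * true_rtt_s - granularity_s
    hi = (1.0 + rho_A) * true_rtt_s + granularity_s
    return lo, hi

lo, hi = hw_rtt_bounds_s(10e-3, 50e-6, 0.25e-6)
assert lo <= 10e-3 < hi   # the true round trip lies within the measured bounds
```

For these numbers the window is under 2 μsec wide, which shows that for short hop-by-hop exchanges the drift and granularity terms are small compared with the fluctuating access delays discussed next.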
The bound for the round-trip time fluctuates with respect to time, since the software and medium access
times fluctuate according to the load at the node and in the channel. Although the transmission, propagation,
and reception times may be deterministic, they may contribute to the asymmetric delay that can cause a
time offset between nodes A and B.
In the following section, different types of time synchronization protocols are described. Each of them
tries to minimize the effect of the nondeterministic and asymmetric delays. For sensor networks, it is best
to minimize the propagation delay variation. For example, the delays and jitters between two nodes may be
different in the forward and return paths. In addition, the jitters may vary significantly due to frequent node
failures, since the messages are relayed hop-by-hop between the two nodes. The synchronization protocols
in the following section focus on synchronizing nodes hop-by-hop, so the propagation time and variation
do not have too much effect on the error of the synchronized clocks. Although the sensor nodes are densely
deployed and can take advantage of the close distance, the medium and software access times may
contribute the most to the nondeterminism of the path delay during a one-hop synchronization. The
way to provide time synchronization for sensor networks may differ between applications. The
current timing techniques that are available for different applications are described in the following section.
35.5 Time Synchronization Protocols for Sensor Networks
There are three types of timing techniques, as shown in Table 35.1, and each of these types has to
address the design challenges and factors affecting time synchronization mentioned in Sections 35.2
and 35.3, respectively. In addition, the timing techniques have to address the mapping between the
sensor network time and the Internet time, for example, coordinated universal time (UTC). In the following
paragraphs, examples of these types of timing techniques are described, namely the Network Time
Protocol (NTP) [11], the Timing-sync Protocol for Sensor Networks (TPSN) [12], Reference-Broadcast
Synchronization (RBS) [13], and the Time-Diffusion Synchronization Protocol (TDP) [14].
In the Internet, the NTP is used to discipline the frequency of each node's oscillator. The accuracy of
NTP synchronization is on the order of milliseconds [15]. It may be useful to use NTP to discipline the
oscillators of the sensor nodes, but the connection to the time servers may not be possible because of
frequent sensor node failures. In addition, disciplining all the sensor nodes in the sensor field may be
a problem due to interference from the environment and large variation of delay between different parts
of the sensor field. The interference can temporarily partition the sensor field into multiple smaller fields,
causing undisciplined clocks among these smaller fields. The NTP protocol may be considered as type (1)
of the timing techniques. In addition, it has to be refined in order to address the design challenges presented
by the sensor networks.
As of now, the NTP is very computationally intensive and requires a precise time server to synchronize the
nodes in the network. In addition, it does not take into account the energy consumption required for time
synchronization. As a result, the NTP does not satisfy the energy-aware, server-less, and light-weight design
challenges of the sensor networks. Although the NTP can be robust, it may suffer large propagation delays
when sending timing messages to the time servers. In addition, the nodes are synchronized in a hierarchical
manner, and some time servers in the middle of the hierarchy may fail, causing unsynchronized nodes in
the network. Once these nodes fail, it is hard to reconfigure the network since the hierarchy is manually
configured.
Another time synchronization technique that adopts some concepts from NTP is TPSN. The TPSN
requires the root node to synchronize all or part of the nodes in the sensor field. The root node synchronizes
the nodes in a hierarchical way. Before synchronization, the root node constructs the hierarchy by
broadcasting a level_discovery packet. The first level of the hierarchy is level 0, which is where the root
node resides. The nodes receiving the level_discovery packet from the root node are the nodes belonging
to level 1. Afterwards, the nodes in level 1 broadcast their level_discovery packet, and neighbor nodes
receiving the level_discovery packet for the first time are the level 2 nodes. This process continues until all
the nodes in the sensor field have a level number.

TABLE 35.1 Three Types of Timing Techniques
(1) Relies on fixed time servers to synchronize the network: The nodes are synchronized to time servers that are readily available. These time servers are expected to be robust and highly precise.
(2) Translates time throughout the network: The time is translated hop-by-hop from the source to the sink. In essence, it is a time translation service.
(3) Self-organizes to synchronize the network: The protocol does not depend on specialized time servers. It automatically organizes and determines the master nodes as the temporary time servers.

FIGURE 35.3 Two-way message handshake. [Figure: node A sends a synchronization pulse at g1; node B receives it at g2 and returns an acknowledgment at g3, which node A receives at g4.]
The root node sends a time_sync packet to initialize the time synchronization process. Afterwards, the
nodes in level 1 synchronize to level 0 by performing the two-way handshake as shown in Figure 35.3. This
type of handshake is used by NTP to synchronize the clocks of distributed computer systems. At the
end of the handshake at time g4, node A obtains the times g1, g2, and g3 from the acknowledgment packet.
The times g2 and g3 are obtained from the clock of sensor node B, while g1 and g4 are from node A.
After processing the acknowledgment packet, node A readjusts its clock by the clock drift value Δ, where
Δ = ((g2 − g1) − (g4 − g3))/2. At the same time, the level 2 nodes overhear this message handshake and
wait for a random time before synchronizing with level 1 nodes. This synchronization process continues
until all the nodes in the network are synchronized. Since TPSN enables time synchronization from one
root node, it is type (1) of the timing techniques.
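The drift calculation from the handshake timestamps can be sketched as follows (a minimal illustration; the function name and the numbers in the example are ours, not from the source):

```python
def tpsn_clock_drift(g1, g2, g3, g4):
    """Clock drift of node A relative to node B, per the TPSN two-way
    handshake: A sends at g1 (A's clock), B receives at g2 and replies
    at g3 (B's clock), and A receives the reply at g4 (A's clock)."""
    return ((g2 - g1) - (g4 - g3)) / 2.0

# Example: B's clock is 5 time units ahead of A's, one-way delay is 2.
# A sends at g1=10; B receives at g2=17 and replies at g3=17 (B's clock);
# A receives at g4=14 (A's clock).
drift = tpsn_clock_drift(10, 17, 17, 14)  # -> 5.0
```

Node A then adds the drift to its clock; the symmetric-delay assumption cancels the propagation time out of the estimate.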
TPSN is based on a sender-receiver synchronization model, where the receiver synchronizes with
the time of the sender according to the two-way message handshake shown in Figure 35.3. It aims
to provide a light-weight and tunable time synchronization service. On the other hand, it requires a time
server and does not address the robust and energy-aware design goals. Since the design of TPSN is based
on a hierarchical methodology similar to NTP, nodes within the hierarchy may fail and cause other nodes to
become unsynchronized. In addition, node movements may render the hierarchy useless, because nodes may
move out of their levels. When this happens, nodes at level i cannot synchronize with nodes at level i − 1,
and synchronization may fail throughout the network.
As for type (2) of the timing techniques, the RBS provides an instantaneous time synchronization among
a set of receivers that are within the reference broadcast range of the transmitter. The transmitter broadcasts
m reference packets. Each of the receivers within the broadcast range records the time-of-arrival
of the reference packets. Afterwards, the receivers communicate with each other to determine their mutual offsets.
To provide multihop synchronization, it is proposed to use nodes that receive two or more reference
broadcasts from different transmitters as translation nodes. These translation nodes are used to translate
the time between different broadcast domains.
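The receiver-to-receiver offset determination can be sketched as follows. The source only says the receivers "determine the offsets"; taking the mean of the pairwise differences of the recorded arrival times is our assumption, and the names and numbers are illustrative:

```python
def rbs_offset(arrivals_i, arrivals_j):
    """Estimate the clock offset between receivers i and j from the
    local arrival times each recorded for the same m reference packets."""
    m = len(arrivals_i)
    return sum(ti - tj for ti, tj in zip(arrivals_i, arrivals_j)) / m

# Receiver j's clock runs 3 units behind receiver i's for all m = 3 packets:
offset = rbs_offset([100.0, 110.0, 120.0], [97.0, 107.0, 117.0])  # -> 3.0
```

Because both receivers hear the very same broadcast, the sender-side nondeterminism cancels and only receiver-side differences remain in the estimate.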
As shown in Figure 35.4, nodes A, B, and C are transmitter, receiver, and translation nodes,
respectively. The transmitter nodes broadcast their timing messages, and the receiver nodes receive these
messages. Afterwards, the receiver nodes synchronize with each other. The sensor nodes that are within
the broadcast regions of both transmitter nodes are the translation nodes. When an event occurs,
a message describing the event with a time stamp is translated by the translation nodes as the message
is routed back to the sink. Although this time synchronization service is tunable and light-weight, there
may not be translation nodes on the path over which a message is relayed. As a result, the service may not
be available on some routes. In addition, this protocol is not suitable for medium access schemes, such as
TDMA, since the clocks of all the nodes in the network are not adjusted to a common time.
FIGURE 35.4 Illustration of the RBS. (The figure shows transmitter, receiver, and translation nodes; node C sits in the overlap of two broadcast regions and acts as a translation node.)
FIGURE 35.5 TDP concept. (Master nodes C and G diffuse the time over 3 hops via diffused leader nodes, among them M, N, D, E, and F.)
Another emerging timing technique is the TDP. The TDP is used to maintain a common time throughout
the network within a certain tolerance. The tolerance level can be adjusted based on the purpose
of the sensor network. The TDP automatically self-configures by electing master nodes to synchronize
the sensor network. In addition, the election process is sensitive to energy requirements as well as to
the quality of the clocks. Even when the sensor network is deployed in unattended areas, the TDP still
synchronizes the unattended network to a common time. It is considered type (3) of the timing
techniques.
The TDP concept is illustrated in Figure 35.5. The elected master nodes are nodes C and G. First, the
master nodes send a message to their neighbors to measure the round-trip times. Once the neighbors
receive the message, they self-determine whether they should become diffused leader nodes. The ones elected to
become diffused leader nodes reply to the master nodes and start sending a message to measure the round-trip
times to their own neighbors. As shown in Figure 35.5, nodes M, N, and D are the diffused leader nodes of node C.
Once the replies are received by the master nodes, the round-trip time and the standard deviation of the
round-trip time are calculated. The one-way delay from the master nodes to the neighbor nodes is half of
the measured round-trip time. Afterwards, the master nodes send a time-stamped message containing the
standard deviation to the neighbor nodes. The time in the time-stamped message is adjusted with the
one-way delay. Once the diffused leader nodes receive the time-stamped message, they broadcast the time-stamped
message after adjusting the time in the message with their own measured one-way delay and
inserting their own standard deviation of the round-trip time. This diffusion process continues for n times,
where n is the number of hops from the master nodes. In Figure 35.5, the time is diffused 3 hops
from the master nodes C and G. The nodes D, E, and F are the diffused leader nodes that diffuse the
time-stamped messages originated from the master nodes.
Nodes that have received more than one time-stamped message originated from different
master nodes use the standard deviations carried in the time-stamped messages as weights for the
contribution of each time to their new time. In essence, the nodes weight the times diffused by the master
nodes to obtain a new time for themselves. This process provides a smooth time variation between the
nodes in the network. The smooth transition is important for some applications, such as target tracking
and speed estimation.
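The weighting step can be sketched as follows. The source says only that the standard deviations serve as weights, so the inverse-deviation weighting below (smaller deviation, larger weight) is our assumption, and the names and numbers are illustrative:

```python
def tdp_weighted_time(times, stddevs):
    """Combine the times diffused by different master nodes, weighting
    each by the inverse of the accumulated round-trip standard deviation
    carried in its time-stamped message."""
    weights = [1.0 / s for s in stddevs]
    total = sum(weights)
    return sum(w * t for w, t in zip(weights, times)) / total

# Two diffused times; the first has half the deviation, so it counts double:
t = tdp_weighted_time([100.0, 103.0], [0.5, 1.0])  # -> 101.0
```

The weighted blend is what yields the smooth time variation between neighboring nodes that the diffusion aims for.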
Since the master nodes are autonomously elected, the network is robust to failures. Although some of
the nodes may die, other nodes in the network can still self-determine to become master
nodes. This feature also enables the network to become server-less if necessary and to reach an equilibrium
time. In addition, the master and diffused leader nodes are self-determined based on their own energy
levels. The TDP is light-weight, but it may not be as tunable as the RBS.
In summary, these timing techniques may be used for different types of applications; each of them
has its own benefits. All of these techniques try to address the factors influencing time synchronization
while designing according to the challenges described in Section 35.2. Depending on the types of
services required by the applications or the hardware limitations of the sensor nodes, some of these timing
techniques may be applied.
35.6 Conclusions
The design challenges and factors influencing time synchronization for sensor networks are described in
Sections 35.2 and 35.3, respectively. They provide guidelines for developing time synchronization
protocols. The requirements of sensor networks are different from those of traditional distributed computer
systems. As a result, new types of timing techniques are required to address the specific needs of the
applications. These techniques are described in Section 35.5. Since the range of applications of sensor
networks is wide, new timing techniques are encouraged for different types of applications. This is to
provide optimized schemes tailored for unique environments and purposes.
Acknowledgment
The author wishes to thank Dr. Ian F. Akyildiz for his encouragement and support.
References
[1] Akyildiz, I.F. et al., Wireless Sensor Networks: A Survey. Computer Networks Journal, 393–422, 2002.
[2] Levine, J., Time Synchronization Over the Internet Using an Adaptive Frequency-Locked Loop. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 46, 888–896, 1999.
[3] Mills, D.L., Adaptive Hybrid Clock Discipline Algorithm for the Network Time Protocol. IEEE/ACM Transactions on Networking, 6, 505–514, 1998.
[4] Allan, D., Time and Frequency (Time-Domain) Characterization, Estimation, and Prediction of Precision Clocks and Oscillators. IEEE Transactions on Ultrasonics, Ferroelectrics, and Frequency Control, 34, 647–654, 1987.
[5] Cristian, F. and Fetzer, C., Probabilistic Internal Clock Synchronization. In Proceedings of the Thirteenth Symposium on Reliable Distributed Systems. Dana Point, CA, October 1994, pp. 22–31.
[6] Pottie, G.J. and Kaiser, W.J., Wireless Integrated Network Sensors. Communications of the ACM, 43, 551–558, 2000.
[7] MICA Motes and Sensors, http://www.xbow.com.
[8] Bianchi, G., Performance Analysis of the IEEE 802.11 Distributed Coordination Function. IEEE Journal on Selected Areas in Communications, 18, 535–547, 2000.
[9] Crow, B.P. et al., Investigation of the IEEE 802.11 Medium Access Control (MAC) Sublayer Functions. In Proceedings of IEEE INFOCOM '97. Kobe, Japan, April 1997, pp. 126–133.
[10] Tay, Y.C. and Chua, K.C., A Capacity Analysis for the IEEE 802.11 MAC Protocol. ACM Wireless Networks Journal, 7, 159–171, 2001.
[11] Mills, D.L., Internet Time Synchronization: The Network Time Protocol. In Global States and Time in Distributed Systems. IEEE Computer Society Press, Washington, 1994.
[12] Ganeriwal, S., Kumar, R., and Srivastava, M.B., Timing-Sync Protocol for Sensor Networks. In ACM SenSys 2003. Los Angeles, CA, November 2003 (to appear).
[13] Elson, J., Girod, L., and Estrin, D., Fine-Grained Network Time Synchronization Using Reference Broadcasts. In Proceedings of the Fifth Symposium on Operating Systems Design and Implementation (OSDI 2002). Boston, MA, December 2002.
[14] Su, W. and Akyildiz, I.F., Time-Diffusion Synchronization Protocol for Wireless Sensor Networks. IEEE/ACM Transactions on Networking, 13(2), April 2005.
[15] IEEE 1588, Standard for a Precision Clock Synchronization Protocol for Networked Measurement and Control Systems, 2002.
36
Distributed Localization Algorithms
Koen Langendoen and Niels Reijers
Delft University of Technology
36.1 Introduction
36.2 Localization Algorithms
     Generic Approach · Phase 1: Distance to Anchors · Phase 2: Node Position · Phase 3: Refinement
36.3 Simulation Environment
     Standard Scenario
36.4 Results
     Phase 1: Distance to Anchors · Phase 2: Node Position
36.5 Discussion
     Phases 1 and 2 Combined · Phase 3: Refinement · Communication Cost · Recommendations
36.6 Conclusions
     Future Work
Acknowledgments
References
36.1 Introduction
New technology offers new opportunities, but it also introduces new problems. This is particularly true for
sensor networks, where the capabilities of individual nodes are very limited. Hence, collaboration between
nodes is required, but energy conservation is a major concern, which implies that communication should
be minimized. These conflicting objectives require unorthodox solutions for many situations.
A recent survey by Akyildiz et al. [1] discusses a long list of open research issues that must be addressed
before sensor networks can become widely deployed. The problems range from the physical layer (low-power
sensing, processing, and communication hardware) all the way up to the application layer (query
and data dissemination protocols). In this chapter we address the issue of localization in ad-hoc sensor
networks. That is, we want to determine the location of individual sensor nodes without relying on
external infrastructure (base stations, satellites, etc.).
The localization problem has received considerable attention in the past, as many applications need to
know where objects or persons are, and hence various location services have been created. Undoubtedly,
Reprinted from K. Langendoen and N. Reijers. Elsevier Computer Networks, 43: 499–518, 2003. With permission.
the Global Positioning System (GPS) is the most well-known location service in use today. The approach
taken by GPS, however, is unsuitable for low-cost, ad-hoc sensor networks, since GPS is based on extensive
infrastructure (i.e., satellites). Likewise, solutions developed in the area of robotics [2–4] and ubiquitous
computing [5] are generally not applicable to sensor networks, as they require too much processing power
and energy.
Recently a number of localization systems have been proposed specifically for sensor networks [6–14].
We are interested in truly distributed algorithms that can be employed on large-scale ad-hoc sensor
networks (100+ nodes). Such algorithms should be:
1. Self-organizing (i.e., do not depend on global infrastructure).
2. Robust (i.e., be tolerant to node failures and range errors).
3. Energy efficient (i.e., require little computation and, especially, communication).
These requirements immediately rule out some of the proposed localization algorithms for sensor networks.
In this chapter, we carry out a thorough sensitivity analysis on three algorithms that do meet the
above requirements to determine how well they perform under various conditions. In particular, we study
the impact of the following parameters: range errors, connectivity (density), and anchor fraction. The
algorithms differ in their position accuracy, network coverage, induced network traffic, and processor
load. Given the (slightly) different design objectives of the three algorithms, it is no surprise that each
algorithm outperforms the others under a specific set of conditions. Under each condition, however, even
the best algorithm leaves much room for improving accuracy and/or increasing coverage.
In this chapter we will:
1. Identify a common, 3-phase structure in the selected distributed localization algorithms.
2. Identify a generic optimization applicable to all algorithms.
3. Provide a detailed comparison on a single (simulation) platform.
4. Show that no algorithm performs best in all cases, and that there is room for improvement in
most cases.
Section 36.2 discusses the selection, generic structure, and operation of three distributed localization
algorithms for large-scale ad-hoc sensor networks. These algorithms are compared on a simulation
platform, which is described in Section 36.3. Section 36.4 presents intermediate results for the individual
phases, while Section 36.5 provides a detailed overall comparison and an in-depth sensitivity analysis.
Finally, we give conclusions in Section 36.6.
36.2 Localization Algorithms
Before discussing distributed localization in detail, we first outline the context in which these algorithms
have to operate. A first consideration is that the requirement for sensor networks to be self-organizing
implies that there is no fine control over the placement of the sensor nodes when the network is installed
(e.g., when nodes are dropped from an airplane). Consequently, we assume that nodes are randomly
distributed across the environment. For simplicity and ease of presentation we limit the environment to
2 dimensions, but all algorithms are capable of operating in 3D. Figure 36.1 shows an example network
with 25 nodes; pairs of nodes that can communicate directly are connected by an edge. The connectivity
of the nodes in the network (i.e., the average number of neighbors) is an important parameter that has a
strong impact on the accuracy of most localization algorithms (see Sections 36.4 and 36.5). It is initially
determined by the node density and radio range, and in some cases it can be adjusted dynamically by
changing the transmit power of the RF radio.

In some application scenarios, nodes may be mobile. In this chapter, however, we focus on static networks,
where nodes do not move, since this is already a challenging condition for distributed localization.
We assume that some anchor nodes have a priori knowledge of their own position with respect to some
FIGURE 36.1 Example network topology (anchor and unknown nodes).
global coordinate system. Note that anchor nodes have the same capabilities (processing, communication,
energy consumption, etc.) as all other sensor nodes with unknown positions; we do not consider
approaches based on an external infrastructure with specialized beacon nodes (access points) as used in,
for example, the GPS-less [6], Cricket [15], and RADAR [16] location systems. Ideally the fraction of
anchor nodes should be as low as possible to minimize the installation costs. Our simulation results show
that, fortunately, most algorithms are rather insensitive to the number of anchors in the network.

The final element that defines the context of distributed localization is the capability to measure the
distance between directly connected nodes in the network. From a cost perspective it is attractive to
use the RF radio for measuring the range between nodes, for example, by observing the signal strength.
Experience has shown, however, that this approach yields poor distance estimates [17–19]. Much better
results are obtained by time-of-flight measurements, particularly when acoustic and RF signals are combined
[14,19,20]; accuracies of a few percent of the transmission range are reported. However, this requires
extra hardware on the sensor boards.

Several different ways of dealing with the problem of inaccurate distance information have been proposed.
The APIT [10] algorithm by He et al. only needs distance information accurate enough for two
nodes to determine which of them is closest to an anchor. GPS-less [6] by Bulusu et al. and DV-hop [11] by
Niculescu and Nath do not use distance information at all, and are based on topology information only.
Ramadurai and Sichitiu [12] propose a probabilistic approach to the localization problem. Not only the
measured distance, but also the confidence in the measurement is used.

It is important to realize that the three main context parameters (connectivity, anchor fraction, and
range errors) are dependent. Poor range measurements can be compensated for by using many anchors
and/or a high connectivity. This chapter provides insight into the complex relation between connectivity,
anchor fraction, and range errors for a number of distributed localization algorithms.
36.2.1 Generic Approach
From the known localization algorithms specifically proposed for sensor networks, we selected the three
approaches that meet the basic requirements of self-organization, robustness, and energy efficiency:
1. Ad-hoc positioning by Niculescu and Nath [11]
2. N-hop multilateration by Savvides et al. [14]
3. Robust positioning by Savarese et al. [13]
The other approaches often include a central processing element (e.g., convex optimization by Doherty
et al. [9]), rely on an external infrastructure (e.g., GPS-less by Bulusu et al. [6]), or induce too much
TABLE 36.1 Algorithm Classification

Phase           Ad-hoc positioning [11]   Robust positioning [13]   N-hop multilateration [14]
1. Distance     Euclidean                 DV-hop                    Sum-dist
2. Position     Lateration                Lateration                Min-max
3. Refinement   No                        Yes                       Yes
communication (e.g., GPS-free by Capkun et al. [7]). The three selected algorithms are fully distributed
and use local broadcast for communication with immediate neighbors. This last feature allows them to
be executed before any multi-hop routing is in place; hence, they can support efficient location-based
routing schemes like GAF [21].
Although the three algorithms were developed independently, we found that they share a common
structure. We were able to identify the following generic, 3-phase approach¹ for determining the individual
node positions:
1. Determine the distances between unknowns and anchor nodes.
2. Derive for each node a position from its anchor distances.
3. Refine the node positions using information about the range (distance) to, and positions of,
neighboring nodes.
The original descriptions of the algorithms present the first two phases as a single entity, but we found
that separating them provides two advantages. First, we obtain a better understanding of the combined
behavior by studying intermediate results. Second, it becomes possible to mix-and-match alternatives
for both phases to tailor the localization algorithm to the external conditions. The refinement phase is
optional and may be included to obtain more accurate locations.

In the remainder of this section we describe the three phases (distance, position, and refinement)
in detail. For each phase we enumerate the alternatives as found in the original descriptions. Table 36.1
gives the breakdown into phases of the three approaches. Where applicable, we also discuss (minor) adjustments
to (parts of) the individual algorithms that were needed to ensure compatibility with the alternatives.
During our simulations we observed that we occasionally operated (parts of) the algorithms outside their
intended scenarios, which deteriorated their performance. Often, small improvements brought their
performance back in line with the alternatives.
36.2.2 Phase 1: Distance to Anchors
In this phase, nodes share information to collectively determine the distances between individual nodes
and the anchors, so that an (initial) position can be calculated in Phase 2. None of the Phase 1 alternatives
engages in complicated calculations, so this phase is communication bound. Although the three
distributed localization algorithms each use a different approach, they share a common communication
pattern: information is flooded into the network, starting at the anchor nodes. A network-wide flood by
some anchor A is expensive, since each node must forward A's information to its (potentially) unaware
neighbors. This implies a scaling problem: flooding information from all anchors to all nodes becomes
much too expensive for large networks, even with low anchor fractions. Fortunately, a good position can
be derived in Phase 2 with knowledge (position and distance) of a limited number of anchors. Therefore
nodes can simply stop forwarding information when enough anchors have been located. This simple
optimization, presented in the Robust positioning approach, proved to be highly effective in controlling
the amount of communication (see Section 36.5.3). We modified the other two approaches to include a
flood limit as well.
¹Our three phases do not correspond to the three of Savvides et al. [14]; our structure allows for an easier comparison of all algorithms.
36.2.2.1 Sum-Dist
The simplest solution for determining the distance to the anchors is to add the ranges
encountered at each hop during the network flood. This is the approach taken by the N-hop multilateration
approach, but it remained nameless in the original description [14]; we name it Sum-dist in this
chapter. Sum-dist starts at the anchors, which send a message including their identity, position, and a path
length set to 0. Each receiving node adds the measured range to the path length and forwards (broadcasts)
the message if the flood limit allows it to do so. Another constraint is that when the node has received
information about the particular anchor before, it is only allowed to forward the message if the current
path length is less than the previous one. The end result is that each node will have stored the position
and minimum path length to at least flood-limit anchors.
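Sum-dist can be sketched as a single-process flood simulation (names are ours; a real deployment would run this as message passing on each node, but the per-anchor minimum-path-length bookkeeping and the flood-limit cutoff are the same):

```python
import collections

def sum_dist(anchors, ranges, flood_limit=3):
    """anchors: {node: (x, y)}; ranges: {node: {neighbor: measured_range}}.
    Returns {node: {anchor: (anchor_position, min_path_length)}}."""
    table = collections.defaultdict(dict)
    # Each anchor starts a flood with path length 0.
    queue = collections.deque((a, a, 0.0) for a in anchors)
    while queue:
        anchor, node, length = queue.popleft()
        known = table[node].get(anchor)
        if known is not None and known[1] <= length:
            continue  # no improvement over the stored path: do not forward
        if anchor not in table[node] and len(table[node]) >= flood_limit:
            continue  # enough anchors located: stop forwarding new ones
        table[node][anchor] = (anchors[anchor], length)
        for neighbor, r in ranges[node].items():
            queue.append((anchor, neighbor, length + r))  # add hop range
    return dict(table)

# Chain A - n1 - n2 with measured ranges 10 and 12; A is the only anchor:
ranges = {'A': {'n1': 10.0}, 'n1': {'A': 10.0, 'n2': 12.0}, 'n2': {'n1': 12.0}}
t = sum_dist({'A': (0.0, 0.0)}, ranges)
# t['n2']['A'] -> ((0.0, 0.0), 22.0)
```

Note how n2's stored path length (22) is the sum of the per-hop ranges, so any range error on either hop propagates into the anchor distance, which motivates DV-hop below.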
36.2.2.2 DV-Hop
A drawback of Sum-dist is that range errors accumulate when distance information is propagated over
multiple hops. This cumulative error becomes significant for large networks with few anchors (long
paths) and/or poor ranging hardware. A robust alternative is to use topological information by counting
the number of hops instead of summing the (erroneous) ranges. This approach was named DV-hop by
Niculescu and Nath [11], and Hop-TERRAIN by Savarese et al. [13]. Since the results of DV-hop were
published first, we use this name.

DV-hop essentially consists of two flood waves. After the first wave, which is similar to Sum-dist, nodes
have obtained the position and minimum hop count to at least flood-limit anchors. The second, calibration,
wave is needed to convert hop counts into distances such that Phase 2 can compute a position. This
conversion consists of multiplying the hop count by an average hop distance. Whenever an anchor a1
infers the position of another anchor a2 during the first wave, it computes the distance between them and
divides that by the number of hops to derive the average hop distance between a1 and a2. When calibrating,
an anchor takes into account all remote anchors it is aware of. When information on extra
anchors is received later, the calibration procedure is repeated. Nodes forward (broadcast) calibration messages
only from the first anchor that calibrates them, which reduces the total number of messages in the
network.
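The calibration step can be sketched as follows (a simplification, and ours rather than the chapter's code: it pools all remote anchors an anchor heard from into a single sum-of-distances over sum-of-hops ratio):

```python
import math

def average_hop_distance(anchor_positions, hop_counts):
    """DV-hop calibration as seen by one anchor: anchor_positions[0] is
    the calibrating anchor itself, the rest are the remote anchors it
    heard from; hop_counts[i] is the hop count to anchor_positions[i+1]."""
    (x0, y0) = anchor_positions[0]
    total_dist = sum(math.hypot(x - x0, y - y0)
                     for x, y in anchor_positions[1:])
    total_hops = sum(hop_counts)
    return total_dist / total_hops

# Anchor at (0, 0) heard two anchors: (30, 0) in 3 hops and (0, 40) in 5 hops.
hop = average_hop_distance([(0, 0), (30, 0), (0, 40)], [3, 5])  # -> 8.75
# An unknown node 4 hops away then estimates its anchor distance as 4 * 8.75.
```

Because only hop counts travel through the network, individual range errors never accumulate; the price is that the estimate degrades when actual hop lengths vary widely, which is exactly the weakness Euclidean addresses next.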
36.2.2.3 Euclidean
A drawback of DV-hop is that it fails for highly irregular network topologies, where the variance in actual
hop distances is very large. Niculescu and Nath have proposed another method, named Euclidean, which is
based on the local geometry of the nodes around an anchor. Again anchors initiate a flood, but forwarding
the distance is more complicated than in the previous cases. When a node has received messages from two
neighbors that know their distance to the anchor, and to each other, it can calculate the distance to the
anchor.
FIGURE 36.2 Determining distance using Euclidean. (Self, its neighbors n1 and n2, and the anchor; ranges a, b, c, d, e and the two candidate distances r1, r2.)

Figure 36.2 shows a node (Self) that has two neighbors n1 and n2 with distance estimates (a and b) to
an anchor. Together with the known ranges c, d, and e, Euclidean arrives at two possible values (r1 and r2)
for the distance of the node to the anchor. Niculescu describes two methods to decide which, if any,
distance to use. The neighbor vote method can be applied if there exists a third neighbor (n3) that has a
distance estimate to the anchor and that is connected to either n1 or n2. Replacing n2 (or n1) by n3 will
again yield a pair of distance estimates. The correct distance is part of both pairs, and is selected by a
simple voting. Of course, more neighbors can be included to make the selection more accurate.
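The two candidate distances r1 and r2 can be computed from the five ranges by placing n1 and n2 in a local coordinate system (a sketch of the underlying geometry, not Niculescu's code; names are ours):

```python
import math

def euclidean_candidates(a, b, c, d, e):
    """Candidate distances from Self to the anchor. a, b: n1/n2 to the
    anchor; c: n1 to n2; d: Self to n1; e: Self to n2.
    Places n1 at the origin and n2 at (c, 0)."""
    xa = (a * a + c * c - b * b) / (2 * c)      # anchor x (law of cosines)
    ya = math.sqrt(max(a * a - xa * xa, 0.0))   # anchor y, pick + side
    xs = (d * d + c * c - e * e) / (2 * c)      # Self x
    ys = math.sqrt(max(d * d - xs * xs, 0.0))   # Self y, sign is ambiguous
    r1 = math.hypot(xs - xa, ys - ya)    # Self on the same side as the anchor
    r2 = math.hypot(xs - xa, -ys - ya)   # Self mirrored across line n1-n2
    return r1, r2

# n1=(0,0), n2=(4,0), anchor=(2,3), Self=(2,1): a=b=sqrt(13), c=4,
# d=e=sqrt(5). The true distance (2) and its mirror-image value (4) come out:
r1, r2 = euclidean_candidates(13**0.5, 13**0.5, 4.0, 5**0.5, 5**0.5)
```

The ambiguity between r1 and r2 is exactly what the neighbor vote and common neighbor methods resolve.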
The second selection method is called common neighbor and can be applied if node n3 is connected
to both n1 and n2. Basic geometric reasoning determines whether the anchor and n3 are on
the same or opposite side of the mirroring line through n1 and n2, and similarly whether or not Self and n3
are on the same side. From this it follows whether or not Self and the anchor lie on the same side.
To handle the uncertainty introduced by range errors, Niculescu implements a safety mechanism that
rejects ill-formed (flat) triangles, which can easily derail the selection process by neighbor vote and
common neighbor. This check verifies that the sum of the two smallest sides exceeds the largest side
multiplied by a threshold, which is set to two times the range variance. For example, the triangle Self-n1-n2
in Figure 36.2 is accepted when c + d > (1 + 2 RangeVar) e. Note that the safety check becomes more strict
as the range variance increases. This leads to a lower coverage, defined as the percentage of non-anchor
nodes for which a position was determined.
We now describe some modifications to Niculescu's neighbor vote method that remedy the poor
selection of the location for Self in important corner cases. The first problem occurs when the two votes
are identical because, for instance, the three neighbors (n1, n2, and n3) are collinear. In these cases it is
hard to select the right alternative. Our solution is to leave equal-vote cases unsolved, instead of picking
an alternative and propagating an error with 50% chance. We filter out all indecisive cases by adding the
requirement that the standard deviation of the votes for the selected distance must be at most one third of the
standard deviation of the other distance. The second problem that we address is that of a bad neighbor
with inaccurate information spoiling the selection process by voting for two wrong distances. This case is
filtered out by requiring that the standard deviation of the selected distance is at most 5% of that distance.

To achieve good coverage, we use both the neighbor vote and common neighbor methods. If both
produce a result, we use the result from the modified neighbor vote, because we found it to be the more
accurate of the two. If both fail, the flooding process stops, leading to the situation where certain nodes
are not able to establish the distance to enough anchor nodes. Sum-dist and DV-hop, on the other hand,
never fail to propagate the distance and hop count, respectively.
36.2.3 Phase 2: Node Position
In the second phase, nodes determine their position based on the distance estimates to a number of anchors
provided by one of the three Phase 1 alternatives (Sum-dist, DV-hop, or Euclidean). The Ad-hoc positioning
and Robust positioning approaches use lateration for this purpose. N-hop multilateration, on the
other hand, uses a much simpler method, which we named Min-max. In both cases the determination of
the node positions does not involve additional communication.
36.2.3.1 Lateration
The most common method for deriving a position is lateration, which is a form of triangulation. From
the estimated distances (d_i) and known positions (x_i, y_i) of the anchors we derive the following system of
equations:

    (x_1 − x)² + (y_1 − y)² = d_1²
    ...
    (x_n − x)² + (y_n − y)² = d_n²

where the unknown position is denoted by (x, y). The system can be linearized by subtracting the last
equation from the first n − 1 equations:

    x_1² − x_n² − 2(x_1 − x_n)x + y_1² − y_n² − 2(y_1 − y_n)y = d_1² − d_n²
    ...
    x_{n−1}² − x_n² − 2(x_{n−1} − x_n)x + y_{n−1}² − y_n² − 2(y_{n−1} − y_n)y = d_{n−1}² − d_n²

Reordering the terms gives a proper system of linear equations in the form Ax = b, where

    A = [ 2(x_1 − x_n)        2(y_1 − y_n)     ]
        [ ...                 ...              ]
        [ 2(x_{n−1} − x_n)    2(y_{n−1} − y_n) ]

    b = [ x_1² − x_n² + y_1² − y_n² + d_n² − d_1²             ]
        [ ...                                                 ]
        [ x_{n−1}² − x_n² + y_{n−1}² − y_n² + d_n² − d_{n−1}² ]

The system is solved using a standard least-squares method, yielding the position estimate (x̂, ŷ). The
residue of this estimate is

    (1/n) Σ_{i=1..n} ( sqrt((x_i − x̂)² + (y_i − ŷ)²) − d_i )

A large residue signals an inconsistent set of equations; we reject the location when the length of the
residue exceeds the radio range.
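A compact sketch of the lateration procedure, solving the linearized system through its 2×2 normal equations and computing the residue (names are ours; a production version would use a library least-squares solver):

```python
import math

def laterate(anchors, dists):
    """anchors: list of (x_i, y_i); dists: list of d_i (same length >= 3).
    Returns the least-squares position estimate and the average residue."""
    xn, yn = anchors[-1]
    dn = dists[-1]
    rows, rhs = [], []
    for (xi, yi), di in zip(anchors[:-1], dists[:-1]):
        rows.append((2 * (xi - xn), 2 * (yi - yn)))
        rhs.append(xi**2 - xn**2 + yi**2 - yn**2 + dn**2 - di**2)
    # Normal equations: (A^T A) p = A^T b, solved in closed form for 2 unknowns.
    a11 = sum(r[0] * r[0] for r in rows)
    a12 = sum(r[0] * r[1] for r in rows)
    a22 = sum(r[1] * r[1] for r in rows)
    b1 = sum(r[0] * v for r, v in zip(rows, rhs))
    b2 = sum(r[1] * v for r, v in zip(rows, rhs))
    det = a11 * a22 - a12 * a12
    x = (a22 * b1 - a12 * b2) / det
    y = (a11 * b2 - a12 * b1) / det
    residue = sum(math.hypot(xi - x, yi - y) - di
                  for (xi, yi), di in zip(anchors, dists)) / len(anchors)
    return (x, y), residue

# Exact distances to three anchors recover the node at (1, 2), residue ~ 0:
pos, res = laterate([(0, 0), (4, 0), (0, 4)],
                    [math.hypot(1, 2), math.hypot(3, 2), math.hypot(1, 2)])
```

With noisy distance estimates the same code still returns the best-fit position, and the residue grows with the inconsistency, which is what the radio-range rejection test above keys on.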
36.2.3.2 Min-Max
Lateration is quite expensive in the number of oating point operations that is required. A much simpler
method is presented by Savvides et al. as part of the N-hop multilateration approach. The main idea is to
construct a bounding box for each anchor using its position and distance estimate, and then to determine
the intersection of these boxes. The position of the node is set to the center of the intersection box.
Figure 36.3 illustrates the Min-max method for a node with distance estimates to three anchors. Note that
the estimated position by Min-max is close to the true position computed through Lateration (i.e., the
intersection of the three circles).
The bounding box of anchor a is created by adding and subtracting the estimated distance (d_a) from
the anchor position (x_a, y_a):

    [x_a - d_a, y_a - d_a] × [x_a + d_a, y_a + d_a]
The intersection of the bounding boxes is computed by taking the maximum of all coordinate minimums
and the minimum of all maximums:

    [max(x_i - d_i), max(y_i - d_i)] × [min(x_i + d_i), min(y_i + d_i)]
The final position is set to the average of both corner coordinates. As for Lateration, we only accept the
final position if the residue is small.
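The whole Min-max computation comes down to a handful of comparisons. A minimal sketch (our own illustration; the function name is hypothetical and the residue-based acceptance test is omitted for brevity):

```python
def min_max(anchors, dists):
    """Min-max position estimate: intersect the per-anchor bounding boxes
    [x_a - d_a, y_a - d_a] x [x_a + d_a, y_a + d_a] and return the center
    of the intersection box."""
    # Maximum of all coordinate minimums ...
    x_lo = max(x - d for (x, _), d in zip(anchors, dists))
    y_lo = max(y - d for (_, y), d in zip(anchors, dists))
    # ... and minimum of all coordinate maximums.
    x_hi = min(x + d for (x, _), d in zip(anchors, dists))
    y_hi = min(y + d for (_, y), d in zip(anchors, dists))
    # Final position: average of both corner coordinates.
    return ((x_lo + x_hi) / 2, (y_lo + y_hi) / 2)
```

Note that, unlike Lateration, this works with fewer than three anchors; the intersection box is then simply larger.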
[Figure: the intersection of the bounding boxes of Anchor1, Anchor2, and Anchor3, with the node's estimated and true positions.]
FIGURE 36.3 Determining position using Min-max.
36.2.4 Phase 3: Refinement
The objective of the third phase is to refine the (initial) node positions computed during Phase 2. These
positions are not very accurate, even under good conditions (high connectivity, small range errors),
because not all available information is used in the first two phases. In particular, most ranges between
neighboring nodes are neglected when the node-anchor distances are determined. The iterative Refinement
procedure proposed by Savarese et al. [13] does take into account all inter-node ranges, when nodes
update their positions in a small number of steps. At the beginning of each step a node broadcasts its position
estimate, receives the positions and corresponding range estimates from its neighbors, and performs
the Lateration procedure of Phase 2 to determine its new position. In many cases the constraints imposed
by the distances to the neighboring locations will force the new position towards the true position of
the node. When, after a number of iterations, the position update becomes small, Refinement stops and
reports the final position.
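The basic loop can be sketched as follows. This is our own serialized, single-machine illustration, without the confidence weights discussed below, and with an assumed update-size threshold as the stopping criterion; the real algorithm runs distributed and concurrently:

```python
import math

def laterate(refs, dists):
    """Linearized 2-D least-squares lateration (see Section 36.2.3.1)."""
    (xn, yn), dn = refs[-1], dists[-1]
    A = [(2 * (x - xn), 2 * (y - yn)) for x, y in refs[:-1]]
    b = [x * x - xn * xn + y * y - yn * yn + dn * dn - d * d
         for (x, y), d in zip(refs[:-1], dists[:-1])]
    s = [[sum(r[i] * r[j] for r in A) for j in (0, 1)] for i in (0, 1)]
    t = [sum(r[i] * bi for r, bi in zip(A, b)) for i in (0, 1)]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if abs(det) < 1e-12:
        return None  # reference points (nearly) collinear
    return ((t[0] * s[1][1] - t[1] * s[0][1]) / det,
            (t[1] * s[0][0] - t[0] * s[1][0]) / det)

def refine(positions, anchors, ranges, tol=1e-3, max_steps=50):
    """Refinement sketch: every non-anchor node repeatedly re-laterates
    against the current position estimates of its neighbors, until the
    largest position update becomes small.

    positions: dict node -> (x, y) initial estimates (anchors included)
    anchors:   set of node ids with known, fixed positions
    ranges:    dict node -> list of (neighbor, measured_distance)
    """
    for _ in range(max_steps):
        max_update = 0.0
        new_pos = dict(positions)
        for node, nbrs in ranges.items():
            if node in anchors or len(nbrs) < 3:
                continue  # anchors stay fixed; lateration needs >= 3 neighbors
            est = laterate([positions[m] for m, _ in nbrs],
                           [d for _, d in nbrs])
            if est is None:
                continue
            max_update = max(max_update, math.dist(est, positions[node]))
            new_pos[node] = est
        positions = new_pos  # all nodes update "simultaneously"
        if max_update < tol:  # position updates became small: stop
            break
    return positions
```

Updating all nodes from the previous round's positions mimics the broadcast-then-laterate step of the distributed procedure.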
The basic iterative refinement procedure outlined above proved to be too simple to be used in practice.
The main problem is that errors propagate quickly through the network; a single error introduced by
some node needs only d iterations to affect all nodes, where d is the network diameter. This effect was
countered by (1) clipping undetermined nodes with non-overlapping paths to less than three anchors,
(2) filtering out difficult symmetric topologies, and (3) associating a confidence metric with each node and
using them in a weighted least-squares solution (wAx = wb). The details (see Reference 17) are beyond
the scope of this chapter, but the adjustments considerably improved the performance of the Refinement
procedure. This is largely due to the confidence metric, which allows filtering of bad nodes, thus increasing
the (average) accuracy at the expense of coverage.
The N-hop multilateration approach by Savvides et al. [14] also includes an iterative refinement procedure,
but it is less sophisticated than the Refinement discussed above. In particular, they do not use
weights, but simply group nodes into so-called computation subtrees (over-constrained configurations)
and enforce nodes within a subtree to execute their position refinement in turn in a fixed sequence to
enhance convergence to a pre-specified tolerance. In the remainder of this chapter we will only consider
the more advanced Refinement procedure of Savarese et al.
36.3 Simulation Environment
To compare the three original distributed localization algorithms (Ad-hoc positioning, Robust positioning,
and N-hop multilateration) and to try out new combinations of phase 1, 2, and 3 alternatives, we extended
the simulator developed by Savarese et al. [13]. The underlying OMNeT++ discrete event simulator [22]
takes care of the semi-concurrent execution of the specific localization algorithm. Each sensor node
runs the same C++ code, which is parameterized to select a particular combination of phase 1, 2, and 3
alternatives.
Our network layer supports localized broadcast only, and messages are simply delivered at the neighbors
within a fixed radio range (circle) from the sending node; a more accurate model should take radio
propagation effects into account (see future work). Concurrent transmissions are allowed if the transmission
areas (circles) do not overlap. If a node wants to broadcast a message while another message in its area
is in progress, it must wait until that transmission (and possibly other queued messages) has completed.
In effect we employ a CSMA policy. Furthermore, we do not consider message corruption, so all messages
sent during our simulation are delivered (after some delay).
At the start of a simulation experiment we generate a random network topology according to some
parameters (#nodes, #anchors, etc.). The nodes are randomly placed, with a uniform distribution, within
a square area. Next we select which nodes will serve as anchors. To this end we superimpose a grid on top
of the square, and designate the node closest to each grid point as an anchor. The size of the grid is chosen
as the maximal number s that satisfies s × s ≤ #anchors; any remaining anchors are selected randomly. The
reason for carefully selecting the anchor positions is that most localization algorithms are quite sensitive
to the presence, or absence, of anchors at the edges of the network. (Locating unknowns at the edges of
the network is more difficult because nodes at the edge are less well connected, and positioning techniques
like lateration perform best when anchors surround the unknown.) Although anchor placement may not
be feasible in practice, the majority of the nodes in large-scale networks (1000+ nodes) will generally be
surrounded by anchors. By placing anchors we can study the localization performance in large networks
with simulations involving only a modest number of nodes.
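The grid-based anchor selection can be sketched as follows (our reading of the procedure; placing the grid points at the centers of an s × s subdivision of the square is an assumption, and `choose_anchors` is a hypothetical name):

```python
import math
import random

def choose_anchors(nodes, num_anchors, side):
    """Select anchors by superimposing an s x s grid (s maximal with
    s * s <= num_anchors) on the side x side square and taking the node
    closest to each grid point; remaining anchors are picked randomly."""
    s = math.isqrt(num_anchors)
    anchors = []
    for i in range(s):
        for j in range(s):
            grid_point = ((i + 0.5) * side / s, (j + 0.5) * side / s)
            # Closest not-yet-chosen node becomes an anchor.
            closest = min((n for n in nodes if n not in anchors),
                          key=lambda n: math.dist(n, grid_point))
            anchors.append(closest)
    # Any remaining anchors (e.g., 11 - 9 = 2 in the standard scenario)
    # are selected randomly from the other nodes.
    rest = [n for n in nodes if n not in anchors]
    anchors += random.sample(rest, num_anchors - s * s)
    return anchors
```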
The measured range between connected nodes is blurred by drawing a random value from a normal
distribution having a parameterized standard deviation and having the true range as the mean. We
selected this error model based on the work of Whitehouse and Culler [23], which shows that, although
individual distance measurements tend to overshoot the real distance, a proper calibration procedure yields
distance estimates with a symmetric error distribution. The connectivity (average number of neighbors)
is controlled by specifying the radio range.
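A sketch of this measurement model (our own; expressing the standard deviation as a fraction of the radio range follows the standard scenario of Section 36.3.1):

```python
import random

def measure_range(true_dist, stddev_frac, radio_range):
    """Blur the true range with Gaussian noise: the mean is the true
    range and the standard deviation is a fraction of the radio range
    (10% in the standard scenario)."""
    return random.gauss(true_dist, stddev_frac * radio_range)
```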
At the end of a run the simulator outputs a large number of statistics per node: position information,
elapsed time, message counts (broken down per type), etc. These individual node statistics are combined
and presented as averages (or distributions), for example, as an average position error. Nodes that do not
produce a position are excluded from such averaged metrics. To account for the randomness in generating
topologies and range errors we repeated each experiment 100 times with a different seed, and report the
averaged results. To allow for easy comparison between different scenarios, range errors as well as errors
on position estimates are normalized to the radio range (i.e., a 50% position error means a distance of half
the range of the radio between the real and estimated positions).
36.3.1 Standard Scenario
The experiments described in the subsequent sections share a standard scenario, in which certain parameters
are varied: radio range (connectivity), anchor fraction, and range errors. The standard scenario
consists of a network of 225 nodes placed in a square with sides of 100 units. The radio range is set to 14,
resulting in an average connectivity of about 12. We use an anchor fraction of 5%, hence, 11 anchors in
total, of which 9 (3 × 3) are placed in a grid-like position. The standard deviation of the range error is set
to 10% of the radio range. The default flood limit for Phase 1 is set to 4 (Lateration requires a minimum
of 3 anchors). Unless specified otherwise, all data will be based on this standard scenario.
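Collected as a single parameter set, the standard scenario reads (a sketch; the key names are ours, the values are from the text):

```python
# Standard scenario of Section 36.3.1; key names are our own.
STANDARD_SCENARIO = {
    "num_nodes": 225,
    "area_side": 100,            # square area of 100 x 100 units
    "radio_range": 14,           # gives an average connectivity of about 12
    "anchor_fraction": 0.05,     # 11 anchors, 9 of them placed 3 x 3 grid-like
    "range_error_stddev": 0.10,  # fraction of the radio range
    "flood_limit": 4,            # Phase 1 default
}
```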
36.4 Results
In this section we present results for the first two phases (anchor distances and node positions). We study
each phase separately and show how alternatives respond to different parameters. These intermediate
results will be used in Section 36.5, where we will discuss the overall performance, and compare complete
localization algorithms. Throughout this section we will vary one parameter in the standard scenario
(radio range, anchor fraction, range error) at a time to study the sensitivity of the algorithms. The reader,
however, should be aware that the three parameters are not orthogonal.
36.4.1 Phase 1: Distance to Anchors
Figure 36.4 shows the performance of the Phase 1 alternatives for computing the distances between
nodes and anchors under various conditions. There are two metrics of interest: first, the bias in the
estimate, measured here using the mean of the distance errors, and second, the precision of the estimated
distances, measured here using the standard deviation of the distance errors. Therefore, Figure 36.4 plots
both the average error, relative to the true distance, and the standard deviation of that relative error. We
will now discuss the sensitivity of each alternative: Sum-dist, DV-hop, and Euclidean.
36.4.1.1 Sum-Dist
Sum-dist is the cheapest of the three methods, both with respect to computation and communication
costs. Nevertheless it performs quite satisfactorily, except for large range errors (≥0.1). There are two
opposite tendencies affecting the bias of Sum-dist. First, without range errors, the sum of the ranges along
a multi-hop path will always be larger than the actual distance, leading to an overestimation of the distance.
Second, the algorithm searches for the shortest path, forcing it to select links that underestimate the actual
distance when range errors are present. The combined effect shows non-intuitive results. A small range
error reduces the bias of Sum-dist. Initially, the detour effect leads to an overshoot, but the shortest-path
effect takes over when the range errors increase, leading to a large undershoot.
When the radio range (connectivity) is increased, more nodes can be reached in a single hop. This
leads to straighter paths (less overshoot), and provides more options for selecting an (incorrect) shortest
path (higher undershoot). Consequently, increasing the connectivity is not necessarily a good thing for
Sum-dist.
36.4.1.2 DV-Hop
The DV-hop method is a stable and predictable method. Since it does not use range measurements, it is
completely insensitive to this source of errors. The low relative error (5%) shows that the calibration wave
is very effective. DV-hop searches for the path with the minimum number of hops, causing the average
hop distance to be close to the radio range. The last hop on the path from an anchor to a node, however,
is usually shorter than the radio range, which leads to a slight overestimation of the node-anchor distance.
This effect is more pronounced for short paths, hence the increased error for larger radio ranges and
higher anchor fractions (i.e., fewer hops).
36.4.1.3 Euclidean
Euclidean is capable of determining the exact anchor-node distances, but only in the absence of range errors
and in highly connected networks. When these conditions are relaxed, Euclideans performance rapidly
degrades. The curves in Figure 36.4 show that Euclidean tends to underestimate the distances. The reason
is that the selection process is forced to choose between two options that are quite far apart and that in
many cases the shortest distance is incorrect. Consider Figure 36.2 again, where the shortest distance r2
falls within the radio range of the anchor. If r2 would be the correct distance then the node should be in
direct contact with the anchor avoiding the need for a selection. Therefore nodes simply have more chance
to underestimate distances than to overestimate them in the face of (small) range errors. This error can
then propagate to nodes that are multiple hops away from the anchor, causing them to underestimate the
distance to the anchor as well.
We quantied the impact of the selection bias towards short distances. Figure 36.5 shows the distribution
of the errors, relative to the true distance, on the standard scenario for Euclideans selection mechanism
(solid line) and an oracle that always selects the best distance (dashed line). The oracles distribution is
[Figure: three stacked panels plotting the relative distance error (as a fraction of the actual distance, -0.4 to 0.8) of DV-hop, Sum-dist, and Euclidean against range variance (0 to 0.5), radio range with average connectivity (8 (4.2) to 16 (15.5)), and anchor fraction (0 to 0.2); each panel shows the mean and the standard deviation.]
FIGURE 36.4 Sensitivity of Phase 1 methods: distance error (solid lines) and standard deviation (dashed lines).
[Figure: probability density (0 to 5) of the relative range error (-1 to 1) for Euclidean's selection mechanism and for an oracle that always selects the best distance.]
FIGURE 36.5 The impact of incorrect distance selection on Euclidean.
nicely centered around zero (no error) with a sharp peak. Euclidean's distribution, in contrast, is skewed
by a heavy tail at the left, signalling a bias for underestimations.
Euclidean's sensitivity to connectivity is not immediately apparent from the accuracy data in
Figure 36.4. The main effect of reducing the radio range is that Euclidean will not be able to propagate
the anchor distances. Recall that Euclidean's selection methods require at least three neighbors with a
distance estimate to advance the anchor distance one hop. In networks with low connectivity, two parts
connected only by a few links will often not be able to share anchors. This leads to problems in Phase 2,
where fewer node positions can be computed. The effects are quite pronounced, as will become clear in
Section 36.5 (see the coverage curves in Figure 36.10).
36.4.2 Phase 2: Node Position
To obtain insight into the fundamental behavior of the Lateration and Min-max algorithms we now report
on some experiments with controlled distance errors and anchor placement. The impact of actual distance
errors as produced by the Phase 1 methods will be discussed in Section 36.5.
36.4.2.1 Distance Errors
Starting from the standard scenario we select for each node the five nearest anchors, and add some noise
to the real distances. This noise is generated by first taking a sample from a normal distribution with
the actual distance as the mean and a parameterized percentage of the distance as the standard deviation.
The result is then multiplied by a bias factor. The ranges for the standard deviation and bias factor follow
from the Phase 1 measurements.
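A sketch of this controlled error model (our interpretation: we read "multiplied by a bias factor" as scaling by 1 + b, so that b = 0 leaves the estimate unbiased, consistent with the zero-centered bias axis of Figure 36.7):

```python
import random

def biased_estimate(true_dist, stddev_frac, bias):
    """Controlled Phase 2 error model: Gaussian noise whose standard
    deviation is a fraction of the true distance, followed by a
    multiplicative bias (the 1 + bias scaling is our assumption)."""
    sample = random.gauss(true_dist, stddev_frac * true_dist)
    return sample * (1.0 + bias)
```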
Figure 36.6 shows the sensitivity of Lateration and Min-max when the standard deviation percentage
was varied from 0 to 0.25, with the bias factor fixed at zero. Lateration outperforms Min-max for precise
distance estimates, but Min-max takes over for large standard deviations (≥0.15).
Figure 36.7 shows the effect of adding a bias to the distance estimates. The curves show that Lateration is
very sensitive to a bias factor, especially for precise estimates (std. dev. = 0). Min-max is rather insensitive
to bias, because stretching the bounding boxes has little effect on the position of the center. For precise
distance estimates and a small bias factor Lateration outperforms Min-max, but the bottom graph in
Figure 36.7 shows that Min-max is probably the preferred technique when the standard deviation rises
above 10%.
Although Min-max is not very sensitive to bias, we do see that Min-max performs better for a positive
range bias (i.e., an overshoot). This is a consequence of the error introduced by Min-max using a bounding
box instead of a circle around anchors. For simplicity we limit the explanation to the effects on
[Figure: position error (% of radio range, 0 to 100) of Lateration and Min-max against the range error standard deviation (0 to 0.25).]
FIGURE 36.6 Sensitivity of Phase 2 to precision.
[Figure: two panels plotting the position error (% of radio range, 0 to 100) of Lateration and Min-max against the bias factor (-0.2 to 0.2), for std. dev. = 0 (top) and std. dev. = 0.1 (bottom).]
FIGURE 36.7 Sensitivity of Phase 2 to bias.
[Figure: bounds derived from Anchor1 and Anchor2, with ranges r1, r2 and angles a1, a2.]
FIGURE 36.8 Min-max scenario.
the x-coordinate only. Figure 36.8 shows that Anchor1 making a small angle with the x-axis yields a tight
bound (to the right of the true location), and that the large angle of Anchor2 yields a loose bound (to
the left of the true location). The estimated position is off in the direction of the loose bound (to the
left). Adding a positive bias to the range estimates causes the two bounds to shift proportionally. As a
[Figure: node positions estimated by Min-max and by Lateration; anchors, unknowns, and estimated positions are marked, with dashed lines indicating the errors.]
FIGURE 36.9 Node locations computed for network topology of Figure 36.1.
consequence the center of the intersection moves into the direction of the bound with the longest range
(to the right). Consequently the estimated coordinate moves closer to the true coordinate. The opposite
will happen if the anchor with the largest angle has the longest distance. Min-max selects the strongest
bounds, leading to a preference for small angles and small distances, which increases the number of good
cases in which the coordinate moves closer to the true coordinate when a positive range bias is added.
36.4.2.2 Anchor Placement
Min-max has the advantage of being computationally cheap and insensitive to errors, but it requires
a good constellation of anchors; in particular, Savvides et al. recommend placing the anchors at the
edges of the network [14]. If the anchors cannot be placed deliberately and are instead uniformly distributed
across the network, the accuracy of the node positions at the edges is rather poor. Figure 36.9 illustrates this problem
graphically. We applied Min-max and Lateration to the example network presented in Figure 36.1. In the
case of Min-max, all nodes that lie outside the convex envelope of the four anchor nodes are drawn
inwards, yielding considerable errors (indicated by the dashed lines); the nodes within the envelope are
located adequately. Lateration, on the other hand, performs much better. Nodes at the edges are located
less accurately than interior nodes, but the magnitude of and variance in the errors are smaller than for
Min-max.
The differences in sensitivity to anchor placement between Lateration and Min-max can be considerable.
For instance, for DV-hop/Min-max in the standard scenario, the average position accuracy degrades from
43 to 77% when anchors are randomly distributed instead of the grid-based placement. The accuracy of
DV-hop/Lateration also degrades, but only from 42 to 54%.
36.5 Discussion
Now that we know the behavior of the individual phase 1 and 2 components, we can turn to the performance
effects of concatenating both phases, followed by applying Refinement in Phase 3. We will study
the sensitivity of various combinations to connectivity, anchor fraction, and range errors using both the
resulting position error and coverage.
36.5.1 Phases 1 and 2 Combined
Combining the three Phase 1 alternatives (Sum-dist, DV-hop, and Euclidean) with the two Phase 2
alternatives (Lateration and Min-max) yields a total of six possibilities. We will analyze the differences
in terms of coverage (Figure 36.10) and position accuracy (Figure 36.11). When fine-tuning localization
algorithms, the trade-off between accuracy and coverage plays an important role; dropping difficult cases
increases average accuracy at the expense of coverage.
36.5.1.1 Coverage
Figure 36.10 shows the coverage of the six Phase 1/Phase 2 combinations for varying range error (top),
radio range (middle), and anchor fraction (bottom). The solid lines denote the Lateration variants; the
dashed lines denote the Min-max variants. The first observation is that Sum-dist and DV-hop are able to
determine the range to enough anchors to position all the nodes, except in cases when the radio range
is small (≤11), or equivalently when the connectivity is low (≤7.5). In such sparse networks, Lateration
provides a slightly higher coverage than Min-max. This is caused by the sanity check on the residue.
A consistent set of anchor positions and distance estimates leads to a low residue, but the reverse does
not hold. Occasionally, when Lateration is used with an inconsistent set, an outlier with a small residue
is produced, which is then accepted. Min-max does not suffer from this problem because the positions are always
constrained by the bounding boxes and thus such outliers cannot be produced. Lateration's higher coverage
results in higher errors; see the accuracy curves in Figure 36.11.
The second observation is that Euclidean has great difficulty in achieving a reasonable coverage when
conditions are non-ideal. The combination with Min-max gives the highest coverage, but even that
combination only achieves acceptable results under ideal conditions (range variance ≤0.1, connectivity
≥15, anchor fraction ≥0.1). The reason for Euclidean's poor coverage is twofold. First, the triangles used
to propagate anchor distances are checked for validity (see Section 36.2.2); this constraint becomes more
strict as the range variance increases, hence the significant drop in coverage. Second, Euclidean can only
forward anchor distances if enough neighbors are present (see Section 36.4.1), resulting in many nodes
locating only one or two anchors. Lateration requires at least three anchors, but Min-max does not have
this requirement. This explains why the Euclidean/Min-max combination yields a higher coverage. Again,
the price is paid in terms of accuracy (cf. Figure 36.11).
36.5.1.2 Accuracy
Figure 36.11 gives the average position error of the six combinations under the same varying conditions
as for the coverage plots. To ease the interpretation of the accuracies we filtered out anomalous cases
whose coverage is below 50%, which mainly concerns Euclidean's results. The most striking observation is
that the Euclidean/Lateration combination clearly outperforms the others in the absence of range errors:
0% error versus at least 29% (Sum-dist/Min-max). This follows from the good performance of both
Euclidean and Lateration in this case (see Section 36.4). The downside is that both components were also
shown to be very sensitive to range errors. Consequently, the average position error increases rapidly if
noise is added to the range estimates; at just 2% range variance, Euclidean/Lateration loses its advantage
over the Sum-dist/Min-max combination. When the range variance exceeds 10%, DV-hop performs best.
In this scenario DV-hop achieves comparable accuracies for both Lateration and Min-max. Which Phase 2
algorithm is most appropriate depends on anchor placement, and on whether the higher computation cost
of Lateration is important.
Notice that Sum-dist/Lateration actually becomes more accurate when a small amount of range variance
is introduced, while the errors of Sum-dist/Min-max increase. This matches the results found in
Sections 36.4.1 and 36.4.2. Adding a small range error causes Sum-dist to yield more accurate distance
estimates (cf. Figure 36.4). Lateration benefits greatly from a reduced bias, but Min-max is not that
sensitive and even deteriorates slightly (cf. Figure 36.7). The combined effect is that Sum-dist/Lateration
benefits from small range errors; Sum-dist/Min-max does not show this unexpected behavior.
All six combinations are quite sensitive to the radio range (connectivity). A minimum connectivity of
9.0 is required (at radio range 12) for DV-hop and Sum-dist, in which case Sum-dist slightly outperforms
DV-hop and the difference between Lateration and Min-max is negligible. Euclidean does not perform
well because of the 10% range variance in the standard scenario.
[Figure: three stacked panels plotting the coverage (%, 0 to 100) of the six combinations of Phase 1 (DV-hop, Sum-dist, Euclidean) and Phase 2 (Lateration, Min-max) against range variance (0 to 0.5), radio range with average connectivity (8 (4.2) to 16 (15.5)), and anchor fraction (0 to 0.2).]
FIGURE 36.10 Coverage of phase 1/2 combinations.
The sensitivity to the anchor fraction is quite similar for all combinations. More anchors ease the
localization task, especially for Euclidean, but there is no hard threshold like for the sensitivity to
connectivity.
36.5.2 Phase 3: Refinement
For brevity we do not report the effects of refining the initial positions produced by all six phase 1/2
combinations, but limit the results to the three combinations proposed in the original papers: Sum-dist/
Min-max, Euclidean/Lateration, and DV-hop/Lateration (cf. Table 36.1). Figure 36.12 shows the coverage
[Figure: three stacked panels plotting the position error (% of radio range) of the six combinations of Phase 1 (DV-hop, Sum-dist, Euclidean) and Phase 2 (Lateration, Min-max) against range variance (0 to 0.5), radio range with average connectivity (8 (4.2) to 16 (15.5)), and anchor fraction (0 to 0.2).]
FIGURE 36.11 Accuracy of phase 1/2 combinations.
with (solid lines) and without (dashed lines) Refinement for the three selected combinations. Figure 36.13
shows the average position error, but only if the coverage exceeds 50%.
The most important observation is that Refinement dramatically reduces the coverage for all combinations.
For example, in the standard case (10% range variance, radio range 14, and 5% anchors) the
coverage for Sum-dist/Min-max and DV-hop/Lateration drops from 100% to a mere 51%. For the nodes
that are not rejected Refinement results in a better accuracy: the average error decreases from 42 to 23%
for DV-hop, and from 38 to 24% for Sum-dist. Other tests have revealed that Refinement does not only
improve accuracy by merely filtering out bad nodes; the initial positions of good nodes are improved
as well. A second observation is that Refinement equalizes the performance of Sum-dist and DV-hop.
[Figure: two panels plotting the coverage (%, 0 to 100) of DV-hop/Lateration, Sum-dist/Min-max, and Euclidean/Lateration against range variance (0 to 0.5) and radio range with average connectivity (8 (4.2) to 16 (15.5)), both with Refinement and with Phases 1 and 2 only.]
FIGURE 36.12 Coverage after refinement.
[Figure: two panels plotting the position error (% of radio range, 0 to 150) of DV-hop/Lateration, Sum-dist/Min-max, and Euclidean/Lateration against range variance (0 to 0.5) and radio range with average connectivity (8 (4.2) to 16 (15.5)), both with Refinement and with Phases 1 and 2 only.]
FIGURE 36.13 Accuracy after refinement.
As a consequence the simpler Sum-dist is to be preferred in combination with Refinement, to save on
computation and communication.
36.5.3 Communication Cost
The network simulator maintains statistics about the messages sent by each node. Table 36.2 presents a
breakdown per message type of the three original localization combinations (with Refinement) on the
standard scenario.
The number of messages in Phase 1 (Flood + Calibration) is directly controlled by the flood limit
parameter, which is set to 4 by default. Figure 36.14 shows the message counts in Phase 1 for various flood
limits. Note that Sum-dist and DV-hop scale almost linearly; they level off slightly because information on
multiple anchors can be combined in a single message. Euclidean, on the other hand, levels off completely
because of the difficulties in propagating anchor distances, especially along long paths.
For Sum-dist and DV-hop we expect nodes to transmit a message per anchor. Note, however, that for
low flood limits the message count is higher than expected. In the case of DV-hop, the count also includes
the calibration messages. With some fine-tuning the number of calibration messages can be limited to
one, but the current implementation needs about as many messages as the flooding itself. A second factor
that increases the number of messages for DV-hop and Sum-dist is the update information to be sent
TABLE 36.2 Average Number of Messages Per Node

Type          Sum-dist   DV-hop   Euclidean
Flood            4.3       2.2       3.5
Calibration       -        2.6        -
Refinement       32        29        20
[Figure: two panels plotting, against the flood limit (0 to 10), the number of messages per node in Phase 1 (0 to 8, top) and the position error (% of radio range, 0 to 150, bottom) for DV-hop/Lateration, Sum-dist/Min-max, and Euclidean/Lateration.]
FIGURE 36.14 Sensitivity to flood limit.
when a shorter path is detected, which happens quite frequently for Sum-dist. Finally, all three algorithms
are self-organizing, and nodes send an extra message when discovering a new neighbor that needs to be
informed of the current status.
Although the flood limit is essential for crafting scalable algorithms, it affects the accuracy; see the
bottom graph in Figure 36.14. Note that using a higher flood limit does not always improve accuracy. In
the case of Sum-dist, there is a trade-off between using few anchors with accurate distance information,
and using many anchors with less accurate information. With DV-hop, on the other hand, the distance
estimates become more accurate for longer paths (last-hop effect, see Section 36.4.1). Euclidean's error
only increases with higher flood limits because it starts with a low coverage, which also increases with
higher flood limits. DV-hop and Sum-dist reach almost 100% coverage at flood limits of 2 (Min-max)
or 3 (Lateration).
With the flood limit set to 4, the nodes send about 4 messages during Phase 1 (cf. Table 36.2). This
is comparable to the three messages needed by a centralized algorithm: set up a spanning tree, collect
range information, and distribute node positions. Running Refinement in Phase 3, on the other hand,
is extremely expensive, requiring 20 (Euclidean) to 32 messages (Sum-dist). The problem is that Refinement
takes many iterations before local convergence criteria decide to terminate. We added a limit to the
number of Refinement messages a node is allowed to send. The effect of this is shown in Figure 36.15.
A Refinement limit of 0 means that no refinement messages are sent, and Refinement is skipped completely.
The position errors in Figure 36.15 show that most of the effect of Refinement takes place in the first
few iterations, so hard limiting the iteration count is a valid option. For example, the accuracy obtained by
DV-hop without Refinement is 42% and it drops to 28% after two iterations; an additional 4% drop can
be achieved by waiting until Refinement terminates based on the local stopping criteria, but this requires
another 27 messages (29 in total). Thus the communication cost of Refinement can effectively be reduced
to less than the costs for Phase 1. Nevertheless, the poor coverage of Refinement limits its practical use.
36.5.4 Recommendations
From the previous discussion it follows that no single combination of phase 1, 2, and 3 alternatives
performs best under all conditions; each combination has its strengths and weaknesses. The results
[Figure 36.15: two panels plotting coverage [%] and position error [%r] against the Refinement limit (0 to 10) for DV-hop/Lateration, Sum-dist/Min-max, and Euclidean/Lateration, each followed by Refinement.]
FIGURE 36.15 Effect of Refinement limit.
TABLE 36.3 Comparison; Anchor Fraction Fixed at 5%, No Refinement

Range                            Radio range (avg. connectivity)
variance   16 (15.5)             14 (12.1)             12 (9.0)            10 (6.4)            8 (4.2)
0          Euclidean/Lateration  Euclidean/Lateration  Sum-dist/Min-max    Sum-dist/Min-max    DV-hop/Lateration
0.025      Sum-dist/Lateration   Sum-dist/Min-max      Sum-dist/Min-max    Sum-dist/Min-max    DV-hop/Lateration
0.05       Sum-dist/Lateration   Sum-dist/Lateration   Sum-dist/Min-max    Sum-dist/Min-max    DV-hop/Lateration
0.1        Sum-dist/Min-max      Sum-dist/Lateration   Sum-dist/Min-max    Sum-dist/Min-max    DV-hop/Lateration
0.25       DV-hop/Lateration     DV-hop/Min-max        DV-hop/Min-max      DV-hop/Min-max      DV-hop/Lateration
0.5        DV-hop/Lateration     DV-hop/Min-max        DV-hop/Min-max      DV-hop/Min-max      DV-hop/Lateration
presented in Section 36.5 follow from changing one parameter (radio range, range variance, and anchor
fraction) at a time. Since the sensitivity of the localization algorithms may not be orthogonal in the three
parameters, it is difficult to derive general recommendations. Therefore, we conducted an exhaustive
search for the best algorithm in the three-dimensional parameter space. For readability we do not present
the raw outcome, a 6×6×5 cube, but show a two-dimensional slice instead. We found that the localization
algorithms are the least sensitive to the anchor fraction, so Table 36.3 presents the results of varying the
radio range and the range variance, while keeping the anchor fraction fixed at 5%. In each case we list the
algorithm that achieves the best accuracy (i.e., the lowest average position error) under the condition that
its coverage exceeds 50%. Since Refinement often results in very poor coverage, we only examine Phases 1
and 2 here.
The exhaustive parameter search, and basic observations about Refinement, lead to the following
recommendations:
1. Euclidean should always be used in combination with Lateration, but only if distances can be
measured very accurately (range variance <2%) and the network has a high connectivity (≥12).
When the anchor fraction is increased, Euclidean captures some more entries in the left-upper
corner of Table 36.3, and the conditions on range variance and connectivity can be relaxed slightly.
Nevertheless, the window of opportunity for Euclidean/Lateration is rather small.
2. DV-hop should be used when there are no or poor distance estimates, for example, those obtained
from the signal strength (cf. the bottom rows in Table 36.3). Our results show that DV-hop
outperforms the other methods when the range variance is large (>10% in this slice). The presence
of Lateration in the last column, that is, with a very low connectivity, is an artifact caused by
the filtering on coverage. DV-hop/Min-max has a coverage of 49% in this case (versus 56% for
DV-hop/Lateration), but also a much lower error. Regarding the issue of combining DV-hop with
Lateration or Min-max, we observe that overall, Min-max is the preferred choice. Recall, however,
its sensitivity to anchor placement, which leads to large errors at the edges of the network.
Distributed Localization Algorithms 36-21
3. Sum-dist performs best in the majority of cases, especially if the anchor fraction is (slightly)
increased above 5%. Increasing the number of anchors reduces the average path length between
nodes and anchors, limiting the accumulation of range errors along multiple hops. Except for a few
corner cases, Sum-dist performs best in combination with Min-max. In scenarios with very low
connectivity and a low anchor fraction, Sum-dist tends to overestimate the distance significantly.
Therefore DV-hop performs better in the far right column of Table 36.3.
4. Refinement can be used to improve the accuracy of the node positions when the range estimates
between neighboring nodes are quite accurate. The best results are obtained in combination with
DV-hop or Sum-dist, but at a significant (around 50%) drop in coverage. This renders the usage
of Refinement questionable, despite its modest communication overhead.
A final important observation is that the localization problem is still largely unsolved. In ideal conditions
Euclidean/Lateration performs fine, but in all other cases it suffers from severe coverage problems.
Although Refinement uses the extra information of the many neighbor-to-neighbor ranges and reduces
the error, it too suffers from coverage problems. Under most conditions, there is still significant room for
improvement.
36.6 Conclusions
This chapter addressed the issue of localization in ad-hoc wireless sensor networks. From the known
localization algorithms specifically proposed for sensor networks, three approaches were selected that
meet basic requirements of self-organization, robustness, and energy-efficiency: Ad-hoc positioning [11],
Robust positioning [13], and N-hop multilateration [14]. Although these three algorithms were developed
independently, they share a common structure. We were able to identify a generic, 3-phase approach to
determine the individual node positions consisting of the steps below:
1. Determine the distances between unknowns and anchor nodes.
2. Derive for each node a position from its anchor distances.
3. Refine the node positions using information about the range to, and positions of, neighboring
nodes.
We studied three Phase 1 alternatives (Sum-dist, DV-hop, and Euclidean), two Phase 2 alternatives
(Lateration and Min-max), and an optional Refinement procedure for Phase 3. To this end the discrete
event simulator developed by Savarese et al. [13] was extended to allow for the execution of an arbitrary
combination of alternatives.
Section 36.4 dealt with Phase 1 and Phase 2 in isolation. For Phase 1 alternatives, we studied the
sensitivity to range errors, connectivity, and fraction of anchor nodes (with known positions). DV-hop
proved to be stable and predictable; Sum-dist and Euclidean showed tendencies to underestimate the
distances between anchors and unknowns. Euclidean was found to have difficulties in propagating distance
information under non-ideal conditions, leading to low coverage in the majority of cases. The results for
Phase 2 showed that Lateration is capable of obtaining very accurate positions, but also that it is very
sensitive to the accuracy and precision of the distance estimates. Min-max is more robust, but is sensitive
to the placement of anchors, especially at the edges of the network.
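Min-max's simplicity explains its robustness: each anchor constrains the node to a square of side twice the estimated distance, and the center of the intersection of those squares is taken as the position. A minimal sketch (the function name and data layout are ours, not the chapter's):

```python
def min_max(anchors):
    """Min-max (bounding-box) position estimate.
    anchors: list of ((x, y), estimated distance) pairs."""
    # intersect the per-anchor squares [x-d, x+d] x [y-d, y+d]
    x_lo = max(x - d for (x, _), d in anchors)
    x_hi = min(x + d for (x, _), d in anchors)
    y_lo = max(y - d for (_, y), d in anchors)
    y_hi = min(y + d for (_, y), d in anchors)
    # center of the intersection box
    return ((x_lo + x_hi) / 2, (y_lo + y_hi) / 2)
```

With exact distances to anchors at (0,0), (10,0), and (0,10), a node at (2,3) is estimated within about half a unit; range errors shrink or grow the boxes but rarely make the estimate collapse, which is why Min-max degrades gracefully where Lateration does not.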
In Section 36.5 we compared all six Phase 1/2 combinations under different conditions. No single
combination performs best; which algorithm is to be preferred depends on the conditions (range errors,
connectivity, anchor fraction, and placement). The Euclidean/Lateration combination [11] should be used
only in the absence of range errors (variance <2%) and requires a high node connectivity. The DV-hop/
Min-max combination, which is a minor variation on the DV-hop/Lateration approach proposed in [11]
and [13], performs best when there are no or poor distance estimates, for example, those obtained from
the signal strength. The Sum-dist/Min-max combination [14] is to be preferred in the majority of other
conditions. The benefit of running Refinement in Phase 3 is considered to be questionable since in many
cases the coverage dropped by 50%, while the accuracy only improved significantly in the case of small
range errors. The communication overhead of Refinement was shown to be modest (about 2 messages per node)
in comparison to the controlled flooding of Phase 1 (about 4 messages per node).
36.6.1 Future Work
Regarding the future, the ultimate distributed localization algorithm is yet to be devised. Under ideal
circumstances Euclidean/Lateration performs fine, but in all other cases there is significant room for
improvement. Furthermore, additional effort is needed to bridge the gap between simulations and real-
world localization systems. For instance, we need to gather more data on the actual behavior of sensor
nodes, particularly with respect to physical effects like multipath, interference, and obstruction.
Acknowledgments
This work was first published in Elsevier Computer Networks [24]. We thank Elsevier for giving us
permission to reproduce the material. We also thank Andreas Savvides and Dragos Niculescu for their
input and for sharing their code with us.
References
[1] I. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci. A survey on sensor networks. IEEE Communications Magazine, 40: 102–114, 2002.
[2] S. Atiya and G. Hager. Real-time vision-based robot localization. IEEE Transactions on Robotics and Automation, 9: 785–800, 1993.
[3] J. Leonard and H. Durrant-Whyte. Mobile robot localization by tracking geometric beacons. IEEE Transactions on Robotics and Automation, 7: 376–382, 1991.
[4] R. Tinos, L. Navarro-Serment, and C. Paredis. Fault tolerant localization for teams of distributed robots. In IEEE International Conference on Intelligent Robots and Systems, Vol. 2, Maui, HI, October 2001, pp. 1061–1066.
[5] J. Hightower and G. Borriello. Location systems for ubiquitous computing. IEEE Computer, 34: 57–66, 2001.
[6] N. Bulusu, J. Heidemann, and D. Estrin. GPS-less low-cost outdoor localization for very small devices. IEEE Personal Communications, 7: 28–34, 2000.
[7] S. Capkun, M. Hamdi, and J.-P. Hubaux. GPS-free positioning in mobile ad-hoc networks. Cluster Computing, 5: 157–167, 2002.
[8] J. Chen, K. Yao, and R. Hudson. Source localization and beamforming. IEEE Signal Processing Magazine, 19: 30–39, 2002.
[9] L. Doherty, K. Pister, and L. El Ghaoui. Convex position estimation in wireless sensor networks. In IEEE Infocom 2001, Anchorage, AK, April 2001.
[10] T. He, C. Huang, B. M. Blum, J. A. Stankovic, and T. Abdelzaher. Range-free localization schemes for large scale sensor networks. In ACM International Conference on Mobile Computing and Networking (Mobicom), San Diego, CA, September 2003, pp. 81–95.
[11] D. Niculescu and B. Nath. Ad-hoc positioning system. In IEEE GlobeCom, San Antonio, TX, November 2001, pp. 2926–2931.
[12] V. Ramadurai and M. Sichitiu. Localization in wireless sensor networks: a probabilistic approach. In International Conference on Wireless Networks (ICWN), Las Vegas, NV, June 2003, pp. 275–281.
[13] C. Savarese, K. Langendoen, and J. Rabaey. Robust positioning algorithms for distributed ad-hoc wireless sensor networks. In USENIX Technical Annual Conference, Monterey, CA, June 2002, pp. 317–328.
[14] A. Savvides, H. Park, and M. Srivastava. The bits and flops of the n-hop multilateration primitive for node localization problems. In Proceedings of the First ACM International Workshop on Wireless Sensor Networks and Application (WSNA), Atlanta, GA, September 2002, pp. 112–121.
[15] N. Priyantha, A. Chakraborty, and H. Balakrishnan. The cricket location-support system. In Proceedings of the 6th ACM International Conference on Mobile Computing and Networking (Mobicom), Boston, MA, August 2000, pp. 32–43.
[16] P. Bahl and V. Padmanabhan. RADAR: an in-building RF-based user location tracking system. In Infocom, Vol. 2, Tel Aviv, Israel, March 2000, pp. 575–584.
[17] J. Hightower, R. Want, and G. Borriello. SpotON: an indoor 3D location sensing technology based on RF signal strength. UW CSE 00-02-02, University of Washington, Department of Computer Science and Engineering, Seattle, WA, February 2000.
[18] J. Zhao and R. Govindan. Understanding packet delivery performance in dense wireless sensor networks. In Proceedings of the First International Conference on Embedded Networked Sensor Systems (SenSys), Los Angeles, CA, November 2003, pp. 1–13.
[19] A. Savvides, C.-C. Han, and M. Srivastava. Dynamic fine-grained localization in ad-hoc networks of sensors. In Proceedings of the 7th ACM International Conference on Mobile Computing and Networking (Mobicom), Rome, Italy, July 2001, pp. 166–179.
[20] L. Girod and D. Estrin. Robust range estimation using acoustic and multimodal sensing. In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Maui, HI, October 2001.
[21] Y. Xu, J. Heidemann, and D. Estrin. Geography-informed energy conservation for ad-hoc routing. In Proceedings of the 7th ACM International Conference on Mobile Computing and Networking (Mobicom), Rome, Italy, 2001, pp. 70–84.
[22] A. Varga. The OMNeT++ discrete event simulation system. In European Simulation Multiconference (ESM2001), Prague, Czech Republic, June 2001.
[23] K. Whitehouse and D. Culler. Calibration as parameter estimation in sensor networks. In Proceedings of the 1st ACM International Workshop on Wireless Sensor Networks and Application (WSNA), Atlanta, GA, September 2002, pp. 59–67.
[24] K. Langendoen and N. Reijers. Distributed localization in wireless sensor networks: a quantitative comparison. Elsevier Computer Networks, 43: 499–518, 2003.
37
Routing in Sensor Networks

Shashidhar Gandham and Ravi Musunuri
University of Texas at Dallas

Udit Saxena
Microsoft Corporation

37.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-1
37.2 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-2
      Flat Routing Protocols • Cluster-Based Routing Protocols
37.3 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-7
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37-7
37.1 Introduction
Sensor networks are expected to be deployed in large numbers for applications such as environmental
monitoring, surveillance, security, and precision agriculture [1–4]. Each sensor node consists of a sensing
device, a processor with limited computational capabilities, memory, and a wireless transceiver. These nodes
are typically deployed in inaccessible terrain in an ad hoc manner. Once deployed, each sensor node
is expected to periodically monitor its surrounding environment and detect the occurrence of some
predetermined events. For example, in a sensor network deployed for monitoring forest fires, any sudden
surge in the temperature of the surrounding area would be an event of interest. Similarly, in a sensor
network deployed for surveillance, any moving object in the surroundings would be an event of interest.
On detecting an event, a sensor node is expected to report the details of the event to a base station associated
with the sensor network. In most cases, the base station might not be in direct reach of the reporting
nodes. Hence, sensor nodes need to form a multihop wireless network to reach the base station. A medium
access control protocol and a routing protocol are essential in setting up such a wireless network. In this
chapter we present an overview of the design challenges associated with routing in sensor networks and some
existing routing protocols.
Each sensor node is powered by limited, battery-supplied energy. Nodes drain their
energy in carrying out local tasks and in communicating with neighboring nodes. The amount of energy spent
in communication is known to be orders of magnitude higher than the amount spent in local tasks [5]. As
stated earlier, sensor nodes are expected to be deployed in inaccessible terrain, and it might not be feasible
to replenish the energy available at each node. Thus, the energy available at sensor nodes is an important
design constraint in routing.
37.2 Routing
Each sensor node is expected to monitor some environmental phenomenon and forward the corresponding
data toward the base station. To forward the data packets, each node needs to have routing
information. Here, we would like to state that the flow of packets is mostly directed from sensor nodes
toward the base station. As a result, each sensor node need not maintain explicit routing tables. Routing
protocols can in general be divided into flat routing and cluster-based routing protocols.
37.2.1 Flat Routing Protocols
In flat routing protocols the nodes in the network are considered to be homogeneous. Each node in
the network participates in route discovery, maintenance, and forwarding of the data packets. Here, we
describe a few existing flat routing protocols for sensor networks.
37.2.1.1 Sequential Assignment Routing
Sequential Assignment Routing (SAR) [5] takes into consideration the energy and Quality of Service
(QoS) of each path, and the priority level of each packet, when making routing decisions. Every node
maintains multiple paths to the sink to avoid the overhead of route recomputation after a node or
link failure. Multiple paths are built by constructing multiple trees, each rooted at a one-hop neighbor of the sink.
Each tree is grown outward from the one-hop sink neighbors by successively adding nodes, while avoiding
nodes with low QoS and energy reserves. Each sensor node can control which of its neighbors may
be used for relaying a message. Each node associates two parameters, an additive QoS metric and an energy
measure, with every path. Energy is measured by estimating the maximum number of packets that can
be routed without energy being depleted if the node uses that path exclusively. SAR then calculates a
weighted QoS metric as the product of the additive QoS metric and a weighted coefficient associated with
the priority level of the packet. The SAR algorithm attempts to minimize the average weighted QoS metric
over the lifetime of the network. A periodic recomputation of paths is triggered by the sink to account for
any changes in the topology. Failure recovery is done by a handshaking procedure between neighbors.
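The path-selection rule above can be sketched as follows. The function and parameter names are ours, and the per-path energy measure is simplified to a remaining-packet count; this is an illustration of the weighted-metric idea, not SAR's actual implementation:

```python
def weighted_metric(additive_qos, priority_coeff):
    # SAR-style weighted QoS metric: additive QoS metric of the path
    # times a coefficient tied to the packet's priority level
    return additive_qos * priority_coeff

def choose_path(paths, priority_coeff):
    """paths: list of (path_id, additive_qos, energy_left_in_packets).
    Pick the path minimizing the weighted metric among paths that can
    still carry at least one packet. (Sketch; names are ours.)"""
    viable = [p for p in paths if p[2] > 0]
    return min(viable, key=lambda p: weighted_metric(p[1], priority_coeff))[0]
```

A high-priority packet (large coefficient) scales all path metrics equally here; in SAR the coefficient interacts with the per-path metrics so that high-priority traffic is steered onto high-QoS paths.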
37.2.1.2 Directed Diffusion
Estrin et al. [2] proposed a diffusion-based scheme for routing queries from the base station to sensor nodes
and forwarding the corresponding replies. In directed diffusion, attribute-based naming is used by the
sensor nodes. Each sensor names the data it generates using one or more attributes. A sink may query for
data by disseminating interests. Intermediate nodes propagate these interests. Interests establish gradients
of data toward the sink that expressed the interest. For example, a seismic sensor may generate a datum:
type = seismic, id = 12, location = NE, time stamp = 01.01.01, footprint = vehicle/wheeled/over 40 tons.
A sink may send an interest of the form: type = seismic, location = NE. The intermediate nodes then
propagate an interest for vehicle data in the NE quadrant toward the approximate direction. The
strength of the gradient may differ toward different neighbors, resulting in different amounts of
information flow.
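The attribute matching underlying interest dissemination can be sketched as a simple subset test: a datum satisfies an interest when every attribute the interest names agrees. The dictionary layout is our illustration, not the actual directed diffusion API:

```python
def matches(interest, datum):
    # an interest matches a datum when every attribute it names
    # has the same value in the datum (the datum may carry more)
    return all(datum.get(k) == v for k, v in interest.items())

datum = {"type": "seismic", "id": 12, "location": "NE",
         "footprint": "vehicle/wheeled/over 40 tons"}
interest = {"type": "seismic", "location": "NE"}
```

Here `matches(interest, datum)` holds, so a node carrying this datum would send it down the gradient toward the sink that expressed the interest.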
37.2.1.3 Minimum Cost Forwarding Algorithm for Large Sensor Networks
The minimum cost forwarding approach proposed by Ye et al. [6] exploits the fact that the data flow in
sensor networks is in a single direction and is always toward the fixed base station. Their method requires
sensor nodes neither to have unique identities nor to maintain routing tables to forward messages. Each
node maintains the least-cost estimate from itself to the base station. Each message to be forwarded is
broadcast by the node. On receiving a message, a node checks if it is on the least-cost path between
the source sensor node and the base station. If so, it forwards the message by broadcasting it in turn.
In principle, the concept behind minimum cost forwarding is similar to the gravity field that drives
water from the top of a mountain to the ground. At each point water flows from a high position to a low
position along the shortest path. For this algorithm to work, each node needs to have the least-cost estimate
from itself to the base station. The base station broadcasts an advertisement message with the cost set to
zero. Every node initially has its estimate set to infinity. On receiving an advertisement message, a node
checks whether the estimate in the message plus the cost of the link on which it was received is less than its
current estimate. If so, both the current estimate and the estimate in the advertisement message are updated. If
the received advertisement message was updated with a new cost estimate, it is forwarded; otherwise it is purged.
As a result of forwarding an advertisement message immediately after updating, the authors noticed that
some nodes get multiple updates and do multiple forwards as lower cost estimates flow in. Furthermore,
the nodes far away from the base station get more updates than those close to the base station. To avoid
this instability during the setup phase, a back-off algorithm was proposed. According to this back-off
algorithm, on updating the current cost estimate, the advertisement message is not forwarded for A · C_node
units of time, where A is a constant determined through simulations and C_node is the cost of the link on
which the advertisement message was received.
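The setup phase with back-off can be sketched as an event-driven simulation. This is a hypothetical sketch under the stated rule (a node that improves its estimate defers its re-broadcast by A times the cost of the link the advertisement arrived on); all names are ours:

```python
import heapq

def setup_costs(links, base, A=1.0):
    """Sketch of the cost-advertisement phase with back-off.
    links: dict node -> list of (neighbor, link_cost)."""
    est = {n: float("inf") for n in links}
    est[base] = 0.0
    events = [(0.0, base)]            # (broadcast time, node)
    while events:
        t, u = heapq.heappop(events)  # earliest pending broadcast
        for v, c in links[u]:
            if est[u] + c < est[v]:   # better estimate: update, then
                est[v] = est[u] + c   # defer re-broadcast by A * c
                heapq.heappush(events, (t + A * c, v))
    return est
```

Because cheap links re-broadcast sooner, good estimates tend to propagate first and the number of redundant re-broadcasts drops, which mirrors the stabilizing effect of the back-off described above.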
37.2.1.4 Flow-Based Routing Protocol
In Reference 7, the authors modeled the sensor network as a flow network and proposed an
Integer Linear Program (ILP) [8] based routing method. The objective of this ILP-based method is to
minimize the maximum energy spent by any sensor node during a period of time. Through simulation
results, the authors showed that their ILP-based routing heuristic increases the lifetime of the sensor
network significantly.
In the above-mentioned study, the authors observed that the sensor nodes that are one hop away from
a base station (sink) drain their energy much earlier than other nodes in the network. As a result, the base
station becomes disconnected from the network. To address this problem, the deployment of multiple, intermittently
mobile base stations was proposed. The operation time of the sensor network was split into equal periods of
time referred to as rounds. Modifications were made to the flow network model and the ILP such that
solving it gives the locations of the base stations in addition to the routing information. The ILP with modifications
is given below:
Minimize E_max, subject to:

  Σ_{j∈N(i)} x_ij − Σ_{k∈N(i)} x_ki = T,   ∀i ∈ V_s   (37.1)
  E_t Σ_{j∈N(i)} x_ij + E_r Σ_{k∈N(i)} x_ki ≤ α·RE_i,   ∀i ∈ V_s   (37.2)
  Σ_{l∈V_f} y_l ≤ K_max   (37.3)
  Σ_{i∈V_s} x_ik ≤ T·|V_s|·y_k,   ∀k ∈ V_f   (37.4)
  E_t Σ_{j∈N(i)} x_ij + E_r Σ_{k∈N(i)} x_ki ≤ E_max,   ∀i ∈ V_s   (37.5)
  x_ij ≥ 0, ∀i ∈ V_s, j ∈ V;   y_k ∈ {0, 1}, ∀k ∈ V_f   (37.6)

In formulating the above ILP, the sensor network is represented as a graph G(V, E) where (1) V = V_s ∪ V_f,
where V_s represents the sensor nodes and V_f represents the feasible base station sites; and (2) E ⊆ V × V
represents the set of wireless links. The 0–1 integer variables y_l are defined such that for each l ∈ V_f,
y_l = 1 if a base station is located at feasible site l, and 0 otherwise. N(i) = {j : (i, j) ∈ E}. K_max and RE_i
represent the maximum number of base stations available and the residual energy of node i, respectively;
x_ij denotes the number of packets forwarded from node i to node j during a round, and E_t and E_r denote
the energy spent in transmitting and receiving one packet, respectively. Given G(V, E), α, and K_max,
the above ILP, denoted by BSL_mm(G, α, K_max), minimizes the maximum energy spent, E_max, by a sensor
node in a round. For a detailed explanation of the ILP, we refer the readers to Reference 7.
Apart from increasing the lifetime of the network, the authors argued that multiple, mobile base stations
would decrease the average hop length traveled by each packet and increase the robustness of the system.
37.2.1.5 Sensor Protocols for Information via Negotiation
Kulik and coworkers [9] proposed a set of protocols to disseminate individual sensor information to all the sensor
nodes. Sensor Protocols for Information via Negotiation (SPIN) overcomes information implosion and
overlap by using negotiation and information descriptors (metadata). Classic flooding suffers from the
problem of implosion in that the information is sent to all nodes regardless of whether they have already
seen that information or not. Another problem is that of overlap of information, where two pieces of
information might have some components in common, so it might be sufficient to just forward the
information after removing the common part. SPIN uses three kinds of messages to communicate:
ADV: when a node has data to send, it advertises the data using this message.
REQ: a node sends this message when it wishes to receive some data.
DATA: a data message contains the data with a metadata header.
The details are as follows:
1. SPIN-PP. This protocol is designed for point-to-point communication, assuming that two nodes can
communicate with each other without interfering with other nodes' communication. This protocol also
assumes that energy is not a constraint and that packets are never lost. The protocol works on a hop-by-hop
basis. A node that has information to send advertises this by sending an ADV to its neighboring nodes.
The nodes that are interested in receiving this information express their interest by sending a REQ. The
originator of the ADV then sends the data to the nodes that sent a REQ. These nodes then send ADV
messages to their neighbors and the process repeats itself.
2. SPIN-EC. This protocol adds an energy heuristic to the previous protocol. A node participates in the
process only if it can complete all the stages in the protocol without going below a low-energy threshold.
3. SPIN-BC. This protocol was defined for broadcast channels. The advantage is that all nodes within
hearing range can hear a broadcast, while the disadvantage is that nodes have to desist from transmitting
if the channel is already in use. Another difference from the previous protocols is that nodes do not
immediately send out REQ messages on hearing an ADV. Each node sets a random timer and, on expiry
of that timer, sends out the REQ message. The other nodes whose timers have not yet expired cancel them on
hearing the request, thus preventing redundant copies of the request being sent.
4. SPIN-RL. This protocol was designed for lossy broadcast channels by incorporating two adjustments.
First, each node keeps track of the advertisements it receives and re-requests data if a response from the
requested node is not received within a specified time interval. Second, nodes limit the frequency with
which they will resend data: every node waits for a predetermined time period before servicing requests
for the same piece of data again.
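The ADV/REQ/DATA handshake of SPIN-PP can be sketched as a small round-based simulation. The simulation, its names, and the message count it reports are our illustration; the key point is that a node whose metadata store already contains the advertised item sends no REQ, which is what suppresses implosion:

```python
def spin_pp(neighbors, source, meta, data):
    """Sketch of the SPIN-PP handshake under ideal (lossless) links.
    neighbors: dict node -> list of neighbor nodes."""
    store = {n: {} for n in neighbors}    # per-node metadata -> data
    store[source][meta] = data
    frontier, messages = [source], 0
    while frontier:
        nxt = []
        for u in frontier:
            for v in neighbors[u]:
                messages += 1             # ADV carries metadata only
                if meta not in store[v]:  # new to v: three-way exchange
                    messages += 2         # REQ + DATA
                    store[v][meta] = data
                    nxt.append(v)
        frontier = nxt                    # freshly served nodes advertise next
    return store, messages
```

On a three-node chain A–B–C this costs two full handshakes plus two suppressed ADVs, whereas classic flooding would deliver redundant full copies.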
Multihop flat routing can also be subdivided according to the signal processing technique used. There
are two types of cooperative signal processing: noncoherent and coherent. For noncoherent
processing, raw data is preprocessed at the node itself before being forwarded to the Central Node (CN)
for further processing. For coherent processing, the data is forwarded to the CN after only minimal
processing at the node, such as time stamping. Thus, for energy efficiency,
algorithmic techniques assume importance for noncoherent processing since the data traffic is low, while path
optimality is important for coherent processing.
37.2.1.6 Geographic Routing Protocols
Geographic routing protocols are based on the assumption that each node is aware of the geographical location
of its neighbors and of the destination node. There are many known location determination algorithms
[10,11] that enable sensor nodes to learn their location once deployed. On determining
its position, each node can inform its neighbors about its location. In addition, the data flow in sensor
networks is mostly directed toward a base station, whose position can be sent to the nodes on deployment.
The basic idea in geographic routing protocols is to forward packets to a neighbor that is closer to the destination.
Every node employs the same forwarding strategy until the packet reaches the destination node.
It is known that this simple packet-forwarding strategy suffers from the local minimum phenomenon [12]:
packets might reach a node whose neighbors are all farther away from the destination. They are then
stuck, with no further node to which they can be forwarded.
Karp and Kung [12] proposed the right-hand rule to overcome the local minimum phenomenon. They
assume that the underlying connectivity graph is planar. When a packet gets stuck at a node, they propose
to forward the packet along the face of the graph in the counterclockwise direction. Face routing is employed
until the packet reaches a node that is closer to the destination. Fang et al. [13] show that the local minimum
phenomenon can be addressed in nonplanar graphs too.
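Greedy geographic forwarding, and the local-minimum test that would trigger face routing, can be sketched as follows (names are ours; face routing itself is omitted):

```python
import math

def greedy_forward(pos, neighbors, src, dst):
    """Greedy geographic forwarding (sketch): hand the packet to the
    neighbor closest to the destination; report a local minimum when
    no neighbor is strictly closer (face routing would take over)."""
    def dist(a, b):
        return math.hypot(pos[a][0] - pos[b][0], pos[a][1] - pos[b][1])
    path, cur = [src], src
    while cur != dst:
        best = min(neighbors[cur], key=lambda n: dist(n, dst))
        if dist(best, dst) >= dist(cur, dst):
            return path, False        # stuck at a local minimum
        cur = best
        path.append(cur)
    return path, True
```

Note that each decision uses only the positions of the current node, its neighbors, and the destination; this locality is what makes geographic routing attractive for large sensor networks.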
37.2.1.7 Parametric Probabilistic Routing
In the parametric probabilistic routing protocol proposed by Barrett et al. [14], each node forwards a packet
based on a probability density function. Barrett et al. proposed two variations of their protocol. In the
first variation, referred to as the Destination Attractor, the probability with which a packet is forwarded
to a neighbor depends on the number of hops the source node is from the destination and the number
of hops the current node is from the destination. The basic idea behind this variation is to increase the
probability of retransmission if the packet is approaching the destination, and to decrease the probability
of retransmission if the packet is moving away from the destination. The second variation, referred to as
Directed Transmission, uses the number of hops already traversed by the packet in addition to the two parameters
used by the Destination Attractor. In directed transmission, nodes on the shortest path to the destination
retransmit with higher probability.
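The Destination Attractor intuition can be illustrated with a toy probability function. This is only an illustrative shape based on hop counts, not the density actually used by Barrett et al.:

```python
def destination_attractor_prob(h_src, h_cur, p_min=0.1, p_max=1.0):
    """Illustrative retransmission probability in the spirit of the
    Destination Attractor (not the authors' exact density).
    h_src, h_cur: hop counts to the destination from the original
    source and from the current node."""
    if h_cur == 0:
        return p_max                  # packet has arrived
    if h_cur < h_src:
        return p_max                  # packet made progress: keep it alive
    if h_cur == h_src:
        return (p_min + p_max) / 2    # no net progress yet
    return p_min                      # packet moved away: likely drop
```

Any monotone function of the progress (h_src − h_cur) would serve the same purpose; the protocol's behavior is tuned through the shape of this density.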
37.2.1.8 MinMinMax, an Energy Aware Routing Protocol
Gandham et al. [15] formulated energy-aware routing during a round as described below. The
sensor network is represented as a graph G(V, E) where
1. V = V_s ∪ V_b, where V_s is the set of sensor nodes and V_b is the set of base station(s).
2. E ⊆ V × V represents the set of wireless links.
A round is assumed to consist of T time frames, and each sensor node generates one packet of data in
every time frame. At the beginning of a round, the residual energy at a sensor node i is represented by RE_i.
During a round, the total energy spent by sensor node i can be at most α·RE_i, where α (0 < α ≤ 1) is
a parameter. The goal is to determine routing information so as to minimize the total energy spent in the
network such that the maximum energy spent by a node in a round is minimized.
It is known that the energy spent by a node is directly proportional to the amount of flow (number
of packets) passing through the node. Thus, minimizing the maximum energy spent by a node is the same
as minimizing the maximum flow through a node. Exploiting this fact, energy-aware routing is cast as a
variant of the maximum flow problem [16]. In the maximum flow problem [16], we are given a directed graph
G(V, E), a supply node S_s, a demand node S_d, and a capacity u_ij for each link (i, j) ∈ E; we must determine
the flow x_ij on each arc (i, j) ∈ E such that the net outflow from the supply node is maximized. We refer
the readers to Reference 15 for the details.
37.2.2 Cluster-Based Routing Protocols
In cluster-based routing protocols, special nodes referred to as cluster heads discover and maintain routes,
and non-cluster-head nodes join one of the clusters. All the data packets originating in a cluster are
forwarded toward the cluster head. The cluster head in turn forwards these packets toward the destination
using the routing information. Here, we describe some cluster-based routing protocols from the literature.
37.2.2.1 Low-Energy Adaptive Clustering Hierarchy
Chandrakasan and coworkers [17] proposed Low-Energy Adaptive Clustering Hierarchy (LEACH) as an energy-efficient communication protocol for wireless sensor networks. The authors of LEACH claim that this protocol will extend the life of a wireless sensor network by a factor of 8 when compared with protocols based on multihop routing and static clustering. LEACH is a cluster-based routing algorithm in which self-elected cluster heads collect data from all the sensor nodes in their cluster, aggregate the
2006 by Taylor & Francis Group, LLC
37-6 Embedded Systems Handbook
collected data by data fusion methods, and transmit the data directly to the base station. These self-elected cluster heads continue to be cluster heads for a period referred to as a round. At the beginning of each round, every node determines whether it will be a cluster head during the current round. If it decides to be a cluster head for the current round, it announces its decision to its neighbors. On listening to these announcements, the other nodes opt to join one of the cluster heads based on predetermined parameters, such as signal-to-noise ratio.
LEACH is proposed for routing data in wireless sensor networks that have a fixed base station to which the recorded data needs to be routed. All the sensor nodes are considered to be static, homogeneous, and energy constrained. The sensor nodes are expected to sense the environment continuously and thus have data to send at a fixed rate. This assumption makes LEACH unsuitable for sensor networks in which a moving source needs to be monitored. Furthermore, radio channels are assumed to be symmetric; symmetric here means that the energy required to transmit a particular message between two nodes is the same in either direction. A first-order radio model is assumed to describe the transmission characteristics of the sensor nodes. In this model the energy required to transmit a signal has a fixed part and a variable part, the variable part being directly proportional to the square of the distance. Some constant energy is required to receive a signal at any receiving antenna. Based on these assumptions, it is clear that routing the data through many intermediate nodes might consume more energy, from a global perspective, than direct transmission to the base station. This argument supports the decision to transmit the aggregated data directly from cluster head to base station.
The key features of LEACH are localized coordination for cluster setup and operation, randomized rotation of cluster heads, and local fusion of data to reduce global communication costs. LEACH is organized into rounds, where each round starts with a setup phase followed by a longer steady-state data transfer phase. Here we describe the various subphases involved in both these phases:
1. Advertisement phase. A predetermined fraction of nodes, say p, elect themselves as cluster heads. The optimum value of p can be found from a plot of normalized energy dissipation against the percentage of nodes acting as cluster heads; for a detailed description of this procedure we refer the reader to Reference 18. The decision to be a cluster head is made by choosing a random number between 0 and 1. If the generated number is less than a threshold T(n), then the node will be a cluster head for the current round. The threshold T(n) is given by the expression p/[1 − p·(r mod (1/p))], where r is the current round. This ensures that every node will be a cluster head once in every 1/p rounds. Once the decision is made, cluster heads advertise their ids using a CSMA MAC protocol.
2. Cluster setup phase. On listening to the advertisements of the previous phase, non-cluster-head nodes determine which cluster head to join by comparing the signal-to-noise ratios of the various cluster heads surrounding them. Each node informs the chosen cluster head of its decision to join, again using a CSMA MAC protocol.
3. Schedule creation. On receiving all these messages, the cluster head creates a TDMA schedule and announces it to all the nodes in its cluster. In order to avoid interference between nodes in adjacent clusters, the cluster head also determines the CDMA code to be used by all the nodes in its cluster. The CDMA code for the current round is transmitted along with the TDMA schedule.
4. Data transmission. Once the schedule is known, each node transmits its data during the time slot allocated to it. When the cluster head has received data from all the nodes in its cluster, it runs data fusion algorithms to aggregate the data. The resulting data is transmitted directly to the base station.
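The election rule of the advertisement phase can be sketched in a few lines of Python. This is an illustrative reading of the threshold formula T(n) = p/[1 − p·(r mod (1/p))], not the authors' code; the function names and the bookkeeping of the eligible set are our own assumptions.

```python
import random

def leach_threshold(p, r):
    """LEACH election threshold T(n) for round r, cluster-head fraction p."""
    return p / (1.0 - p * (r % int(1 / p)))

def elect_cluster_heads(node_ids, p, r, eligible):
    """One advertisement phase.

    Each node still eligible in the current cycle of 1/p rounds draws a
    uniform random number and becomes a cluster head if it falls below
    T(n).  `eligible` is the (mutable) set of nodes that have not yet
    served as head in this cycle."""
    if r % int(1 / p) == 0:          # new cycle: every node eligible again
        eligible |= set(node_ids)
    t = leach_threshold(p, r)
    heads = {n for n in node_ids if n in eligible and random.random() < t}
    eligible -= heads                # heads sit out the rest of the cycle
    return heads
```

Because T(n) grows to 1 by the last round of each cycle, every node serves as cluster head exactly once per 1/p rounds.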
37.2.2.2 Threshold Sensitive Energy-Efficient Sensor Network Protocol
In Reference 19, the authors classify sensor networks into proactive networks and reactive networks. Nodes in proactive networks continuously monitor the environment and thus have data to send at a constant rate. LEACH suits such sensor networks in transmitting data efficiently to the base station. In reactive sensor networks, nodes need to transmit data only when an event of interest occurs. Hence, not all nodes in the network have an equal amount of data to transmit. Manjeshwar
and Agrawal [19] proposed the Threshold Sensitive Energy-Efficient Sensor Network protocol (TEEN) for routing in reactive sensor networks.
TEEN employs the cluster formation strategy of LEACH but adopts a different strategy in the data transmission phase. TEEN makes use of two user-defined parameters, a hard threshold (Ht) and a soft threshold (St), to determine whether a node needs to transmit its currently sensed value. When the monitored value exceeds Ht for the first time, it is stored in a variable and is transmitted during the node's time slot. Subsequently, if the monitored value exceeds the currently stored value by a magnitude of St, the node transmits the data, and the transmitted value is stored for future comparisons.
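The transmit decision of a TEEN node can be sketched as follows. This is an illustrative interpretation (we read "exceeds by a magnitude of St" as an absolute difference of at least St); the function name and return convention are hypothetical.

```python
def teen_should_transmit(sensed, state, ht, st):
    """Decide whether a TEEN node transmits its current reading.

    `state` holds the last transmitted value (None before the hard
    threshold ht has ever been crossed); st is the soft threshold.
    Returns (transmit?, new_state)."""
    if sensed < ht:
        return False, state           # below hard threshold: stay silent
    if state is None or abs(sensed - state) >= st:
        return True, sensed           # first crossing, or changed by >= st
    return False, state               # change too small to be worth sending
```

For example, with Ht = 10 and St = 2, a node first transmits when its reading reaches 12, stays silent at 13, and transmits again at 15.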
37.2.2.3 Two-Level Clustering Algorithm
Estrin et al. [2] proposed a two-level clustering algorithm that can be extended to build a cluster hierarchy. In this algorithm every sensor at a particular level is associated with a radius, the number of hops that its advertisements will reach. Sensors at a higher level are associated with larger radii. All sensors start at level 0. Each sensor sends out periodic advertisements to the other nodes within its radius. The advertisements carry its current level, its parent's identity (if any), and its remaining energy. After transmitting an advertisement, each node waits for a time proportional to its radius to receive advertisements from other nodes. At the end of the wait time, each level 0 node starts a promotion timer whose duration is proportional to its remaining energy reserves and the number of level 0 nodes whose advertisements it received. When the promotion timer expires, the node promotes itself to level 1 and starts sending out periodic advertisements. In these new advertisements it lists its potential children, which are the level 0 nodes that it previously heard. A level 0 node then picks its parent from among the level 1 nodes whose advertisements included its identity. Once a level 0 node picks its parent, it cancels its promotion timer and drops out of the race. At the end, each level 1 node starts a wait timer and waits for acknowledgments from its potential children. If no level 0 node selected it as a parent, or if its energy drops below a certain level, it demotes itself to a level 0 node. All level 0 and level 1 nodes periodically re-enter the wait stage to take into account any change in network conditions, and reclustering takes place.
37.3 Conclusions
In this chapter we presented a brief overview of some known routing algorithms for wireless sensor networks. Both flat and cluster-based routing algorithms were discussed.
References
[1] Estrin, D., Girod, L., Pottie, G., and Srivastava, M. Instrumenting the world with wireless sensor networks. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, IEEE, 2001, pp. 2033–2036.
[2] Estrin, D., Govindan, R., Heidemann, J., and Kumar, S. Next century challenges: scalable coordination in sensor networks. In Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking, IEEE, 1999, pp. 263–270.
[3] Pottie, G.J. and Kaiser, W.J. Wireless integrated network sensors. Communications of the ACM, 43, 51–58, 2000.
[4] Pottie, G.J. Wireless sensor networks. In Proceedings of the Information Theory Workshop, 1998, pp. 139–140.
[5] Sohrabi, K., Gao, J., Ailawadhi, V., and Pottie, G.J. Protocols for self-organization of a wireless sensor network. IEEE Personal Communications, 7, 16–27, 2000.
[6] Ye, F., Chen, A., Liu, S., and Zhang, L. A scalable solution to minimum cost forwarding in large sensor networks. In Proceedings of the 10th International Conference on Computer Communications and Networks, 2001, pp. 304–309.
[7] Gandham, Shashidhar Rao, Dawande, Milind, Prakash, Ravi, and Venkatesan, S. Energy efficient schemes for wireless sensor networks with multiple mobile base stations. In Proceedings of IEEE Globecom, IEEE, 2003.
[8] Nemhauser, G.L. and Wolsey, L.A. Integer Programming and Combinatorial Optimization. John Wiley & Sons, New York, 1988.
[9] Heinzelman, W., Kulik, J., and Balakrishnan, H. Negotiation-based protocols for disseminating information in wireless sensor networks. In Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking, IEEE, 1999.
[10] Saikat Ray, Rachanee Ungrangsi, Francesco De Pellegrini, Ari Trachtenberg, and David Starobinski. Robust location detection in emergency sensor networks. In Proceedings of INFOCOM, IEEE, 2003.
[11] Nirupama Bulusu, John Heidemann, and Deborah Estrin. GPS-less low cost outdoor localization for very small devices. Technical report 00-729, USC/ISI, April 2000.
[12] Karp, B. and Kung, H. GPSR: greedy perimeter stateless routing for wireless networks. In Proceedings of MobiCom, ACM, 2000.
[13] Qing Fang, Jie Gao, and Leonidas J. Guibas. Locating and bypassing routing holes in sensor networks. In Proceedings of INFOCOM, IEEE, 2004.
[14] Christopher L. Barrett, Stephan J. Eidenbenz, Lukas Kroc, Madhav Marathe, and James P. Smith. Parametric probabilistic sensor network routing. In Proceedings of WSNA '03, 2003.
[15] Shashidhar Gandham, Milind Dawande, and Ravi Prakash. An integral flow-based energy-efficient routing algorithm for wireless sensor networks. In Proceedings of WCNC, IEEE, 2004.
[16] Ahuja, R.K. and Orlin, J.B. A fast and simple algorithm for the maximum flow problem. Operations Research, 37, 748–759, 1989.
[17] Heinzelman, W.R., Chandrakasan, A., and Balakrishnan, H. Energy-efficient communication protocol for wireless microsensor networks. In Proceedings of the 33rd Annual Hawaii International Conference on System Sciences, 2000, pp. 3005–3014.
[18] Heinzelman, W., Kulik, J., and Balakrishnan, H. Adaptive protocols for information dissemination in wireless sensor networks. In Proceedings of the 5th Annual ACM/IEEE International Conference on Mobile Computing and Networking, IEEE, 1999, pp. 174–185.
[19] Manjeshwar, A. and Agrawal, D.P. TEEN: a routing protocol for enhanced efficiency in wireless sensor networks. In Proceedings of the 15th International Parallel and Distributed Processing Symposium, 2001, pp. 2009–2015.
38
Distributed Signal Processing in Sensor Networks

Omid S. Jahromi, Bioscrypt Inc.
Parham Aarabi, University of Toronto

38.1 Introduction
38.2 Spectrum Estimation Using Sensor Networks
    Background • Mathematical Formulation of the Problem
38.3 Inverse and Ill-Posed Problems
    Ill-Posed Linear Operator Equations • Regularization Methods for Solving Ill-Posed Linear Operator Equations
38.4 Spectrum Estimation Using Generalized Projections
38.5 Distributed Algorithms for Calculating Generalized Projection
    The Ring Algorithm • The Star Algorithm
38.6 Concluding Remark
Acknowledgments
References
38.1 Introduction
Sensors are vital means for scientists and engineers to observe physical phenomena. They are used to measure physical variables such as temperature, pH, velocity, rotational rate, flow rate, pressure, and many others. Most modern sensors output a discrete-time (digitized) signal that is indicative of the physical variable they measure. These signals are often imported into digital signal processing (DSP) hardware, stored in files, or plotted on a computer display for monitoring purposes.
In recent years a number of new sensing concepts have emerged which advocate connecting a large number of inexpensive and small sensors in a sensor network. The trend to network many sensors together has been reinforced by the widespread availability of cheap embedded processors and easily accessible wireless networks. The building blocks of a sensor network, often called Motes, are self-contained, battery-powered computers that measure light, sound, temperature, humidity, and other environmental variables (Figure 38.1).
Motes can be deployed in large numbers, providing enhanced spatio-temporal sensing coverage in ways that are either prohibitively expensive or impossible using conventional sensing assets. For example, they allow land, water, and air resources to be monitored for environmental purposes. They can also be used to monitor borders for safety and security. In defence applications, sensor networks can provide
FIGURE 38.1 A wireless sensor node or Mote made by Crossbow Technology, Inc. in San Jose, CA.
enhanced battlefield situational awareness, which can revolutionize a wide variety of operations from armored assault on open terrain to urban warfare. Sensor networks have many potential applications in biomedicine, factory automation, and control of transportation systems as well.
In principle, a distributed network of sensors can be highly scalable, cost effective, and robust with respect to individual Mote failures. However, there are many technological hurdles that must be overcome for sensor networks to become viable. For instance, Motes are inevitably constrained in processing speed, storage capacity, and communication bandwidth. Additionally, their lifetime is determined by their ability to conserve power. These constraints require new hardware designs and novel network architectures.
Sensor networks raise nontrivial theoretical issues as well. For example, new networking protocols must be devised to allow the sensor nodes to spontaneously create an impromptu network, dynamically adapt to device failure, manage the movement of sensor nodes, and react to changes in task and network requirements.
From a signal processing point of view, the main challenge is the distributed fusion of sensor data across the network. This is because individual sensor nodes are often not able to provide useful or comprehensive information about the quantity under observation. Furthermore, the following constraints must be considered when designing the information fusion algorithm:
1. Each sensor node is likely to have limited power and bandwidth capabilities for communicating with other devices. Therefore, any distributed computation on the sensor network must be very efficient in utilizing the limited power and bandwidth budget of the sensor devices.
2. Owing to the variable environmental conditions in which sensor devices may be deployed, one can expect a fraction of the sensor nodes to be malfunctioning. Therefore, the underlying distributed algorithms must be robust with respect to device failures.
Owing to the large and often ad hoc nature of sensor networks, it would be a formidable challenge to develop distributed information fusion algorithms without first developing a simple, yet rigorous and flexible, mathematical model. The aim of this chapter is to introduce one such model.
We advocate that information fusion in sensor networks should be viewed as a problem of finding a solution point in the intersection of some feasibility sets. The key advantage of this viewpoint is that the solution can be found using a series of projections onto the individual sets. The projections can be computed locally at each sensor node, allowing the fusion process to be done in a parallel and distributed fashion.
To maintain clarity and simplicity, we will focus on solving a benchmark signal processing problem (spectrum estimation) using sensor networks. However, the fusion algorithms that result from our formulations are very general and can be used to solve other sensor network signal processing problems as well.
Notation: Vectors are denoted by capital letters. Boldface capital letters are used for matrices. Elements of a matrix A are referred to as [A]_ij. We denote the set of real M-tuples by R^M and use the notation R^+ for positive real numbers. The expected value of a random variable x is denoted by E{x}. The linear convolution operator is denoted by *. The spaces of Lebesgue-measurable functions are represented by L_1(a, b), L_2(a, b), etc. The end of an example is indicated by the symbol □.
38.2 Spectrum Estimation Using Sensor Networks
38.2.1 Background
Spectrum estimation is concerned with determining the distribution in frequency of the power of a random process. Questions such as "Does most of the power of the signal reside at low or high frequencies?" or "Are there resonance peaks in the spectrum?" are often answered as a result of a spectral analysis. Spectral analysis finds frequent and extensive use in many areas of the physical sciences. Examples abound in oceanography, electrical engineering, geophysics, astronomy, and hydrology.
Let x(n) denote a zero-mean Gaussian wide-sense stationary (WSS) random process. It is well known that a complete statistical description of such a process is provided by its autocorrelation sequence (ACS)

R_x(k) = E{x(n) x(n + k)}

or, equivalently, by its power spectrum, also known as power spectral density (PSD):

P_x(e^{jω}) = Σ_{k=−∞}^{∞} R_x(k) e^{−jωk}
The ACS is a time-domain description of the second-order statistics of a random process. The power spectrum provides a frequency-domain description of the same statistics.
An issue of practical importance is how to estimate the power spectrum of a time series given a finite-length data record. This is not a trivial problem, as reflected in a bewildering array of power spectrum estimation procedures, each claimed to have or show some optimum property.¹ The reader is referred to the excellent texts [3–6] for analysis of empirical spectrum estimation methods.
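The ACS/PSD pair can be checked numerically. The sketch below assumes a first-order autoregressive process, for which both the ACS and the PSD have well-known closed forms, and verifies that the inverse transform of the PSD reproduces the ACS; the pole value and grid size are arbitrary choices for illustration.

```python
import numpy as np

# AR(1) process x(n) = a*x(n-1) + w(n) with unit-variance white noise w.
# Standard closed forms (assumed known) for its ACS and PSD:
a = 0.6
acs = lambda k: a ** abs(k) / (1 - a ** 2)
psd = lambda w: 1.0 / np.abs(1 - a * np.exp(-1j * w)) ** 2

# R_x(k) = (1/2pi) * integral over [-pi, pi) of P_x(e^{jw}) e^{jwk} dw.
# On a uniform grid covering one full period, that integral is just
# 2*pi times the sample mean of the integrand.
w = np.linspace(-np.pi, np.pi, 4096, endpoint=False)
r_from_psd = [np.mean(psd(w) * np.exp(1j * w * k)).real for k in range(4)]
```

Because the integrand is smooth and periodic, the rectangle rule on a uniform grid recovers the autocorrelation coefficients essentially to machine precision.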
Consider the basic scenario where a sound source (a speaker) is monitored by a collection of Motes placed at various known locations in a room (Figure 38.2). Because of reverberation, noise, and other artifacts, the signal arriving at each Mote location is different. The Motes (which constitute the sensor nodes in our network) are equipped with microphones, sampling devices, sufficient signal processing hardware, and some communication means. Each Mote can process its observed data, come up with some statistical inference about it, and share the result with other nodes in the network. However, to save energy and communication bandwidth, the Motes are not allowed to share their raw observed data with each other.
Now, how should the network operate so that an estimate of the power spectrum of the sound source, consistent with the observations made by all Motes, is obtained?² We will provide an answer to this question in the sections that follow.

¹The controversy is rooted in the fact that the power spectrum is a probabilistic quantity and such quantities cannot be constructed using finite-size sample records. Indeed, neither the axiomatic theory [1] nor the frequency theory [2] of probability specifies a constructive way of building probability measures from empirical samples.
²The problem of estimating the power spectrum of a random signal, when the signal itself is not available but some measured signals derived from it are observable, has been studied in Reference 7. The approach developed in Reference 7, however, leads to a centralized fusion algorithm which is not suited to sensor network applications.
FIGURE 38.2 A sensor network monitoring a stationary sound source in a room. [Figure labels: speech source, sensor nodes, microphone, data acquisition and processing module, communications module.]
38.2.2 Mathematical Formulation of the Problem
Let x(n) denote a discrete version of the signal produced by the source and assume that it is a zero-mean Gaussian WSS random process. The sampling frequency f_s associated with x(n) is arbitrary and depends on the frequency resolution desired in the spectrum estimation process.
We denote by v_i(n) the signal produced at the front end of the ith sensor node. We assume that the v_i(n) are related to the original source signal x(n) by the model shown in Figure 38.3. The linear filter H_i(z) in this figure models the combined effect of room reverberations, the microphone's frequency response, and any additional filter which the system designer might want to include. The decimator block that follows the filter represents the (potential) difference between the sampling frequency f_s associated with x(n) and the actual sampling frequency of the Mote's sampling device. Here, it is assumed that the sampling frequency associated with v_i(n) is f_s/N_i, where N_i is a fixed natural number.
It is straightforward to show that the signal v_i(n) in Figure 38.3 is also a WSS process. The autocorrelation coefficients R_{v_i}(k) associated with v_i(n) are given by

R_{v_i}(k) = R_{x_i}(N_i k)    (38.1)

where

R_{x_i}(k) = (h_i(k) * h_i(−k)) * R_x(k)    (38.2)

and h_i(k) denotes the impulse response of H_i(z). We can express R_{v_i}(k) as a function of the source signal's power spectrum as well. To do this, we define G_i(z) = H_i(z) H_i(z^{−1}) and then use it to write (38.2) in
FIGURE 38.3 The relation between the signal v_i(n) produced by the front end of the ith sensor and the original source signal x(n). [Block diagram: speech source x(n) → H_i(z) → x_i(n) → ↓N_i → v_i(n) → processor.]
the frequency domain:

R_{x_i}(k) = (1/2π) ∫_{−π}^{π} P_x(e^{jω}) G_i(e^{jω}) e^{jωk} dω    (38.3)

Combining (38.1) and (38.3), we then get

R_{v_i}(k) = (1/2π) ∫_{−π}^{π} P_x(e^{jω}) G_i(e^{jω}) e^{jωN_i k} dω    (38.4)
The above formula shows that P_x(e^{jω}) uniquely specifies R_{v_i}(k) for all values of k. However, the reverse is not true. That is, in general, knowing R_{v_i}(k) for some or all values of k is not sufficient for characterizing P_x(e^{jω}) uniquely.
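Relations (38.1) and (38.2) are easy to verify by simulation. The sketch below assumes a unit-variance white Gaussian source, so that R_x(k) = δ(k) and R_{v_i}(k) collapses to the deterministic autocorrelation of the filter sampled at lags N_i·k; the impulse response h and the decimation factor are hypothetical stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
h = np.array([1.0, 0.5, 0.25, 0.125])     # stand-in impulse response h_i(k)
Ni = 2                                     # decimation factor N_i

# White Gaussian source => R_x(k) = delta(k).
x = rng.standard_normal(2_000_000)
v = np.convolve(x, h, mode="valid")[::Ni]  # filter by H_i(z), keep every Ni-th

def sample_acs(s, k):
    """Sample autocorrelation estimate of s at lag k."""
    return float(np.mean(s[: len(s) - k] * s[k:])) if k else float(np.mean(s * s))

def theory_acs(k):
    """R_{v_i}(k) = R_{x_i}(N_i k) = sum_n h(n) h(n + N_i k) for white input."""
    m = Ni * k
    return float(np.sum(h[: len(h) - m] * h[m:])) if m < len(h) else 0.0
```

With two million source samples, the empirical lags agree with the closed-form values to a few parts in a thousand.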
Recall that v_i(n) is a WSS signal, so all the statistical information that can be gained about it is confined to its autocorrelation coefficients. One might use the signal processing hardware available at each sensor node to estimate the autocorrelation coefficients R_{v_i}(k) for some k, say 0 ≤ k ≤ L − 1. Now, we may pose the sensor network spectrum estimation problem as follows:
Problem 38.1
Let Q_{i,k} denote the set of all power spectra which are consistent with the kth autocorrelation coefficient R_{v_i}(k) estimated at the ith sensor node. That is, P_x(e^{jω}) ∈ Q_{i,k} if

(1/2π) ∫_{−π}^{π} P_x(e^{jω}) G_i(e^{jω}) e^{jωN_i k} dω = R_{v_i}(k),
P_x(e^{jω}) ≥ 0,
P_x(e^{jω}) = P_x(e^{−jω}),
P_x(e^{jω}) ∈ L_1(−π, π).

Define Q = ∩_{i=1}^{N} ∩_{k=0}^{L−1} Q_{i,k}, where N is the number of nodes in the network and L is the number of autocorrelation coefficients estimated at each node. Find a P_x(e^{jω}) in Q.
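Once the spectrum is sampled on a finite frequency grid, each equality constraint in Problem 38.1 becomes a hyperplane in R^M, and the nonnegativity constraint a clipping step. The Euclidean projection sketch below is a simplified stand-in for the generalized projections developed later in the chapter; the kernel g and the target value r in the usage example are hypothetical.

```python
import numpy as np

def project_onto_constraint(P, g, r, dw):
    """Euclidean projection of a sampled spectrum P (values on a uniform
    frequency grid with spacing dw) onto the hyperplane

        (1/2pi) * sum_j P[j] * g[j] * dw = r,

    a discretised version of one autocorrelation constraint in
    Problem 38.1, followed by clipping to re-impose P >= 0 (the clip can
    perturb the equality; alternating projections handle that)."""
    a = g * dw / (2 * np.pi)             # constraint written as <a, P> = r
    P = P + a * (r - a @ P) / (a @ a)    # closed-form hyperplane projection
    return np.maximum(P, 0.0)

# Tiny illustration with a hypothetical kernel g and target value r.
w = np.linspace(-np.pi, np.pi, 512, endpoint=False)
dw = w[1] - w[0]
P0 = np.ones_like(w)                     # initial guess: white spectrum
g = 1.0 + 0.5 * np.cos(w)               # stand-in for G_i(e^{jw}) at k = 0
P1 = project_onto_constraint(P0, g, r=2.0, dw=dw)
```

Cycling such projections over all sets Q_{i,k} is the basic mechanism by which a point in the intersection Q can be approached.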
If we ignore measurement imperfections and assume that the observed autocorrelation coefficients R_{v_i}(k) are exact, then the sets Q_{i,k} are nonempty and admit a nonempty intersection Q as well. In this case Q contains infinitely many P_x(e^{jω}). When the measurements v_i(n) are contaminated by noise, or the R_{v_i}(k) are estimated from finite-length data records, the intersection set Q might be empty owing to the potential inconsistency of the autocorrelation coefficients estimated by different sensors. Thus, Problem 38.1 has either no solution or infinitely many solutions. Problems which have such undesirable properties are called ill-posed. Ill-posed problems are studied in Section 38.3.
38.3 Inverse and Ill-Posed Problems
The study of inverse problems has been one of the fastest-growing areas of applied mathematics in the last two decades. This growth has largely been driven by the needs of applications in both the natural sciences (e.g., inverse scattering theory, astronomical image restoration, and statistical learning theory) and industry (e.g., computerized tomography, remote sensing). The reader is referred to References 8 to 11 for detailed treatments of the theory of ill-posed problems and to References 12 and 13 for applications in inverse scattering and statistical inference, respectively.
By definition, inverse problems are concerned with determining causes for a desired or an observed effect. Most often, inverse problems are much more difficult to deal with (from a mathematical point of view) than their direct counterparts. This is because they might not have a solution in the strict sense, or their solutions might not be unique or might not depend on the data continuously. Mathematical problems having such undesirable properties are called ill-posed problems and cause severe numerical difficulties (mostly because of the discontinuous dependence of solutions on the data).
Formally, a problem of mathematical physics is called well-posed, or well-posed in the sense of Hadamard, if it fulfills the following conditions:
1. For all admissible data, a solution exists.
2. For all admissible data, the solution is unique.
3. The solution depends continuously on the data.
A problem for which one or more of the above conditions are violated is called ill-posed. Note that the conditions mentioned do not constitute a precise definition of well-posedness. To make a precise definition in a concrete situation, one has to specify the notion of a solution, which data are considered admissible, and which topology is used for measuring continuity.
The study of concrete ill-posed problems often involves the question "how can one enforce uniqueness by additional information or assumptions?" Not much can be said about this in a general context. However, the lack of stability, and its restoration by appropriate methods known as regularization methods, can be treated in sufficient generality. The theory of regularization is well developed for linear inverse problems and will be introduced, very briefly, in Section 38.3.1.
38.3.1 Ill-Posed Linear Operator Equations
Let the linear operator equation

Ax = y    (38.5)

be defined by the continuous operator A that maps elements x of a metric space E_1 onto elements y of the metric space E_2. In the early 1900s, the noted French mathematician Jacques Hadamard observed that under some (very general) circumstances the problem of solving the operator equation (38.5) is ill-posed: even if there exists a unique solution x ∈ E_1 that satisfies the equality (38.5), a small deviation on the right-hand side can cause large deviations in the solution. The following example illustrates this issue.
Example 38.1 Let A denote a Fredholm integral operator of the first kind; that is, define

(Ax)(s) = ∫_a^b K(s, t) x(t) dt    (38.6)

The kernel K(s, t) is continuous on [a, b] × [a, b] and maps a function x(t) continuous on [a, b] to a function y(s) also continuous on [a, b]. We observe that the continuous function

g_ω(s) = ∫_a^b K(s, t) sin(ωt) dt    (38.7)

which is formed by means of the kernel K(s, t), possesses the property

lim_{ω→∞} g_ω(s) = 0    (38.8)

Now consider the perturbed equation

∫_a^b K(s, t) x(t) dt = y(s) + g_ω(s)    (38.9)

where g_ω(s) is defined in (38.7). Since the above equation is linear, it follows using (38.7) that its solution x(t) has the form

x(t) = x*(t) + sin(ωt)    (38.10)

where x*(t) is a solution to the original integral equation Ax = y. For sufficiently large ω, the right-hand side of (38.9) differs from the right-hand side of (38.5) only by the small amount g_ω(s), yet the corresponding solutions differ by sin(ωt), which does not become small. Solving (38.5) directly is therefore unstable with respect to small perturbations of the data. □

A standard approach to solving (38.5) approximately is to seek the element x* ∈ E_1 such that the functional

R(x) = ||Ax − y||_{E_2}    (38.11)

is minimized.³ Note that the minimizing element x* ∈ E_1 always exists even when the original equation (38.5) does not have a solution. In any case, if the right-hand side of (38.5) is not exact, that is, if we replace y by y_δ such that ||y − y_δ||_{E_2} < δ where δ is a small value, a new element x_δ ∈ E_1 will minimize the functional

R_δ(x) = ||Ax − y_δ||_{E_2}    (38.12)

However, the new minimizer x_δ need not be close to x*: ||x_δ − x*||_{E_1} need not tend to 0 as δ → 0 when the operator equation Ax = y is ill-posed.
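The instability described above is easy to observe numerically. The sketch below discretizes a Fredholm operator of the first kind with a hypothetical Gaussian kernel and shows that a data perturbation on the order of 1e-8 ruins the naive least-squares reconstruction, because the discretized operator is severely ill-conditioned.

```python
import numpy as np

# Discretise (38.6) with the smooth, hypothetical kernel
# K(s, t) = exp(-(s - t)^2) on [0, 1] (rectangle rule, n points).
n = 60
t = np.linspace(0.0, 1.0, n)
A = np.exp(-(t[:, None] - t[None, :]) ** 2) / n

x_true = np.sin(2 * np.pi * t)     # the "cause" we try to recover
y = A @ x_true                     # the exactly observed "effect"

# A perturbation far below any realistic measurement noise floor...
y_noisy = y + 1e-8 * np.random.default_rng(1).standard_normal(n)

# ...still destroys the naive least-squares reconstruction.
x_rec = np.linalg.lstsq(A, y_noisy, rcond=None)[0]
amplification = np.linalg.norm(x_rec - x_true) / np.linalg.norm(y_noisy - y)
```

The condition number of the discretized operator, and hence the error amplification factor, grows without bound as the discretization is refined.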
38.3.2 Regularization Methods for Solving Ill-Posed Linear Operator Equations
Hadamard [15] thought that ill-posed problems were a purely mathematical phenomenon and that all real-life problems are well-posed. However, in the second half of the 20th century, a number of very important real-life problems were found to be ill-posed. In particular, as we just discussed, ill-posed problems arise when one tries to reverse cause–effect relations to find unknown causes from known consequences. Even if the cause–effect relationship forms a one-to-one mapping, the problem of inverting it can be

³To save on notation, we write ||a − b||_E to denote the distance between two elements a, b ∈ E whether the metric space E is a normed space or not. If E is a normed space, our notation is self-evident. Otherwise, it should be interpreted only as a symbol for the distance between a and b.
ill-posed. The discovery of various regularization methods by Tikhonov, Ivanov, and Phillips in the early 1960s made it possible to construct a sequence of well-posed solutions that converges to the desired one. Regularization theory was one of the first signs of the existence of intelligent inference: it demonstrated that where the self-evident methods of solving an operator equation might not work, the non-self-evident methods of regularization theory do. The influence of the philosophy created by the theory of regularization is very deep. Both the regularization philosophy and the regularization techniques have become widely disseminated in many areas of science and engineering [10,11].
38.3.2.1 Tikhonov's Method
In the early 1960s, A.N. Tikhonov [16,17] discovered that if, instead of the functional R_δ(x), one minimizes

R_reg(x) = ||Ax − y_δ||_{E_2} + γ(δ) S(x)    (38.13)

where S(x) is a stabilizing functional (belonging to a certain class of functionals) and γ(δ) is an appropriately chosen constant (whose value depends on the noise level δ), then one obtains a sequence of solutions x_δ that converges to the desired one as δ tends to zero. For the above result to be valid, it is required that:
1. The problem of minimizing R_reg(x) be well-posed for fixed values of δ and γ(δ).
2. lim_{δ→0} ||x_δ − x||_{E_1} = 0 when γ(δ) is chosen appropriately, where x is the desired exact solution.
Consider a real-valued lower semicontinuous⁴ functional S(x). We shall call S(x) a stabilizing functional if it possesses the following properties:
1. The solution of the operator equation Ax = y belongs to the domain of definition D(S) of the functional S.
2. S(x) ≥ 0 for all x ∈ D(S).
3. The level sets {x : S(x) ≤ c}, c = const., are all compact.
It turns out that the above conditions are sufficient for the problem of minimizing R_reg(x) to be well-posed [8, page 51]. Now, the important remaining problem is to determine the functional relationship between δ and γ(δ) such that the sequence of solutions obtained by minimizing (38.13) converges to the solution of (38.5) as δ tends to zero. The following theorem establishes sufficient conditions on such a relationship:
Theorem 38.1 [13, page 55] Let E₁ and E₂ be two metric spaces and let A : E₁ → E₂ be a continuous and
one-to-one operator. Suppose that for y ∈ E₂ there exists a solution x ∈ D(S) ⊂ E₁ to the operator equation
Ax = y. Let y_δ be an element in E₂ such that ‖y − y_δ‖_{E₂} ≤ δ. If the parameter γ(δ) is chosen such that:

(i) γ(δ) → 0 when δ → 0,
(ii) lim_{δ→0} δ²/γ(δ) < ∞,

then the elements x_δ minimizing R_reg(x) = ‖Ax − y_δ‖²_{E₂} + γ(δ) S(x)
converge to the exact solution x as δ → 0.
If E₁ is a Hilbert space, the stabilizing functional S(x) may simply be chosen as ‖x‖², which, indeed,
is the original choice made by Tikhonov. In this case, the level sets of S(x) will only be weakly compact.
However, the convergence of the regularized solutions will be a strong one in view of the properties of
Hilbert spaces. The conditions imposed on the parameter γ(δ) are, nevertheless, more stringent than
those stated in the above theorem.⁵

⁴A function f : Rᴺ → [−∞, ∞] is called lower semicontinuous at X ∈ Rᴺ if for any t < f(X) there exists
δ > 0 such that for all y ∈ B(X, δ), t < f(y). The notation B(X, δ) represents a ball with center at X and radius δ. This
definition generalizes to functional spaces by using the appropriate metric in defining B(X, δ).
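The stabilizing effect of the penalty term γ(δ)S(x) can be illustrated with a small numerical sketch. The data below are hypothetical; S(x) = ‖x‖² is taken as in the Hilbert-space case, so the minimizer of (38.13) solves the normal equations (AᵀA + γI)x = Aᵀy:

```python
# Tikhonov regularization with S(x) = ||x||^2: minimize ||Ax - y||^2 + g*||x||^2.
# The minimizer solves the normal equations (A^T A + g I) x = A^T y.
# A is a deliberately ill-conditioned 2x2 matrix; the data are hypothetical.

def solve2(M, b):
    """Solve a 2x2 linear system M x = b by Cramer's rule."""
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return [(b[0] * M[1][1] - b[1] * M[0][1]) / det,
            (M[0][0] * b[1] - M[1][0] * b[0]) / det]

def tikhonov(A, y, g):
    # Form A^T A + g I and A^T y, then solve the 2x2 system.
    AtA = [[sum(A[k][i] * A[k][j] for k in range(2)) + (g if i == j else 0.0)
            for j in range(2)] for i in range(2)]
    Aty = [sum(A[k][i] * y[k] for k in range(2)) for i in range(2)]
    return solve2(AtA, Aty)

A = [[1.0, 1.0], [1.0, 1.0001]]   # nearly singular operator
y_noisy = [2.0, 2.01]             # perturbation of exact data [2.0, 2.0001],
                                  # whose exact solution is x = (1, 1)

x_plain = tikhonov(A, y_noisy, 0.0)    # unregularized: wildly off
x_reg = tikhonov(A, y_noisy, 1e-4)     # regularized: close to (1, 1)
print(x_plain, x_reg)
```

Even a tiny penalty (γ = 10⁻⁴) pulls the solution back near (1, 1), while the unregularized solution is thrown far away by a 0.5% data perturbation.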
38.3.2.2 The Residual Method
The results presented above are fundamental in Tikhonov's theory of regularization. Tikhonov's theory,
however, is only one of several proposed schemes for solving ill-posed problems. An important variation
known as the Residual Method was introduced by Phillips [18]. In Phillips's method one minimizes the
functional

    R_P(x) = S(x)

subject to the constraint

    ‖Ax − y_δ‖_{E₂} ≤ μ

where μ is a fixed constant. The stabilizing functional S(x) is defined as in Section 38.3.2.1.
38.3.2.3 The Quasi-Solution Method
The quasi-solution method was developed by Ivanov [19,20]. In this method, one minimizes the functional

    R_I(x) = ‖Ax − y_δ‖_{E₂}

subject to the constraint

    S(x) ≤ σ

where σ is a fixed constant. Again, the stabilizing functional S(x) is defined as in Tikhonov's method.
Note that the three regularization methods mentioned contain one free parameter (γ in Tikhonov's
method, μ for Phillips's method, and σ in Ivanov's method). It has been shown [21] that these methods
are all equivalent in the sense that if one of the methods (say Phillips's) for a given value of its parameter
(say μ*) produces a solution x*, the other two methods produce the same solution for appropriately
chosen values of their respective parameters.

Example 38.2 The functions

    D₁(P₁, P₂) = ∫_{−π}^{π} (P₁ − P₂)² dω

    D₂(P₁, P₂) = ∫_{−π}^{π} ( P₁ ln(P₁/P₂) + P₂ − P₁ ) dω

    D₃(P₁, P₂) = ∫_{−π}^{π} ( P₁/P₂ − ln(P₁/P₂) − 1 ) dω
can be used to measure the generalized distance between P₁(e^{jω}) and P₂(e^{jω}). These functions are
nonnegative and become zero if and only if P₁ = P₂. Note that D₁ is simply the Euclidean distance between
P₁ and P₂. The functions D₂ and D₃ have roots in information theory and statistics. They are known as
the Kullback-Leibler divergence and the Burg cross entropy, respectively.
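For spectra sampled on a uniform frequency grid, the three distances can be approximated by Riemann sums. The following sketch (the grid size and example spectra are hypothetical) illustrates that each distance vanishes exactly when P₁ = P₂ and is positive otherwise:

```python
import math

# Discretized versions of the three generalized distances of Example 38.2,
# approximating each integral over [-pi, pi) by a Riemann sum on an N-point grid.

def distances(P1, P2, N):
    dw = 2 * math.pi / N    # grid spacing
    d1 = sum((a - b) ** 2 for a, b in zip(P1, P2)) * dw
    d2 = sum(a * math.log(a / b) + b - a for a, b in zip(P1, P2)) * dw  # Kullback-Leibler
    d3 = sum(a / b - math.log(a / b) - 1 for a, b in zip(P1, P2)) * dw  # Burg cross entropy
    return d1, d2, d3

N = 512
w = [-math.pi + 2 * math.pi * k / N for k in range(N)]
P_flat = [1.0] * N                               # flat reference spectrum P0
P_lp = [1.0 + 0.5 * math.cos(wk) for wk in w]    # a smooth low-pass-like spectrum

print(distances(P_flat, P_flat, N))   # all three vanish when P1 == P2
print(distances(P_lp, P_flat, N))     # strictly positive otherwise
```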
By using a suitable generalized distance, we can convert our original sensor network spectrum
estimation problem (Problem 38.1) into the following minimization problem:

Problem 38.2
Let Q be defined as in Problem 38.1. Find P_x(e^{jω}) in Q such that

    P* = arg min_{P ∈ Q} D(P, P₀)    (38.14)

where P₀(e^{jω}) is an arbitrary power spectrum, say P₀(e^{jω}) = 1, −π ≤ ω < π.
When a unique P* exists, it is called the generalized projection of P₀ onto Q [23]. In general, a projection
of a given point onto a convex set is defined as another point which has two properties: first, it belongs to
the set onto which the projection operation is performed and, second, it renders a minimal value to the
distance between the given point and any point of the set (Figure 38.4).
If the Euclidean distance ‖X − Y‖ is used in this context then the projection is called a metric projection.
In some cases, such as the spectrum estimation problem considered here, it turns out to be very useful to
introduce more general means to measure the distance between two vectors. The main reason is that the
functional form of the solution will depend on the choice of the distance measure used in the projection.
Often, a functional form which is easy to manipulate or interpret (for instance, a rational function) cannot
be obtained using the conventional Euclidean metric.
It can be shown that the distances D₁ and D₂ in Example 38.2 lead to well-posed solutions for P*. The
choice D₃ will lead to a unique solution given that certain singular power spectra are excluded from the
space of valid solutions [24]. It is not known whether D₃ will lead to a stable solution. As a result,
the well-posedness of Problem 38.2 when D₃ is used is not yet established.⁶

⁶Well-posedness of the minimization problem (38.14) when D is the Kullback-Leibler divergence D₂ has been
established in several works including References 25 to 29. Well-posedness results exist for certain classes of generalized
distance functions as well [29,30]. Unfortunately, the Burg cross entropy D₃ does not belong to any of these classes.
While the Burg cross entropy lacks theoretical support as a regularizing functional, it has been used successfully to resolve
ill-posed problems in several applications including spectral estimation and image restoration (see, e.g., [31] and
references therein). The desirable feature of the Burg cross entropy in the context of spectrum estimation is that its
minimization (subject to the linear constraints P_x(e^{jω}) ∈ Q) leads to rational power spectra.
FIGURE 38.4 Symbolic depiction of metric projection (a) and generalized projection (b) of a vector Y onto a closed
convex set Q. In (a) the projection X* is selected by minimizing the metric ‖X − Y‖ over all X ∈ Q, while in (b) X*
is found by minimizing the generalized distance D(X, Y) over the same set.
38.5 Distributed Algorithms for Calculating
Generalized Projection

As we mentioned before, a very interesting aspect of the generalized projections formulation is that the
solution can be computed by means of successive generalized projections

    P* = arg min_{P ∈ Q_{i,k}} D_j(P, P₁)    (38.15)

onto the individual constraint sets Q_{i,k}.
Using standard techniques from calculus of variations we can show that the generalized distances D₁, D₂,
and D₃ introduced in Example 38.2 result in projections of the form

    P*_{[P₁→Q_{i,k}; D₁]} = P₁(e^{jω}) − λ G_i(e^{jω}) cos(Mkω)

    P*_{[P₁→Q_{i,k}; D₂]} = P₁(e^{jω}) exp(−β G_i(e^{jω}) cos(Mkω))

    P*_{[P₁→Q_{i,k}; D₃]} = ( P₁(e^{jω})⁻¹ + η G_i(e^{jω}) cos(Mkω) )⁻¹

where λ, β, and η are parameters (Lagrange multipliers). These parameters should be chosen such that in
each case P*_{[P₁→Q_{i,k}; D_j]} ∈ Q_{i,k}. That is,

    (1/2π) ∫_{−π}^{π} P*_{[P₁→Q_{i,k}; D_j]}(e^{jω}) G_i(e^{jω}) e^{jMkω} dω = R_{v_i}(k)    (38.16)
The reader may observe that the above equation leads to a closed-form formula for λ but, in general,
finding β and η requires numerical methods. The projection formulae developed above can be employed
in a variety of iterative algorithms to find a solution in the intersection of the sets Q_{i,k}. We discuss two
example algorithms below.
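The contrast between the closed-form and the numerically determined multipliers can be sketched in a discrete analogue. The vectors p0 and g and the target value r below are hypothetical; the linear constraint sum(g*p) = r stands in for (38.16). Under the Euclidean distance the multiplier has a closed form, while under the Kullback-Leibler divergence it is found here by bisection:

```python
import math

# Projection of a discrete "spectrum" p0 onto the linear constraint
# sum(g[i] * p[i]) == r, under two different distances:
#   Euclidean:         p = p0 + lam * g        (additive form)
#   Kullback-Leibler:  p = p0 * exp(lam * g)   (multiplicative form)

p0 = [1.0, 2.0, 0.5, 1.5]
g = [0.2, 0.4, 0.1, 0.3]
r = 1.0

# Euclidean: the multiplier has a closed form from the constraint.
lam_e = (r - sum(gi * pi for gi, pi in zip(g, p0))) / sum(gi * gi for gi in g)
p_euc = [pi + lam_e * gi for pi, gi in zip(p0, g)]

# KL: solve sum(g * p0 * exp(lam * g)) == r for lam by bisection
# (the left-hand side is strictly increasing in lam since g, p0 > 0).
def constraint(lam):
    return sum(gi * pi * math.exp(lam * gi) for gi, pi in zip(g, p0)) - r

lo, hi = -100.0, 100.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if constraint(mid) > 0:
        hi = mid
    else:
        lo = mid
lam_kl = 0.5 * (lo + hi)
p_kl = [pi * math.exp(lam_kl * gi) for pi, gi in zip(p0, g)]

print(p_euc, p_kl)   # both satisfy the constraint, with different functional forms
```

Note that the KL projection keeps every component strictly positive, one reason the non-Euclidean distances are attractive for power spectra.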
38.5.1 The Ring Algorithm
The Ring Algorithm is a very simple algorithm: it starts with an initial guess P⁽⁰⁾ for P_x(e^{jω}) and then
calculates a series of successive projections onto the constraint sets Q_{i,k}. Then, it takes the last projection,
now called P⁽¹⁾, and projects it back onto the first constraint set. Continuing this process will generate
2006 by Taylor & Francis Group, LLC
38-12 Embedded Systems Handbook
The Ring Algorithm

Input: A distance function D_j(P₁, P₂), an initial power spectrum P₀(e^{jω}), the squared sensor frequency
responses G_i(e^{jω}), and the autocorrelation estimates R_{v_i}(k) for k = 0, 1, ..., L−1 and i = 1, 2, ..., N.

Output: A power spectrum P*(e^{jω}).

Procedure:

1. Let m = 0, i = 1, and P⁽ᵐ⁾ = P₀.
2. Send P⁽ᵐ⁾ to the ith sensor node.
   At the ith sensor:
   (i) Let k = 0 and define P̃₀ = P⁽ᵐ⁾.
   (ii) Calculate P̃ₖ = P*_{[P̃ₖ₋₁→Q_{i,k}; D_j]} for k = 1, 2, ..., L−1.
   (iii) If D(P̃_{L−1}, P̃₀) > ε then let P̃₀ = P̃_{L−1} and go back to item (ii). Otherwise, let i = i + 1
   and go to Step 3.
3. If (i mod N) = 1 then set m = m + 1 and reset i to 1. Otherwise, set P⁽ᵐ⁾ = P̃_{L−1} and go back
   to Step 2.
4. Define P⁽ᵐ⁾ = P̃_{L−1}. If D(P⁽ᵐ⁾, P⁽ᵐ⁻¹⁾) > ε, go back to Step 2. Otherwise output P* = P⁽ᵐ⁾
   and stop.
a sequence of solutions P⁽⁰⁾, P⁽¹⁾, P⁽²⁾, ... which will eventually converge to a solution P* ∈ ∩_{i,k} Q_{i,k} [22].
Steps of the Ring Algorithm are summarized in the text box above. A graphical representation of this
algorithm is shown in Figure 38.5.
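The control flow of the Ring Algorithm can be sketched in a few lines. In the sketch below, simple intervals on the real line stand in for the spectral constraint sets Q_{i,k}, and Euclidean projection replaces the generalized projections; all sets, tolerances, and iteration limits are hypothetical:

```python
# A minimal sketch of the Ring Algorithm's control flow, using Euclidean
# projections onto intervals (1-D convex sets) in place of the constraint
# sets Q_{i,k}. Each "node" owns L intervals; the estimate is passed around
# the ring until successive rounds agree to within eps.

def project(x, interval):
    lo, hi = interval
    return min(max(x, lo), hi)

# Three nodes, each with two constraint intervals; intersection is [0.4, 0.5].
nodes = [[(0.0, 0.5), (-1.0, 2.0)],
         [(0.2, 3.0), (0.4, 0.9)],
         [(-2.0, 0.8), (0.3, 1.7)]]

eps = 1e-9
x = 10.0                      # initial guess P(0)
for _ in range(100):          # rounds m
    x_prev = x
    for sets in nodes:        # visit nodes in ring order
        while True:           # local projection cycle until it stabilizes
            x_old = x
            for s in sets:
                x = project(x, s)
            if abs(x - x_old) <= eps:
                break
    if abs(x - x_prev) <= eps:
        break

print(x)   # a point in the intersection of all constraint sets
```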
Example 38.3 Consider a simple four-sensor network similar to the one shown in Figure 38.5. Assume
that the down-sampling ratio in each Mote is equal to four. Thus, N₀ = N₁ = N₂ = N₃ = 4. Assume,
further, that the transfer functions H₀(z) to H₃(z) which relate the Motes' front-end outputs v_i(n) to the
original source signal x(n) are given as follows:
    H₀(z) = (0.0753 + 0.1656z⁻¹ + 0.2053z⁻² + 0.1659z⁻³ + 0.0751z⁻⁴) / (1.0000 − 0.8877z⁻¹ + 0.6738z⁻² − 0.1206z⁻³ + 0.0225z⁻⁴)

    H₁(z) = (0.4652 − 0.1254z⁻¹ − 0.3151z⁻² + 0.0975z⁻³ − 0.0259z⁻⁴) / (1.0000 − 0.6855z⁻¹ + 0.3297z⁻² − 0.0309z⁻³ + 0.0032z⁻⁴)

    H₂(z) = (0.3732 − 0.8648z⁻¹ + 0.7139z⁻² − 0.1856z⁻³ − 0.0015z⁻⁴) / (1.0000 − 0.5800z⁻¹ + 0.5292z⁻² − 0.0163z⁻³ + 0.0107z⁻⁴)

    H₃(z) = (0.1931 − 0.4226z⁻¹ + 0.3668z⁻² − 0.0974z⁻³ − 0.0405z⁻⁴) / (1.0000 + 0.2814z⁻¹ + 0.3739z⁻² + 0.0345z⁻³ − 0.0196z⁻⁴)
The above transfer functions were chosen to show typical low-pass, band-pass, and high-pass characteristics
(Figure 38.6). They were obtained using standard filter design techniques. The input signal
whose power spectrum is to be estimated was chosen to have a smooth low-pass spectrum. We used the
Ring Algorithm with L = 4 and the Euclidean metric D₁ as the distance function to estimate the input
signal's spectrum. The results are shown in Figure 38.7. As seen in this figure, the algorithm converges to a
solution which is, in this case, almost identical to the actual input spectrum in less than 100 rounds.
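As a quick consistency check of the coefficients above (whose signs are reconstructed here from the low-pass, band-pass, and high-pass reading of Figure 38.6), the magnitude responses can be evaluated directly on the unit circle:

```python
import math

# Evaluate the magnitude responses |H_i(e^{jw})| of Example 38.3 from the
# numerator/denominator coefficients (coefficients of z^0 ... z^-4).

H = [
    ([0.0753, 0.1656, 0.2053, 0.1659, 0.0751],
     [1.0000, -0.8877, 0.6738, -0.1206, 0.0225]),
    ([0.4652, -0.1254, -0.3151, 0.0975, -0.0259],
     [1.0000, -0.6855, 0.3297, -0.0309, 0.0032]),
    ([0.3732, -0.8648, 0.7139, -0.1856, -0.0015],
     [1.0000, -0.5800, 0.5292, -0.0163, 0.0107]),
    ([0.1931, -0.4226, 0.3668, -0.0974, -0.0405],
     [1.0000, 0.2814, 0.3739, 0.0345, -0.0196]),
]

def mag(num, den, w):
    # Evaluate |N(z)/D(z)| at z = e^{jw}.
    z = complex(math.cos(w), math.sin(w))
    n = sum(c * z ** (-k) for k, c in enumerate(num))
    d = sum(c * z ** (-k) for k, c in enumerate(den))
    return abs(n / d)

# H0 is low-pass: close to unity gain at w = 0, small gain at w = pi.
print(mag(*H[0], 0.0), mag(*H[0], math.pi))
```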
38.5.2 The Star Algorithm
The Ring Algorithm is completely decentralized. However, it will not converge to a solution if the feasible
sets Q_{i,k} do not have an intersection (which can happen owing to measurement noise) or one or more
FIGURE 38.5 Graphical depiction of the Ring Algorithm. For illustrative reasons, only three feasible sets Q_{i,k}
are shown in the inside picture. Also, it is shown that the output spectrum P⁽ᵐ⁾(e^{jω}) is obtained from the input
P⁽ᵐ⁾(e^{jω}) only after three projections. In practice, each sensor node has L feasible sets and has to repeat the
sequence of projections many times before it can successfully project the input P⁽ᵐ⁾(e^{jω}) into the intersection of its
feasible sets.
FIGURE 38.6 Frequency response amplitudes of the transfer functions used in Example 38.3. The curves show, from
left to right, |H₀(e^{jω})|, |H₁(e^{jω})|, |H₂(e^{jω})|, and |H₃(e^{jω})|.
sensors in the network are faulty. The Star Algorithm is an alternative distributed algorithm for fusing
the individual sensors' data. It combines successive projections onto Q_{i,k} with a kind of averaging operation to
generate a sequence of solutions P⁽ᵐ⁾. This sequence will eventually converge to a solution P* ∈ ∩_{i,k} Q_{i,k} if
one exists. The Star Algorithm is fully parallel and hence much faster than the Ring Algorithm. It provides
(Panels (a) to (f): m = 0, 1, 4, 10, 20, 100; axes: radian frequency ω versus power spectrum.)

FIGURE 38.7 Ring Algorithm convergence results. In each figure, the dashed curve shows the source signal's actual
power spectrum while the solid curve is the estimate obtained by the Ring Algorithm after m rounds. A round means
projections have been passed through all the nodes in the network.
some degree of robustness to individual node failures as well. However, it includes a centralized step
which needs to be accommodated when the system's network protocol is being designed. Steps of the
Star Algorithm are summarized in the text box below. A graphical representation of this algorithm is
shown in Figure 38.8.
FIGURE 38.8 The Star Algorithm. Again, only three feasible sets Q_{i,k} are shown in the inside picture. In practice,
each sensor node has to repeat the sequence of projections and averaging many times before it can successfully
project the input P⁽ᵐ⁾(e^{jω}) supplied by the central node into the intersection of its feasible sets. The projection
result, called P_i⁽ᵐ⁾(e^{jω}), is sent back to the central node. The central node then averages all the P_i⁽ᵐ⁾(e^{jω}) it
has received to produce P⁽ᵐ⁺¹⁾(e^{jω}). This is sent back to the individual nodes and the process repeats.
The Star Algorithm

Input: A distance function D_j(P₁, P₂), an initial power spectrum P₀(e^{jω}), the squared sensor frequency
responses G_i(e^{jω}), and the autocorrelation estimates R_{v_i}(k).

Output: A power spectrum P*(e^{jω}).

Procedure:

1. Let m = 0 and P⁽⁰⁾ = P₀.
2. Send P⁽ᵐ⁾ to all sensor nodes.
   At the ith sensor:
   (i) Let n = 0 and define P̃⁽ⁿ⁾ = P⁽ᵐ⁾.
   (ii) Calculate P̃ₖ = P*_{[P̃⁽ⁿ⁾→Q_{i,k}; D_j]} for all k.
   (iii) Calculate P̃⁽ⁿ⁺¹⁾ = arg min_P Σₖ D(P, P̃ₖ).
   (iv) If D(P̃⁽ⁿ⁺¹⁾, P̃⁽ⁿ⁾) > ε go to item (ii) and repeat. Otherwise, define P_i⁽ᵐ⁾ = P̃⁽ⁿ⁺¹⁾ and
   send it to the central unit.
3. Receive P_i⁽ᵐ⁾ from all sensors and calculate P⁽ᵐ⁺¹⁾ = arg min_P Σᵢ D(P, P_i⁽ᵐ⁾).
4. If D(P⁽ᵐ⁺¹⁾, P⁽ᵐ⁾) > ε, go to Step 2 and repeat. Otherwise stop and output P* = P⁽ᵐ⁺¹⁾.
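The structure of the Star Algorithm can also be sketched with intervals standing in for the sets Q_{i,k}. Under the Euclidean distance D₁, the central step arg min_P Σᵢ D(P, P_i⁽ᵐ⁾) reduces to plain averaging; the sets and tolerances below are hypothetical:

```python
# A minimal sketch of the Star Algorithm's control flow with Euclidean
# distance: each node projects the current estimate into the intersection of
# its own intervals, and the central unit averages the returned values
# (for the squared Euclidean distance, arg min_P sum_i D(P, P_i) is the mean).

def project(x, interval):
    lo, hi = interval
    return min(max(x, lo), hi)

# Three nodes, each with two constraint intervals; intersection is [0.4, 0.5].
nodes = [[(0.0, 0.5), (-1.0, 2.0)],
         [(0.2, 3.0), (0.4, 0.9)],
         [(-2.0, 0.8), (0.3, 1.7)]]

eps = 1e-9
x = 10.0                          # initial guess P(0) at the central node
for _ in range(1000):             # rounds m
    results = []
    for sets in nodes:            # in a real network these run in parallel
        y = x
        while True:               # local cycle of projections at node i
            y_old = y
            for s in sets:
                y = project(y, s)
            if abs(y - y_old) <= eps:
                break
        results.append(y)
    x_next = sum(results) / len(results)   # central averaging step
    if abs(x_next - x) <= eps:
        x = x_next
        break
    x = x_next

print(x)   # converges into the intersection [0.4, 0.5]
```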
(Panels (a) to (f): m = 0, 1, 4, 10, 20, 100; axes: radian frequency ω versus power spectrum.)

FIGURE 38.9 Star Algorithm results.
Example 38.4 Consider a simple five-sensor network similar to the one shown in Figure 38.8. Assume
that the down-sampling ratio in each Mote is equal to four. Thus, again, N₀ = N₁ = N₂ = N₃ = 4.
Assume, further, that the transfer functions H₀(z) to H₃(z) which relate the Motes' front-end outputs
v_i(n) to the original source signal x(n) are the same as those introduced in Example 38.3. We simulated
the Star Algorithm with L = 4 and the Euclidean metric D₁ as the distance function to estimate
the input signal's spectrum. The results are shown in Figure 38.9. Like the Ring Algorithm, the Star
Algorithm also converges to a solution which is almost identical to the actual input spectrum in less than
100 rounds.
38.6 Concluding Remark

In this chapter we considered the problem of fusing the statistical information gained by a distributed
network of sensors. We provided a rigorous mathematical model for this problem where the solution
is obtained by finding a point in the intersection of finitely many closed convex sets. We investigated
distributed optimization algorithms to solve the problem without exchanging the raw observed data
among the sensors.

The information fusion theory presented in this chapter is by no means complete. Many issues regarding
both the performance and implementation of the two algorithms we introduced need to be investigated.
Other algorithms for solving the problem of finding the solution in the intersection of the feasible sets are
possible as well. We hope that our results point the way toward more complete theories and help to
give shape to the emerging field of signal processing for sensor networks.

MATLAB codes implementing the algorithms mentioned in this chapter are maintained online at
www.multirate.org.
Acknowledgments

The authors would like to thank Mr. Mayukh Roy for his help in drawing some of the figures. They are
also very grateful to the Editor, Dr. Richard Zurawski, for his patience and cooperation during the long
process of writing this chapter.
References
[1] H. Jeffreys, Theory of Probability, 3rd ed., Oxford University Press, London, 1967.
[2] R. von Mises, Mathematical Theory of Probability and Statistics, Academic Press, New York, 1964.
[3] S.M. Kay, Modern Spectrum Estimation: Theory and Applications, Prentice Hall, Upper Saddle
River, NJ, 1988.
[4] D.B. Percival and A.T. Walden, Statistical Digital Signal Processing and Modeling, Cambridge
University Press, London, 1993.
[5] M.H. Hayes, Statistical Signal Processing and Modeling, John Wiley and Sons, New York, 1996.
[6] B. Buttkus, Spectral Analysis and Filter Theory in Applied Geophysics, Springer-Verlag, Berlin,
2000.
[7] O.S. Jahromi, B.A. Francis, and R.H. Kwong, Spectrum estimation using multirate observations,
IEEE Transactions on Signal Processing, 52(7), 1878–1890, July 2004. Preprint available from
www.multirate.org.
[8] A.N. Tikhonov and V.Y. Arsenin, Solutions of Ill-Posed Problems, V.H. Winston & Sons,
Washington, DC, 1977.
[9] V.V. Vasin and A.L. Ageev, Ill-Posed Problems with A Priori Information, VSP, Utrecht,
The Netherlands, 1995.
[10] H.W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Kluwer Academic
Publishers, Dordrecht, The Netherlands, 1996.
[11] A.N. Tikhonov, A.S. Leonov, and A.G. Yagola, Nonlinear Ill-Posed Problems, Chapman & Hall,
London, 1998, 2 vols.
[12] K. Chadan, D. Colton, L. Päivärinta, and W. Rundell, An Introduction to Inverse Scattering and
Inverse Spectral Problems, SIAM, Philadelphia, 1997.
[13] V. Vapnik, Statistical Learning Theory, John Wiley & Sons, New York, 1999.
[14] F. Jones, Lebesgue Integration on Euclidean Space, Jones and Bartlett Publishers, Boston, MA, 1993.
[15] J. Hadamard, Lectures on Cauchy's Problem in Linear Partial Differential Equations, Yale University
Press, New Haven, CT, 1923.
[16] A.N. Tikhonov, On solving ill-posed problems and the method of regularization, Doklady Akademii
Nauk SSSR, 151, 501–504, 1963 (in Russian), English translation in Soviet Math. Dokl.
[17] A.N. Tikhonov, On the regularization of ill-posed problems, Doklady Akademii Nauk SSSR, 153,
49–52, 1963 (in Russian), English translation in Soviet Math. Dokl.
[18] D.L. Phillips, A technique for numerical solution of certain integral equations of the first kind,
Journal of the Association for Computing Machinery, 9, 84–97, 1962.
[19] V.K. Ivanov, Integral equations of the first kind and the approximate solution of an inverse potential
problem, Doklady Akademii Nauk SSSR, 142, 997–1000, 1962 (in Russian), English translation in
Soviet Math. Dokl.
[20] V.K. Ivanov, On linear ill-posed problems, Doklady Akademii Nauk SSSR, 145, 270–272, 1962
(in Russian), English translation in Soviet Math. Dokl.
[21] V.V. Vasin, Relationship of several variational methods for approximate solutions of ill-posed
problems, Mathematical Notes, 7, 161–166, 1970.
[22] Y. Censor and S.A. Zenios, Parallel Optimization: Theory, Algorithms, and Applications, Oxford
University Press, Oxford, 1997.
[23] H.H. Bauschke and J.M. Borwein, On projection algorithms for solving convex feasibility problems,
SIAM Review, 38, 367–426, 1996.
[24] J.M. Borwein and A.S. Lewis, Partially-finite programming in L₁ and the existence of maximum
entropy estimates, SIAM Journal on Optimization, 3, 248–267, 1993.
[25] M. Klaus and R.T. Smith, A Hilbert space approach to maximum entropy regularization,
Mathematical Methods in the Applied Sciences, 10, 397–406, 1988.
[26] U. Amato and W. Hughes, Maximum entropy regularization of Fredholm integral equations of
the first kind, Inverse Problems, 7, 793–808, 1991.
[27] J.M. Borwein and A.S. Lewis, Convergence of best maximum entropy estimates, SIAM Journal on
Optimization, 1, 191–205, 1991.
[28] P.P.B. Eggermont, Maximum entropy regularization for Fredholm integral equations of the first
kind, SIAM Journal on Mathematical Analysis, 24, 1557–1576, 1993.
[29] M. Teboulle and I. Vajda, Convergence of best φ-entropy estimates, IEEE Transactions on
Information Theory, 39(1), 297–301, 1993.
[30] A.S. Leonov, A generalization of the maximal entropy method for solving ill-posed problems,
Siberian Mathematical Journal, 41, 716–724, 2000.
[31] N. Wu, The Maximum Entropy Method, Springer-Verlag, Berlin, 1997.
39
Sensor Network
Security
Guenter Schaefer
Fachgebiet Telematik/Rechnernetze
Technische Universitaet Ilmenau
Berlin
39.1 Introduction and Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . 39-2
39.2 DoS and Routing Security. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39-4
39.3 Energy Efficient Confidentiality and Integrity . . . 39-7
39.4 Authenticated Broadcast . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39-11
39.5 Alternative Approaches to Key Management . . . . . . . . . . 39-13
39.6 Secure Data Aggregation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39-19
39.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39-21
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39-22
This chapter gives an introduction to the specific security challenges in wireless sensor networks and some
of the approaches to overcome them that have been proposed so far. As this area of research is very active
at the time of writing, it is to be expected that more approaches are going to be proposed as the field gets
more mature, so this chapter should be understood as a snapshot rather than a definitive account of the
field.

When thinking of wireless sensor network security, one major question that comes to mind is: what are
the differences between security in sensor networks and general network security? Well, in both cases
one usually aims to ensure certain security objectives (also called security goals). In general, the following
objectives are pursued: authenticity of communicating entities and messages (data integrity), confidentiality,
controlled access, availability of communication services, and nonrepudiation of communication acts [1].
And basically, these are the same objectives that need to be ensured in wireless sensor networks (with
maybe the exception of nonrepudiation, which is of less interest at the level on which sensor networks
operate). Also, in both cases cryptographic algorithms and protocols [2] are the main tools to be deployed
for ensuring these objectives. So, from a high-level point of view, one could come to the conclusion that
sensor network security does not add much to what we already know from network security in general,
and thus the same methods could be applied in sensor networks as in classical fixed or wireless networks.

However, closer consideration reveals various differences that have their origins in specific characteristics
of wireless sensor networks, so that straightforward application of known techniques is not appropriate.
In this chapter we, therefore, first point out these characteristics and give an overview of the specific threats
and security challenges in sensor networks. The remaining sections of the chapter then deal in more detail
with the identified challenges, which are: Denial of Service (DoS) and routing security, energy efficient
confidentiality and integrity, authenticated broadcast, alternative approaches to key management, and
secure data aggregation.
39.1 Introduction and Motivation

The main characteristics of wireless sensor networks can be summarized as follows; they are
envisaged to be:

Formed by tens to thousands of small, inexpensive sensors that communicate over a wireless
interface.
Connected via base stations to traditional networks/hosts running applications interested in the
sensor data.
Using multi-hop communications among sensors in order to bridge the distance between sensors
and base stations.
Considerably resource constrained owing to limited energy availability.

To get an impression of the processing capabilities of a wireless sensor node, one should have the following
example of a sensor node in mind: a node running an 8-bit CPU at 4 MHz clock frequency, with 4 KB
of its 8 KB flash read-only memory free, 512 bytes of SRAM main memory, and a 19.2 Kbit/sec radio interface,
the node being powered by a battery.

Typical applications envisaged for wireless sensor networks are environment monitoring (earthquake or
fire detection, etc.), home monitoring and convenience applications, site surveillance (intruder detection),
logistics and inventory applications (tagging and locating goods, containers, etc.), as well as military
applications (battleground reconnaissance, troop coordination, etc.). The fundamental communication
pattern to be used in such a network consists of an application demanding some named information in a
specific geographical area. Upon this request, one or more base stations broadcast the request, and wireless
sensors relay the request and generate answers to it if they contribute to the requested information. The
answers are then processed and aggregated as they flow through the network toward the base station(s).
Figure 39.1 shows an exemplary sensor network topology as currently designated for such applications.
The sensor network itself consists of one or more base stations that may be able to communicate among
each other by some high-bandwidth link (e.g., IEEE 802.11). The base stations furthermore communicate
with sensor nodes over a low-bandwidth link. As not all sensor nodes can communicate directly with the
base station, multi-hop communication is used in the sensor network to relay queries or commands sent
(Legend: sensor nodes with low-power radio links; base stations with high-bandwidth radio links; the sensor network connected via the Internet to a classical infrastructure.)

FIGURE 39.1 A general sensor network topology example.
by the base station to all sensors, as well as to send back the answers from sensor nodes to the base station.
If multiple sensors contribute to one query, partial results may be aggregated as they flow toward the base
station. In order to communicate results or report events to an application residing outside the sensor
network, one or more base stations may be connected to a classical infrastructure network.
As the above description already points out, there are significant differences between wireless sensor and
so-called ad hoc networks, to which they are often compared. Both types of networks can be differentiated
more specifically by considering the following characteristics [3]:

Sensor networks show distinctive application-specific characteristics; for example, depending on
its application, a sensor network might be very sparse or dense.
The interaction of the network with its environment may cause rather bursty traffic patterns.
Consider, for example, a sensor network deployed for detecting/predicting earthquakes or for fire
detection. Most of the time, there will be little traffic, but if an incident happens the traffic load
will increase heavily.
The scale of sensor networks is expected to vary between tens and thousands of sensors.
Energy is even more scarce than in ad hoc networks, as sensors will be either battery powered or
powered by environmental phenomena (e.g., vibration).
Self-configurability will be an important feature of sensor networks. While this requirement also
exists for ad hoc networks, its importance is even higher in sensor networks, as, for example, human
interaction during configuration might be prohibitive, the geographic position of sensor nodes has
to be learnt, etc.
Regarding dependability and Quality-of-Service (QoS), classical QoS notions such as throughput,
jitter, etc., are of little interest in sensor networks, as the main requirement in such networks is the
plain delivery of requested information, and most envisaged applications only pose low-bandwidth
requirements.
As sensor networks follow a data-centric model, sensor identities are of little interest, and new
addressing schemes, for example, based on semantics or geography, are more appealing.
The required simplicity of sensor nodes in terms of operating system, networking software, memory
footprint, etc., is much more constraining than in ad hoc networks.
So far, we have mainly described sensor networks according to their intrinsic characteristics, and regarding
their security, we have only stated that principally the same security objectives need to be met as in other
types of networks. This leads to the question: what makes security in sensor networks a genuine area of
network security research?

To give a short answer, there are three main reasons for this. First, sensor nodes are deployed under
particularly harsh conditions from a security point of view, as there will often be a high number of
nodes distributed in a (potentially hostile) geographical area, so that it has to be assumed that at least
some nodes may get captured and compromised by an attacker. Second, the severe resource constraints
of sensor nodes in terms of computation time, memory, and energy consumption demand very
optimized implementations of security services, and also lead to a very unfair power balance between a
potential attacker (e.g., equipped with a notebook) and the defender (a cheap sensor node). Third, the specific
property of sensor networks to aggregate (partial) answers to a request as the information flows from the
sensors toward the base station calls for new approaches to ensuring the authenticity of sensor query
results, as established end-to-end security approaches are not appropriate for this.

Consequently, the following security objectives prove to be challenging in wireless sensor networks:

Avoiding and coping with sensor node compromise. This includes measures to partially hide the location
of sensor nodes at least on the network layer, so that an attacker should ideally not be able to use network
layer information in order to locate specific sensor nodes. Furthermore, sensor nodes should as far as
possible be protected from compromise through tamper-proofing measures, where this is economically
feasible. Finally, as node compromise cannot be ultimately prevented, other sensor network security
mechanisms should degrade gracefully in case of single node compromises.
Maintaining availability of sensor network services. This requires a certain level of robustness against
so-called DoS attacks, protection of sensor nodes from malicious energy draining, and ensuring the correct
functioning of message routing.
Ensuring confidentiality and integrity of data. Data retrieved from sensor networks should be protected
from eavesdropping and malicious manipulation. In order to attain these goals in sensor networks, both
efficient cryptographic algorithms and protocols as well as an appropriate key management are required,
and furthermore the specific communication pattern of sensor networks (including data aggregation) has
to be taken into account.
In the following sections, we will discuss these challenges in more detail and present first approaches that
have been proposed to meet them.
39.2 DoS and Routing Security
Denial of Service (DoS) attacks aim at denying or degrading a legitimate user's access to a service or network
resource, or at bringing down the servers offering such services themselves.
From a high-level point of view, DoS attacks can be classified into the two categories resource destruction
and resource allocation. In a more detailed examination, the following DoS attacking techniques can be
identified:
1. Disabling services by:
   - Breaking into systems (hacking)
   - Making use of implementation weaknesses such as buffer overruns
   - Deviation from proper protocol execution
2. Resource depletion by causing:
   - Expensive computations
   - Storage of state information
   - Resource reservations (e.g., bandwidth)
   - High traffic load (requires high overall bandwidth from the attacker)
Generally speaking, these attacking techniques can be applied to protocol processing functions at different
layers of the protocol architecture of communication systems. While some of the attacking techniques can
be defended against by a combination of established means of good system management, software engineering,
monitoring, and intrusion detection, the attacking techniques of protocol deviation and resource depletion
require dedicated analysis for specific communication protocols.
In sensor networks, two aspects raise specific DoS concerns: first, breaking into sensor nodes is facilitated
by the fact that it might be relatively easy for an attacker to physically capture and manipulate some of the
sensor nodes distributed in an area, and second, energy is a very scarce resource in sensor nodes, so any
opportunity for an attacker to cause a sensor node to wake up and perform some processing functions is
a potential DoS vulnerability.
In 2002, Wood and Stankovic [4] published an article on DoS threats in sensor networks in which they
mainly concentrated on protocol functions of the first four Open System Interconnection (OSI) layers.
Table 39.1 gives an overview of their findings and the potential countermeasures proposed.
TABLE 39.1 DoS Threats in Wireless Sensor Networks [4]

Network layer  Attacks             Countermeasures
Physical       Tampering           Tamper-proofing, hiding
               Jamming             Spread-spectrum, priority messages,
                                   lower duty cycle, region mapping,
                                   mode change
Link           Collision           Error-correcting code
               Exhaustion          Rate limitation
               Unfairness          Small frames
Network        Neglect and greed   Redundancy, probing
               Homing              Encryption (only partial protection)
               Misdirection        Egress filtering, authorization, monitoring
               Black holes         Authorization, monitoring, redundancy
Transport      Flooding            Client puzzles
               Desynchronization   Data origin authentication

On the physical layer, jamming of the wireless communication channel represents the principal attacking
technique. Spread-spectrum techniques are by nature more resistant against this kind of attack, but
nevertheless cannot guarantee the availability of physical layer services. In case the bandwidth available
in an area is reduced by a DoS attack, giving priority to more important messages could help to maintain
at least basic operations of a sensor network. While jamming mainly disturbs the ability of sensor
nodes to communicate, it has a second DoS-relevant side effect: as a consequence of worse channel
conditions, sensor nodes need more energy to exchange messages. Depending on the protocol implementation,
this could even lead to energy exhaustion of some nodes, if they tirelessly tried to send their messages
instead of waiting for better channel conditions. Therefore, from a DoS avoidance point of view, lower
duty cycles could be a beneficial protocol reaction to bad channel conditions. Furthermore, the routing
protocol (see also later) should avoid directing messages into jammed areas, and ideally, cooperating sensor
nodes located at the edge of a jammed area could collaborate to map jamming reports and reroute traffic
around this area. If sensor nodes possess multiple modes of communication (e.g., wireless and infrared
communications), changing the mode is also a potential countermeasure. Finally, even if not directly
related to communications, capturing and tampering with sensor nodes can also be classified as a physical
layer threat. Tamper-proofing of nodes is one obvious measure to avoid further damage resulting from
misuse of captured sensor nodes. A traditional preventive measure to at least render capturing of nodes
more difficult is to hide them.
On the link layer, Wood and Stankovic identify (malicious) collisions and unfairness as potential threats
and propose as classical measures the use of error-correcting codes and small frames. While one could argue
that both threats (and respective countermeasures) are not actually security specific but also known as
conventional problems (and strategies for overcoming them), their deliberate exploitation for DoS attacks
could nevertheless lead to temporary unavailability of communication services, and ultimately to exhaustion
of sensor nodes. For the latter threat, the authors propose rate limitation as a potential countermeasure
(basically the same idea as the lower duty cycle mentioned in the physical layer discussion).
Considering the network layer, threats can be further subdivided into forwarding- and routing-related
threats. Regarding forwarding, the main threats are neglect and greed, that is, sensor nodes that might only
be interested in getting their own packets transferred in the network without correctly participating in the
forwarding of other nodes' packets. Such behavior could potentially be detected by the use of probing
packets and circumvented by using redundant communication paths. However, both measures increase
the network overhead and thus do not come for free. If packets contain the geographical position of nodes
in cleartext, this could be exploited by an attacker for homing in on (locating) specific sensor nodes in order to
physically capture and compromise them. As a countermeasure against this threat, Wood and Stankovic
propose encryption of message headers and content between neighboring nodes. Regarding routing-related
threats, deliberate misdirection of traffic could lead to a higher traffic load, as a consequence to
higher energy consumption in a sensor network, and potentially also to unreachability of certain network
parts. Potential countermeasures against this threat are egress filtering, that is, checking the direction
in which messages will be routed, authorization verification of routing-related messages, monitoring of
the routing and forwarding behavior of nodes by neighboring nodes, and redundant routing of messages over
multiple paths that in the ideal case do not share common intermediate nodes. The same countermeasures
can also be applied in order to defend against so-called black hole attacks, in which one node or part of the
network attracts a high amount of traffic (e.g., by announcing short routes to the base station) but does
not forward this traffic.
On the transport layer, the threats of flooding with connection requests and desynchronization of sequence
numbers are identified in Reference 4. Both attack techniques are known from classical Internet
communications and might also be applied to sensor networks, in case such networks are going to
make use of transport layer connections. Established countermeasures to defend against them are so-called client
puzzles [5] and authentication of communication partners.
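The text only names client puzzles as a countermeasure; a common instantiation (assumed here for illustration, not taken from Reference 5) is a hash-preimage puzzle: before allocating connection state, the server sends a nonce and a difficulty d, and the client must find a value x such that H(nonce | x) starts with d zero bits. Verification costs the server a single hash, while solving costs the client roughly 2^d hashes on average:

```python
import hashlib
import itertools

def leading_zero_bits(digest: bytes) -> int:
    """Count the number of leading zero bits in a byte string."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
        else:
            return bits + (8 - byte.bit_length())
    return bits

def solve_puzzle(nonce: bytes, d: int) -> int:
    """Client side: brute-force an x so that H(nonce || x) has d leading zero bits."""
    for x in itertools.count():
        h = hashlib.sha256(nonce + x.to_bytes(8, 'big')).digest()
        if leading_zero_bits(h) >= d:
            return x

def check_puzzle(nonce: bytes, d: int, x: int) -> bool:
    """Server side: a single hash suffices to verify the solution."""
    h = hashlib.sha256(nonce + x.to_bytes(8, 'big')).digest()
    return leading_zero_bits(h) >= d
```

The asymmetry (cheap verification, tunable solving cost) is what makes the construction attractive against flooding: the server commits no state until the client has demonstrably spent work.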
Recapitulating the given discussion, it can be seen that especially the network layer exhibits severe DoS
vulnerabilities and proves to be the most interesting layer for potential attackers interested in degrading
the availability of sensor network services. This is mostly owing to the fact that the essential
forwarding and routing functionality is realized in this layer, so that an attacker can cause significant damage with
rather moderate means (e.g., in comparison to jamming a large area). In the following, we will, therefore,
further elaborate on this layer and at the same time extend our discussion to general threats on forwarding
and routing functions, including attacks beyond pure DoS interests.
In Reference 6, Karlof and Wagner give an overview of attacks and countermeasures regarding secure
routing in wireless sensor networks. From a high-level point of view, they identify the following threats:
- Insertion of spoofed, altered, or replayed routing information with the aim of constructing loops,
attracting or repelling traffic, etc.
- Forging of acknowledgments, which may trick other nodes into believing that a link or node is either
dead or alive when in fact it is not.
- Selective forwarding, which may be realized either "in path" or "beneath path" by deliberate jamming,
and which allows an attacker to control what information is forwarded and what information is suppressed.
- Creation of so-called sinkholes, that is, attracting traffic to a specific node, for example, to prepare
selective forwarding.
- Simulating multiple identities (Sybil attacks), which allows an attacker to reduce the effectiveness of
fault-tolerant schemes like multi-path routing.
- Creation of so-called wormholes by tunneling messages over alternative low-latency links, for
example, to confuse the routing protocol, create sinkholes, etc.
- Sending of so-called hello floods (more precisely: hello shouting), in which an attacker sends or
replays a routing protocol's hello packets with more energy in order to trick other nodes into the
belief that they are neighbors of the sender of the received messages.
In order to give an example of such attacks, Figure 39.2 [7] illustrates the construction of a breadth-first
spanning tree, and Figure 39.3 [6] shows the effect of two attacks on routing schemes that use the
breadth-first search tree idea to construct their forwarding tables.
One example of a sensor network operating system that builds a breadth-first spanning tree rooted
at the base station is TinyOS. In such networks, an attacker disposing of one or two laptops can either
send out forged routing information or launch a wormhole attack. As can be seen in Figure 39.3, both
attacks lead to entirely different routing trees and can be used to prepare further attacks such as selective
forwarding.
In order to defend against the abovementioned threats, Karlof and Wagner discuss various methods.
Regarding forging of routing information or acknowledgments, data origin authentication and confidentiality
of link layer PDUs (Protocol Data Units) can serve as an effective countermeasure. While the first
naive approach of using a single group key for this purpose exhibits the rather obvious vulnerability that
a single node compromise would result in complete failure of the security, a better, still straightforward
approach is to let each node share a secret key with a base station and to have base stations act as trusted
third parties in key negotiation (e.g., using the Otway-Rees protocol [8]).
Combined with an appropriate key management, the abovementioned link layer security measures
could also limit the threat potential of the attack of simulating multiple identities: by reducing the
number of neighbors a node is allowed to have, for example, through enforcement during key distribution,
authentic sensor nodes could be protected from accepting too many neighborhood relations. Additionally,
by keeping track of authentic identities and associated keys, the ability of potentially compromised nodes
to simulate multiple identities could be restricted. However, the latter idea requires some kind of global
knowledge that can often only be realized efficiently by a centralized scheme which actively involves a
base station in the key distribution protocol.

FIGURE 39.2 Breadth-first search: (a) BS sends beacon, (b) first answers to beacon, (c) answers to first answers, and
(d) resulting routing tree.

FIGURE 39.3 Attacks on breadth-first search routing: example routing tree, forging routing updates, wormhole attack.
When it comes to hello shouting and wormhole/sinkhole attacks, however, pure link layer security
measures cannot provide sufficient protection, as they cannot completely protect against replay attacks.
Links should, therefore, be checked in both directions before making routing decisions in order to defend
against simple hello shouting attacks. Detection of wormholes actually proves to be difficult, and a first
approach to this problem requires rather tight clock synchronization [9]. Sinkholes might be avoided by
deploying routing schemes like geographical routing that do not rely on constructing forwarding tables
according to the distance to the destination measured in hops. Selective forwarding attacks might be countered
with multi-path routing. However, this requires redundancy in the network and results in higher network
overhead.
39.3 Energy Efficient Confidentiality and Integrity
The preceding discussion of potential countermeasures against DoS attacks and general attacks on routing
in wireless sensor networks has shown that the security services confidentiality and integrity prove to be
valuable mechanisms against various attacks. Obviously, they are also effective measures to protect application
data (e.g., commands and sensor readings) against unauthorized eavesdropping and manipulation,
respectively. In this section, we will therefore examine their efficient implementation in resource-restricted
sensor networks.
In their paper, SPINS: Security Protocols for Sensor Networks, Perrig et al. [10] discuss the requirements
and propose a set of protocols for realizing efficient security services for sensor networks. The main challenges
in the design of such protocols arise out of tight implementation constraints in terms of instruction
set, memory, CPU speed, a very small energy budget in low-powered devices, and the fact that some
nodes might get compromised. These constraints rule out some well-established alternatives: asymmetric
cryptography [11-13] is generally considered to be too expensive, as it results in high computational cost
and long ciphertexts and signatures (sending and receiving is very expensive!). Especially, public key
management based on certificates exceeds the sensor nodes' energy budget, and key revocation is almost
impossible to realize under the restricted conditions in sensor networks. Even symmetric cryptography
implementation turns out to be nonstraightforward owing to architectural limitations and energy constraints.
Furthermore, the key management for authenticating broadcast-like communications calls for
new approaches, as simple distribution of one symmetric group key among all receivers would not make
it possible to cope with compromised sensor nodes.
Perrig et al. therefore propose two main security protocols:
- The Sensor Network Encryption Protocol (SNEP) for realizing efficient end-to-end security between
nodes and base stations.
- A variant of the Timed Efficient Stream Loss-Tolerant Authentication Protocol (TESLA), called
µTESLA, for authenticating broadcast communications, which will be further discussed in
Section 39.4.
The main goal in the development of SNEP was the efficient realization of end-to-end security services for
two-party communication. SNEP provides the security services data confidentiality, data origin authentication,
and replay protection. The considered communication patterns are node to base station (e.g., sensor
readings) and base station to individual nodes (e.g., specific requests). Securing messages from a base
station to all nodes (e.g., routing beacons, queries, reprogramming of the entire network) is the task of
the µTESLA protocol to be discussed in Section 39.4. The main design decisions in the development of
SNEP were to avoid the use of asymmetric cryptography, to construct all cryptographic primitives out
of a single block cipher, and to exploit common state in order to reduce communication overhead where
this is possible.
SNEP's basic trust model assumes that two communicating entities A and B share a common master
key K_{A,B}. Initially, the base station shares a master key with all nodes, and node-to-node master keys can
be negotiated with the help of the base station (see later). From such a master key, two confidentiality keys
CK_{A,B}, CK_{B,A} (one per direction), two integrity keys IK_{A,B}, IK_{B,A}, and a random seed RK_{A,B} are derived
according to the following equations:

    CK_{A,B} = F_{K_{A,B}}(1)    (39.1)
    CK_{B,A} = F_{K_{A,B}}(2)    (39.2)
    IK_{A,B} = F_{K_{A,B}}(3)    (39.3)
    IK_{B,A} = F_{K_{A,B}}(4)    (39.4)
    RK_{A,B} = F_{K_{A,B}}(5)    (39.5)

where F is a key derivation function built from RC5-CBC, as described later in this section, applied to
distinct constant inputs so that each derived key is independent.
The principal cryptographic primitive of SNEP is the RC5 algorithm [14]. Three parameters of this
algorithm can be configured: the word length w [bit], the number of rounds r, and the key size b [byte];
the resulting instantiation of the algorithm is denoted as RC5-w/r/b. What makes RC5 specifically
suitable for implementation in sensor nodes is the fact that it can be programmed with a few lines of code
and that the main algorithm only makes use of three simple and efficient instructions: two's complement
addition + of words (mod 2^w), bit-wise XOR ⊕ of words, and cyclic rotation <<<. Figure 39.4 illustrates
the encryption function. The corresponding decryption function can be easily obtained by basically
reading the code in reverse. Prior to en- or decryption with RC5, an array S[0..2r+1] has to be filled by
a key preparation routine that is somewhat more involved, but also uses only simple instructions.

// Algorithm: RC5 Encryption
// Input:  A, B = plaintext stored in two words
//         S[0..2r+1] = an array filled by a key setup procedure
// Output: A, B = ciphertext stored in two words
A := A + S[0];
B := B + S[1];
for i := 1 to r
    A := ((A ⊕ B) <<< B) + S[2i];
    B := ((B ⊕ A) <<< A) + S[2i+1];

FIGURE 39.4 The RC5 encryption algorithm.

TABLE 39.2 Plaintext Requirements for Differential Attacks on RC5

Number of rounds                          4     6     8     10    12    14    16
Differential attack (chosen plaintext)    2^7   2^16  2^28  2^36  2^44  2^52  2^61
Differential attack (known plaintext)     2^36  2^41  2^47  2^51  2^55  2^59  2^63
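As a concrete companion to Figure 39.4, the following is a minimal pure-Python sketch of RC5 with w = 32, including the key setup routine mentioned in the text and the decryption loop obtained by reading the rounds in reverse. It is meant for illustration only, not as a vetted cryptographic implementation.

```python
W = 32                      # word size w in bits
MASK = (1 << W) - 1         # word arithmetic is mod 2^w
P32, Q32 = 0xB7E15163, 0x9E3779B9  # RC5 "magic" constants for w = 32

def _rotl(x, n):
    n %= W
    return ((x << n) | (x >> (W - n))) & MASK

def _rotr(x, n):
    n %= W
    return ((x >> n) | (x << (W - n))) & MASK

def key_setup(key: bytes, r: int):
    """Expand a b-byte key into the array S[0 .. 2r+1]."""
    c = max(1, (len(key) + 3) // 4)
    L = [0] * c
    for i, byte in enumerate(key):          # load key bytes little-endian
        L[i // 4] |= byte << (8 * (i % 4))
    S = [(P32 + i * Q32) & MASK for i in range(2 * r + 2)]
    A = B = i = j = 0
    for _ in range(3 * max(len(S), c)):     # mix key material into S
        A = S[i] = _rotl((S[i] + A + B) & MASK, 3)
        B = L[j] = _rotl((L[j] + A + B) & MASK, (A + B) & MASK)
        i, j = (i + 1) % len(S), (j + 1) % c
    return S

def encrypt(pt, S, r):
    """One RC5 block: pt is a pair of words, exactly as in Figure 39.4."""
    A = (pt[0] + S[0]) & MASK
    B = (pt[1] + S[1]) & MASK
    for i in range(1, r + 1):
        A = (_rotl(A ^ B, B) + S[2 * i]) & MASK
        B = (_rotl(B ^ A, A) + S[2 * i + 1]) & MASK
    return A, B

def decrypt(ct, S, r):
    """Inverse of encrypt: undo the rounds in reverse order."""
    A, B = ct
    for i in range(r, 0, -1):
        B = _rotr((B - S[2 * i + 1]) & MASK, A) ^ A
        A = _rotr((A - S[2 * i]) & MASK, B) ^ B
    return ((A - S[0]) & MASK, (B - S[1]) & MASK)
```

A round trip with RC5-32/12/16, i.e., `decrypt(encrypt(pt, S, 12), S, 12)`, returns the original plaintext pair.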
Regarding the security of the RC5 algorithm, Kaliski and Yin [15] reported in 1998 that the best known
attacks against RC5 with a blocklength of 64 bit have plaintext requirements as listed in Table 39.2.
According to the information given in Reference 10 (RAM requirements, etc.), Perrig et al. seem to
plan for RC5 with 8 rounds and 32-bit words (leading to a blocklength of 64 bit), so that a differential
cryptanalysis attack would require about 2^28 chosen plaintexts or about 2^47 known plaintexts, and CPU
effort in the same order of magnitude. Taking into account progress in PC technology, this should be
considered on the edge of being secure (if an attacker can collect that many plaintexts). Nevertheless, by
increasing the number of rounds, the required effort could be raised to 2^61 or 2^63, respectively. Even higher
security requirements can in principle only be ensured by using a block cipher with a larger block size.
In SNEP, encryption of messages is performed by using the RC5 algorithm in an operational mode
called counter mode, which XORs the plaintext with a pseudo-random bit sequence that is generated by
encrypting increasing counter values (see also Figure 39.5). The encryption of message Msg with key K
and counter value Counter is denoted as: {Msg}_{K, Counter}.
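The counter-mode construction can be sketched as follows. Since a full RC5 is not needed to show the mode itself, the block encryption below is a hash-based stand-in (an assumption for illustration; SNEP uses RC5 here). Note that decryption is the same operation as encryption, because XORing with the keystream twice cancels out:

```python
import hashlib

BLOCK = 8  # keystream block size in bytes

def block_encrypt(key: bytes, counter: int) -> bytes:
    # Stand-in for RC5 encryption of the counter value, used only so that
    # this sketch is self-contained.
    return hashlib.sha256(key + counter.to_bytes(8, 'big')).digest()[:BLOCK]

def ctr_crypt(key: bytes, counter: int, data: bytes) -> bytes:
    """XOR data with the keystream E_K(Counter), E_K(Counter+1), ...
    The same call encrypts and decrypts."""
    out = bytearray()
    for off in range(0, len(data), BLOCK):
        keystream = block_encrypt(key, counter)
        chunk = data[off:off + BLOCK]
        out += bytes(a ^ b for a, b in zip(chunk, keystream))
        counter += 1                        # next counter, next keystream block
    return bytes(out)
```

A practical property visible in the sketch: the ciphertext has exactly the length of the plaintext, which matters when every transmitted byte costs energy.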
For computing Message Authentication Codes (MACs), SNEP uses the well-established Cipher Block
Chaining Message Authentication Code (CBC-MAC) construction. This mode encrypts each plaintext
block P_1, ..., P_n with an integrity key IK, XORing the ciphertext of the last encryption result C_{i-1} with
the plaintext block P_i prior to the encryption step. The result of the last encryption step is then taken as
the message authentication code (see also Figure 39.6).
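The chaining described above can be sketched in a few lines. Again, a hash-based stand-in replaces the RC5 block encryption, and the zero-padding is a naive illustrative choice (plain CBC-MAC is only secure for fixed-length messages, which is acceptable in SNEP's fixed-format setting):

```python
import hashlib

BLOCK = 8  # block size in bytes

def block_encrypt(ik: bytes, block: bytes) -> bytes:
    # Stand-in for RC5 encryption of one block under the integrity key IK.
    return hashlib.sha256(ik + block).digest()[:BLOCK]

def cbc_mac(ik: bytes, msg: bytes) -> bytes:
    """CBC-MAC: chain each block into the next encryption; the last
    ciphertext block is the MAC."""
    if len(msg) % BLOCK:                          # naive zero padding
        msg += b'\x00' * (BLOCK - len(msg) % BLOCK)
    c = b'\x00' * BLOCK                           # initial chaining value
    for off in range(0, len(msg), BLOCK):
        p_i = msg[off:off + BLOCK]
        x = bytes(a ^ b for a, b in zip(c, p_i))  # C_{i-1} XOR P_i
        c = block_encrypt(ik, x)                  # C_i
    return c                                      # MAC = last block
```

Because every block feeds into the next, flipping any message bit changes the final block, so the MAC detects manipulation anywhere in the message.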
Depending on whether encryption of message data is required or not, SNEP offers two message
formats:

1. The first format appends an RC5-CBC-MAC computed with the integrity key IK_{A,B} over the
message data:

       A → B: Msg | RC5-CBC(IK_{A,B}, Msg)
FIGURE 39.5 Encryption in counter mode.
FIGURE 39.6 Computing a MAC in cipher block chaining mode.
2. The second format encrypts the message and appends a MAC in whose computation the counter
value is also included:

       A → B: {Msg}_{CK_{A,B}, Counter} | RC5-CBC(IK_{A,B}, Counter, {Msg}_{CK_{A,B}, Counter})

Please note that the counter value itself is not transmitted in the message, so that common state
between sender and receiver is exploited in order to save transmission energy and bandwidth.
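The flow of the second message format can be sketched end-to-end. The hash-based `prf` below stands in for both the RC5-based encryption and the RC5-CBC MAC (an assumption for illustration, not SNEP's actual primitives); what the sketch does show faithfully is that the MAC covers the counter and the ciphertext, that the counter itself is never transmitted, and that a receiver whose counter is out of step rejects the message:

```python
import hashlib

def prf(key: bytes, data: bytes) -> bytes:
    # Illustrative stand-in for the RC5-based primitives.
    return hashlib.sha256(key + data).digest()[:8]

def ctr_crypt(ck: bytes, counter: int, data: bytes) -> bytes:
    out = bytearray()
    for off in range(0, len(data), 8):
        ks = prf(ck, counter.to_bytes(8, 'big'))
        out += bytes(a ^ b for a, b in zip(data[off:off + 8], ks))
        counter += 1
    return bytes(out)

def mac(ik: bytes, counter: int, ct: bytes) -> bytes:
    # The MAC covers the counter and the ciphertext, as in format 2.
    return prf(ik, counter.to_bytes(8, 'big') + ct)

def snep_send(ck, ik, counter, msg):
    ct = ctr_crypt(ck, counter, msg)
    return ct, mac(ik, counter, ct)          # the counter is NOT transmitted

def snep_recv(ck, ik, counter, ct, tag):
    if mac(ik, counter, ct) != tag:          # verify before decrypting
        raise ValueError("MAC mismatch (wrong counter or tampering)")
    return ctr_crypt(ck, counter, ct)
```

Rejecting a message whose implied counter does not match is exactly the property that gives SNEP its (partial) replay protection, discussed next.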
Furthermore, random numbers are generated by encrypting a (different) counter, and the RC5-CBC
construction is also used for key derivation, as the key deriving function mentioned above is realized as:

    F_{X_{A,B}}(n) := RC5-CBC(X_{A,B}, n)
In order to be able to successfully decrypt a message, the receiver's decryption counter needs to be
synchronized with the sender's encryption counter. An initial counter synchronization can be achieved
by the following protocol, in which the two entities A and B communicate their individual encryption
counter values C_A and C_B to the other party, and authenticate both values by exchanging two MACs
computed with their integrity keys IK_{A,B} and IK_{B,A}, respectively:

    A → B: C_A
    B → A: C_B | RC5-CBC(IK_{B,A}, C_A, C_B)
    A → B: RC5-CBC(IK_{A,B}, C_A, C_B)
In case of a message loss, counters get out of synchronization. By trying out a couple of different counter
values, a few message losses can be tolerated. However, as this consumes energy, after trying out a couple
of succeeding values, an explicit resynchronization dialog is initiated by the receiver A of a message. The
dialog consists of sending a freshly generated random number N_A to B, who answers with his current
counter C_B and a MAC computed with his integrity key over both the random number and the counter
value:

    A → B: N_A
    B → A: C_B | RC5-CBC(IK_{B,A}, N_A, C_B)
As encrypted messages are only accepted by a receiver if the counter value used in their MAC computation
is higher than the last accepted value, the implementation of the confidentiality service in SNEP to a certain
degree also provides replay protection. If for a specific request Req an even tighter time synchronization is
needed, the request can also contain a freshly generated random number N_A that will be included in the
computation of the MAC of the answer message containing the response Rsp:

    A → B: N_A, Req
    B → A: {Rsp}_{CK_{B,A}, C_B} | RC5-CBC(IK_{B,A}, N_A, C_B, {Rsp}_{CK_{B,A}, C_B})
In order to establish a shared secret SK_{A,B} between two sensor nodes A and B with the help of the base
station BS, SNEP provides the following protocol:

    A → B:  N_A | A
    B → BS: N_A | N_B | A | B | RC5-CBC(IK_{B,BS}, N_A | N_B | A | B)
    BS → A: {SK_{A,B}}_{K_{BS,A}} | RC5-CBC(IK_{BS,A}, N_A | B | {SK_{A,B}}_{K_{BS,A}})
    BS → B: {SK_{A,B}}_{K_{BS,B}} | RC5-CBC(IK_{BS,B}, N_B | A | {SK_{A,B}}_{K_{BS,B}})
In this protocol, A first sends a random number N_A and his name to B, who in turn sends both values
together with his own random number N_B and name B to the base station. The base station then generates
a session key SK_{A,B} and sends it to both sensor nodes in two separate messages, which are encrypted with
the respective key the base station shares with each node. The random numbers N_A and N_B allow both
sensor nodes to verify the freshness of the returned message and the key contained in it.
Regarding the security properties of this protocol, however, it has to be remarked that in a strict sense the
protocol as formulated in Reference 10 allows neither A nor B to perform concurrent key negotiations
with multiple entities, as in such a case they would not be able to securely relate the answers to the correct
protocol run (please note that the name of the peer entity is not transmitted in the returned message
but only included in the MAC computation). Furthermore, neither A nor B knows if the other party
received the key and trusts in its suitability, which is commonly regarded as an important objective of a
key management protocol [16]. Finally, the base station cannot deduce anything about the freshness of
messages and can therefore not differentiate between fresh and replayed requests for a session key.
39.4 Authenticated Broadcast
Authenticated broadcast is required if one message needs to be sent to all (or many) nodes in a sensor
network, and the sensor nodes have to be able to verify the authenticity of the message. Examples for
this communication pattern are authenticated query messages, routing beacon messages, or commands
to reprogram an entire network. As it has to be ensured that recipients of such a message are not
able to make use of their verifying key for forging authenticated messages, an asymmetric mechanism has
to be deployed. Classical asymmetric cryptography, however, is considered to be too expensive in terms of
computation, storage, and communication requirements for sensor nodes.
One basic idea for obtaining asymmetry while at the same time deploying a symmetric cryptographic
algorithm is to send a message that has been authenticated with a key K_i and to disclose this key at a later
point in time, so that the authenticity of the message can be verified. Of course, from the moment in
which the key disclosure message has been sent, a potential attacker could use this key to create MACs for
forged messages. Therefore, it is important that all receivers have at least loosely synchronized clocks and
only use a key K_i to verify messages that have been received before the key disclosure message was sent.
However, it must also be ensured that a potential attacker cannot succeed in tricking genuine nodes into
accepting bogus authentication keys generated by himself. One elegant way to achieve this is the inverse
use of a chain of hash codes for obtaining integrity keys, basically a variation of the so-called one-time
password idea [17].
The TESLA protocol uses a reversed chain of hash values to authenticate broadcast data streams [18].
The µTESLA protocol proposed to be used in sensor networks is a minor variation of the TESLA protocol,
with the basic difference being the cryptographic scheme used to authenticate the initial key. While TESLA
uses asymmetric cryptography for this, µTESLA deploys the SNEP protocol, so that the base station
calculates for each sensor node one individual MAC that authenticates the initial key K_0. Furthermore,
while TESLA discloses the key in every packet, µTESLA discloses the key only once per time interval in
order to reduce protocol overhead, and only base stations authenticate broadcast packets because sensor
nodes are not capable of storing entire key chains.
In order to set up a sender, first the length n of the key chain to be computed is chosen and the last key
of the key chain, K_n, is randomly generated. Second, the entire hash key chain is computed according to
the equation K_{i-1} := H(K_i), stored at the sender, and the key K_0 is communicated and authenticated
to all participating sensor nodes. For this, each sensor node A sends a random number N_A to the base
station, and the base station answers with a message containing its current time T_BS, the currently disclosed
key K_i (in the initial case: i = 0), the time period T_i in which K_i was valid for authenticating messages, the
interval length T_Int, the number of intervals δ the base station waits before disclosing a key, and a MAC
computed with the integrity key IK_{BS,A} over these values:

    A → BS: N_A | A
    BS → A: T_BS | K_i | T_i | T_Int | δ | RC5-CBC(IK_{BS,A}, N_A | T_BS | K_i | T_i | T_Int | δ)
After this preparatory phase, broadcasting authenticated packets is realized as follows:
- Time is divided into uniform-length intervals T_i, and all sensor nodes are loosely synchronized to
the clock of the base station.
- In time interval T_i, the sender authenticates packets with key K_i.
- The key K_i is disclosed in time interval i + δ (e.g., δ = 2).
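The reversed hash-chain mechanism underlying these steps can be sketched in a few lines (SHA-256 and the chain length are illustrative assumptions; the text leaves H unspecified). Note the asymmetry: generating the chain requires the secret K_n, but verifying a disclosed key requires only the authenticated K_0:

```python
import hashlib

def make_key_chain(k_n: bytes, n: int):
    """Compute the chain K_n, K_{n-1} = H(K_n), ..., K_0.
    Returns a list indexed so that chain[i] == K_i."""
    chain = [k_n]
    for _ in range(n):
        chain.append(hashlib.sha256(chain[-1]).digest())
    chain.reverse()                       # chain[0] = K_0, chain[n] = K_n
    return chain

def verify_disclosed_key(k0: bytes, k_i: bytes, i: int) -> bool:
    """A node holding the authenticated K_0 verifies a disclosed K_i by
    hashing it back down the chain: H^i(K_i) must equal K_0."""
    x = k_i
    for _ in range(i):
        x = hashlib.sha256(x).digest()
    return x == k0
```

Because H is one-way, an attacker who observes the disclosed keys K_0, ..., K_i still cannot compute the not-yet-disclosed K_{i+1}, which is what prevents forged disclosure messages.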
Figure 39.7 illustrates this reverse use of the chain of hash values for authenticating packets. In order to
check the authenticity of a received packet, a sensor node first has to store the packet together with T_i and
wait until the respective key has been disclosed by the base station. Upon disclosure of the appropriate
key K_i, the authenticity of the packet can be checked.
Of course, for this scheme to be secure, it is crucial to discard all packets that have been authenticated
with an already disclosed key. This requires at least a loose time synchronization with an appropriate value
of δ that needs to be selected in accordance with the maximum clock drift. However, as nodes cannot
store many packets, key disclosure cannot be postponed for a long time, so the maximum clock drift
should not be too big.

FIGURE 39.7 An example of µTESLA operation.
If a sensor node needs to send a broadcast packet, it sends a SNEP-protected packet to the
base station, which in turn sends an authenticated broadcast packet. The main reason for this is
that sensor nodes do not have enough memory for storing key chains and can, therefore, not authenticate
broadcast packets on their own.
39.5 Alternative Approaches to Key Management
Key management is often said to be the hardest part of implementing secure communications, as on the one
hand legitimate entities need to hold or be able to agree on the required keys, and on the other hand,
a suite of security protocols cannot offer any protection if the keys fall into the hands of an attacker. The
SNEP protocol suite as described in Section 39.3 includes a simple and rather traditional key management
protocol that enables two sensor nodes to obtain a shared secret key with the help of a base station. In this
section, we will treat the subject of key management in more depth and review alternative approaches to it.
Key management comprises of the following tasks [1]:
Key generation is the creation of the keys that are used. This process must be executed in a random
or at least pseudo-random-controlled way, because hackers will otherwise be able to execute the
process themselves and, in a relatively short time, discover the key that was used for security.
Pseudo-random-controlled key generation means that keys are created according to a deterministic
approach, but each possible key has the same probability of being created by the method. Pseudo-random
generators must be initialized with a real random value so that they do not always produce
the same keys. If the process of key generation is not reproducible, it is referred to as really random
key generation.
The task of key distribution consists of deploying generated keys at the places in a system where they
are needed. In simple scenarios the keys can be distributed through direct (e.g., personal) contact.
If larger distances are involved and symmetric encryption algorithms are used, the communication
channel again has to be protected through encryption. Therefore, a key is needed for distributing
keys. This necessity motivates the introduction of what are called key hierarchies.
When keys are stored, measures are needed to make sure that they cannot be read by unauthorized
users. One way to address this requirement is to ensure that the key is regenerated from an easy-to-remember
but sufficiently long password (usually an entire sentence) before each use, and
therefore is only stored in the memory of the respective user. Another possibility for storage is
manipulation-safe crypto-modules, which are available on the market in the form of processor chip
cards at a reasonable price.
Key recovery is the reconstruction of keys that have been lost. The simplest approach is to keep
a copy of all keys in a secure place. However, this creates a possible security problem, because an
absolute guarantee is needed that the copies of the keys will not be tampered with. The alternative
is to distribute the storage of the copies to different locations, which minimizes the risk of fraudulent
use as long as there is an assurance that all parts of the copies are required to reconstruct
the keys.
Key invalidation is an important task of key management, particularly with asymmetric cryptographic
methods. If a private key becomes known, then the corresponding public key needs to be identified
as invalid. In sensor networks, key invalidation is expected to be a quite likely operation, as sensor
nodes may be relatively easy to capture and compromise.
The destruction of no longer required keys is aimed at ensuring that messages ciphered with them
cannot be decrypted by unauthorized persons in the future. It is important to make sure
that all copies of the keys have really been destroyed. In modern operating systems this is not a
trivial task, since storage content is regularly transferred to hard disk through automatic storage
management, and deletion in memory gives no assurance that copies of the keys no longer
exist. In the case of magnetic disk storage devices and so-called EEPROMs (Electrically Erasable
Programmable Read-Only Memory), these have to be overwritten or destroyed more than once to
guarantee that the keys stored on them can no longer be read, even with sophisticated technical
schemes.
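The distinction drawn above between pseudo-random-controlled and really random key generation can be sketched in a few lines. This is an illustration only; the helper names are hypothetical, and HMAC-SHA-256 is merely one possible deterministic expansion function:

```python
import hmac
import secrets
import hashlib

def generate_seed(nbytes: int = 32) -> bytes:
    """Really random generation: draw from the OS entropy pool."""
    return secrets.token_bytes(nbytes)

def expand_keys(seed: bytes, count: int, nbytes: int = 16) -> list:
    """Pseudo-random-controlled generation: deterministic expansion of one
    real random seed; every possible key is equally likely, but the process
    is reproducible from the seed alone."""
    return [hmac.new(seed, i.to_bytes(4, "big"), hashlib.sha256).digest()[:nbytes]
            for i in range(count)]

seed = generate_seed()        # must come from a real random source
keys = expand_keys(seed, 4)   # four distinct 128-bit keys
```

Note that anyone holding the seed can regenerate all derived keys, which is exactly why the initialization value must itself be truly random and kept secret.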
Of the listed tasks, most key management protocols address key distribution and sometimes
also key generation. Approaches to distributing keys in traditional networks, however, do not
work well in wireless sensor networks. Methods based on asymmetric cryptography require very resource-intensive
computations and are, therefore, often judged as not being appropriate for sensor networks.
Arbitrated key management, such as the key management protocol of SNEP,
on the other hand, assumes predetermined keys at least between the base station and the sensor nodes. This
requires predistribution of these keys before deployment of the sensor network and also has some security
implications in case of node compromise.
There are a couple of particular requirements on key management schemes for sensor networks resulting
from their specific characteristics [19]:
Vulnerability of nodes to physical capture and node compromise. Sensor nodes may be deployed in difficult-to-protect
or hostile environments and can therefore fall into the hands of an attacker. Because of
tight cost constraints, nodes will often not be tamper-proof, so that cryptographic keys might be captured
by an attacker. This leads to the requirement that compromise of some nodes and keys should not
compromise the overall network's security (graceful degradation).
Lack of a priori knowledge of deployment configuration. In some applications, sensor networks will be
installed via random scattering (e.g., from an airplane), so that neighborhood relations are not known
a priori. Even with manual installation, preconfiguration of sensor nodes would be expensive in large networks.
This leads to the requirement that sensor network key management should support automatic
configuration after installation.
Resource restrictions. As mentioned earlier, nodes of a sensor network only possess limited memory
and computing resources, as well as very limited bandwidth and transmission power. This puts tight
constraints on the design of key management procedures.
In-network processing. Over-reliance on a base station as the source of trust may result in inefficient communication
patterns (cf. data aggregation in Section 39.6). It also turns base stations into attractive
targets (which they are in any case!). Therefore, centralized approaches like the key management protocol
of SNEP should be avoided.
Need for later addition of sensor nodes. Compromise, energy exhaustion, or limited material/calibration
lifetime may make it necessary to add new sensors to an existing network. However, legitimate nodes that
have been added to a sensor network should be able to establish secure relationships with existing nodes.
Erasure of master keys after initial installation (cf. the LEAP approach described later) does not allow this.
In the following, we will describe two new alternatives to traditional key management approaches that
have been proposed for sensor networks: the neighborhood-based initial key exchange protocol Localized
Encryption and Authentication Protocol (LEAP), and the approach of probabilistic key distribution.
LEAP [20] enables the automatic and efficient establishment of security relationships in an initialization
phase after installation of the nodes. It supports key establishment for various trust relationships
between:
Base station and sensor with so-called individual keys
Sensors that are direct neighbors with pairwise shared keys
Sensors that form a cluster with cluster keys
All sensors of a network with a group key
In order to establish individual keys prior to deployment, every sensor node u is preloaded with an
individual key K^m_u known only to the node and the base station. The base station s generates these keys
from a master key K^m_s and the node identity u according to the equation K^m_u := f(K^m_s, u). Generating
all node keys from one master key is supposed to save memory at the base station, as the individual keys
need not be stored at the base station but can be generated on-the-fly when they are needed.
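The on-the-fly derivation of individual keys can be sketched as follows. LEAP only requires f to be a pseudo-random function; instantiating it with HMAC-SHA-256 is an assumption of this sketch:

```python
import hmac
import hashlib

def f(key: bytes, data: bytes) -> bytes:
    """Pseudo-random function f, instantiated here as HMAC-SHA-256."""
    return hmac.new(key, data, hashlib.sha256).digest()

master_key = bytes(32)      # K^m_s, held only by the base station (placeholder value)

def individual_key(node_id: bytes) -> bytes:
    """K^m_u := f(K^m_s, u) -- regenerated whenever needed, never stored."""
    return f(master_key, node_id)

# The base station recomputes the same key each time node u communicates:
k1 = individual_key(b"node-42")
k2 = individual_key(b"node-42")
```

Because the derivation is deterministic, the base station trades a small amount of computation per message for not having to store one key per node.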
In scenarios in which pairwise shared keys cannot be preloaded into sensor nodes because of installation
by random scattering, but neighboring relationships remain static after installation, LEAP provides for a
simple key establishment procedure for neighboring nodes. For this, it is assumed that there is a minimum
time interval T_min during which a node can resist attacks. After being scattered in the field, sensor
nodes establish neighboring relations during this time interval based on an initial group key K_I that has
been preconfigured into all sensor nodes before deployment. First, every node u computes its master key
K_u = f(K_I, u). Then, every node discovers its neighbors by sending a message with its identity u and a
random number r_u and collecting the answers:

u → *: u | r_u
v → u: v | MAC(K_v, r_u | v)

As u can also compute K_v, it can directly check this MAC, and both nodes compute the common shared
secret K_u,v := f(K_v, u). After expiration of the timer T_min, all nodes erase the initial group key K_I and all
computed master keys, so that only the pairwise shared keys are kept. This scheme can be augmented with
all nodes also forwarding the identities of their neighbors, enabling a node to compute pairwise shared
keys with nodes that are one hop away.
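The discovery handshake and pairwise key derivation can be sketched as follows, with both f and the MAC instantiated as HMAC-SHA-256 (an assumption of this sketch; LEAP only requires a pseudo-random function and a secure MAC):

```python
import hmac
import hashlib
import secrets

def f(key: bytes, data: bytes) -> bytes:
    return hmac.new(key, data, hashlib.sha256).digest()

K_I = secrets.token_bytes(16)          # initial group key, preconfigured in all nodes

def master_key(node_id: bytes) -> bytes:
    return f(K_I, node_id)             # K_v = f(K_I, v)

# u broadcasts (u, r_u); neighbor v answers (v, MAC(K_v, r_u | v)):
u, v = b"u", b"v"
r_u = secrets.token_bytes(8)
answer_mac = f(master_key(v), r_u + v)

# u recomputes K_v from K_I and verifies the answer ...
assert hmac.compare_digest(answer_mac, f(master_key(v), r_u + v))

# ... then both sides derive the pairwise key K_u,v := f(K_v, u):
K_uv_at_u = f(master_key(v), u)        # u computes K_v itself from K_I
K_uv_at_v = f(master_key(v), u)        # v uses its own master key directly
# After T_min, both nodes erase K_I and all master keys, keeping only K_u,v.
```

The sketch also makes the criticism raised later in this section concrete: any node still holding K_I can derive every master key, so the scheme stands or falls with the erasure of K_I after T_min.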
In order to establish a cluster key with all its immediate neighbors, a node u randomly generates a cluster
key K^c_u and sends it individually encrypted to all neighbors v_1, v_2, ...:

u → v_i: E(K_u,v_i, K^c_u)

All nodes v_i decrypt this message with their pairwise shared key K_u,v_i and store the obtained cluster key.
When a node is revoked, a new cluster key is distributed to all remaining nodes.
If a node u wants to establish a pairwise shared key with a node c that is multiple hops away, it can
do so by using other nodes it knows as proxies. In order to detect suitable proxy nodes v_i, u broadcasts a
query message with its own node id and that of c. Nodes v_i knowing both nodes u and c will answer
this message:

u → *: u | c
v_i → u: v_i

Assuming that node u has received m answers, it then generates m shares sk_1, ..., sk_m of the secret key
K_u,c to be established with c and sends them individually over the appropriate nodes v_i:

u → v_i: E(K_u,v_i, sk_i) | f(sk_i, 0)
v_i → c: E(K_v_i,c, sk_i) | f(sk_i, 0)

The value f(sk_i, 0) allows the nodes v_i and c to verify whether the creator of such a message actually knew the
key share sk_i, as otherwise it would not have been able to compute this value (the function f needs to
be a one-way function for this to be secure). After receiving all values sk_i, node c computes K_u,c :=
sk_1 ⊕ ... ⊕ sk_m.
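The splitting of K_u,c into m shares whose XOR recombines to the key is standard XOR-based secret splitting; a minimal sketch (helper names are illustrative, not from Reference 20):

```python
import secrets
from functools import reduce

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def split(secret: bytes, m: int) -> list:
    """Split a key into m shares sk_1..sk_m with sk_1 ^ ... ^ sk_m == secret."""
    shares = [secrets.token_bytes(len(secret)) for _ in range(m - 1)]
    shares.append(reduce(xor, shares, secret))   # last share fixes the XOR sum
    return shares

K_uc = secrets.token_bytes(16)   # key u wants to establish with c
shares = split(K_uc, 4)          # one share per proxy v_i
recombined = reduce(xor, shares) # what c computes after receiving all shares
```

Since the first m − 1 shares are uniformly random and independent of the secret, an attacker who eavesdrops on any m − 1 of the m proxy paths learns nothing about K_u,c; only compromising all paths reveals the key.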
In order to establish a new group key K_g, the base station s randomly generates a new key and sends it
encrypted with its own cluster key to its neighbors:

s → v_i: E(K^c_s, K_g)

All nodes receiving such a message forward the new group key encrypted with their own cluster key to
their neighbors.
Node revocation is performed by the base station and uses µTESLA. All nodes, therefore, have to be
preloaded with an authentic initial key K_0, and loose time synchronization is needed in the sensor network.
In order to revoke a node u, the base station s broadcasts the following message in time interval T_i using
the µTESLA key K_i valid for that interval:

s → *: u | f(K_g, 0) | MAC(K_i, u | f(K_g, 0))

The value f(K_g, 0) later on allows all nodes to verify the authenticity of a newly distributed group key K_g.
This revocation becomes valid after disclosure of the µTESLA key K_i.
A couple of remarks on some security aspects of LEAP have to be mentioned at this point:
As every node u knowing K_I may compute the master key K_v of every other node v, there is little
additional security to be expected from distinguishing between these different master keys. Especially,
all nodes need to hold K_I during the discovery phase in order to be able to compute the master keys
of answering nodes. The authors of Reference 20 give no reasoning as to why they think that this differentiation
of master keys should attain any additional security. As any MAC construction that deserves its
name should not leak information about K_I in a message authentication code MAC(K_I, r_u | v), it is hard
to see any benefit in this (is it crypto snake oil?).
The synchronization of the time interval for pairwise key negotiation is critical. However, the authors of
Reference 20 give no hint on how the nodes should know when this time interval starts; should there
be a signal, and if so, what should a node do if it misses this signal or sleeps during the interval? It is
clear that if any node is compromised before erasure of K_I, the approach fails to provide protection against
disclosure of pairwise shared keys.
It does not become clear what the purpose of the random value (nonce) in the pairwise shared
key establishment dialog is. Pairwise shared keys are only established during T_min, and most probably, all
neighbors will answer the first message anyway (including the same nonce from this message). This
random value is not even included in the computation of K_u,v, so the only thing that can be defended
against is an attacker that sends replayed replies during T_min; but these would not result in additional
storage of keys K_u,v, or in anything other than having to parse and discard these replays.
The cluster key establishment protocol does not allow a node to check the authenticity of the received
key, as every attacker could send some binary data that is decrypted to something. This would overwrite
an existing cluster key K^c_u with garbage, leading to a DoS vulnerability. By appending a MAC this could
be avoided. However, an additional replay protection would be required in this case in order to avoid
overwriting with old keys.
Furthermore, after expiration of the initial time interval T_min, it is no longer possible to establish pairwise
shared keys among neighbors, so that the LEAP approach does not support later addition or exchange of
sensor nodes.
In 2002, Eschenauer and Gligor [21] proposed a probabilistic key management scheme that is based on
the simple observation that, on the one hand, sharing one key K_G among all sensors leads to weak security,
while on the other hand, sharing individual keys K_i,j among all nodes i, j requires too many keys in large
sensor networks (n^2 − n keys for n nodes). The basic idea of probabilistic key management is to randomly
give each node a so-called key ring containing a relatively small number of keys from a large key pool,
and to let neighboring nodes discover the keys they share with each other. By properly adjusting the size
of the key pool and the key rings, a sufficient degree of shared-key connectivity for a given network size
can be attained.
The basic scheme published in Reference 21 consists of three phases:
Key predistribution
Shared key discovery
Path key establishment
The key predistribution consists of five steps that are processed offline. First, a large key pool P with
about 2^17 to 2^20 keys and accompanying key identifiers is generated. Then, for each sensor, k keys are
randomly selected out of P without replacement, in order to establish the sensor's key ring. Every sensor
is loaded with its key ring comprising the selected keys and their identifiers. Furthermore, all sensor
identifiers and the key identifiers of their key rings are loaded into a controller node. Finally, a shared
key for secure communication with each sensor s is loaded into the controller node ci, according to the
following rule: if K_1, ..., K_k denote the keys on the key ring of sensor s, the shared key K_ci,s is computed
as K_ci,s := E(K_1 ⊕ ... ⊕ K_k, ci).
The main purpose of the key predistribution is to enable any two sensor nodes to identify a common
key with a certain probability. This probability, that two key rings KR1 and KR2 share at least one common
key, can be computed as follows:

Pr(KR1 & KR2 share at least one key) = 1 − Pr(KR1 & KR2 share no key)

The number of possible key rings is:

C(P, k) = P! / (k! (P − k)!)

The number of possible key rings after k keys have been drawn from the key pool without replacement is:

C(P − k, k) = (P − k)! / (k! (P − 2k)!)

Thus, the probability that no key is shared is the ratio of the number of key rings without a match to
the total number of key rings. Hence, the probability of at least one common key is:

Pr(at least one common key) = 1 − ((P − k)! (P − k)!) / (P! (P − 2k)!)
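The probability that two key rings share at least one key, 1 − ((P − k)!)^2 / (P! (P − 2k)!), can be evaluated numerically with log-factorials to avoid overflow. For the parameters quoted later in this section (75 keys out of a pool of 10,000), this sketch yields a value close to one half:

```python
from math import lgamma, exp

def ln_fact(n: int) -> float:
    """Natural log of n!, via the log-gamma function."""
    return lgamma(n + 1)

def p_share(P: int, k: int) -> float:
    """Pr(two random key rings of size k from a pool of P share >= 1 key)."""
    ln_no_share = 2 * ln_fact(P - k) - ln_fact(P) - ln_fact(P - 2 * k)
    return 1.0 - exp(ln_no_share)

p = p_share(10_000, 75)   # close to the p = 0.5 quoted in the text
```

Increasing the ring size k (or shrinking the pool P) raises the connectivity probability, which is exactly the tuning knob the scheme exposes.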
After being installed, all sensor nodes start discovering their neighbors within wireless communication
range, and any two nodes wishing to find out whether they share a key simply exchange the lists of key ids
on their key rings. Alternatively, each node s could broadcast a challenge α together with that challenge
encrypted under each of its keys:

s → *: α | E(K_1, α) | ... | E(K_k, α)

A node receiving such a list would then have to try all its keys, in order to find out matching keys (with
a high probability). This would hide from an attacker which node holds which key ids, but requires more
computational overhead from each sensor node. The shared key discovery establishes a (random graph)
topology in which links exist between nodes that share at least one key. It might happen that one key is
used by more than one pair of nodes.
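The hidden-identifier variant of shared key discovery can be sketched as follows. As an assumption of this sketch, the transformation E(K_i, α) is modeled by a keyed MAC over the challenge, which preserves the property being illustrated: the receiver learns which keys match by trial computation, while key identifiers are never transmitted:

```python
import hmac
import hashlib
import secrets

def tag(key: bytes, challenge: bytes) -> bytes:
    # Stand-in for E(K_i, alpha); any keyed one-way function works here.
    return hmac.new(key, challenge, hashlib.sha256).digest()

pool = [secrets.token_bytes(16) for _ in range(100)]   # (tiny) key pool
ring_s = pool[10:20]                                   # key ring of sender s
ring_r = pool[15:25]                                   # key ring of a receiver

alpha = secrets.token_bytes(8)                         # fresh challenge
broadcast = {tag(k, alpha) for k in ring_s}            # s -> *: alpha | tags

# The receiver tries every key on its own ring against the broadcast list:
shared = [k for k in ring_r if tag(k, alpha) in broadcast]
```

Here the two rings overlap in five pool keys, and the receiver finds exactly those, at the cost of k MAC computations per heard broadcast.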
In the path key establishment phase, path keys are assigned to pairs of nodes (s_1, s_n) that do not share
a key but are connected by two or more links, so that there is a sequence of nodes which share keys and
connect s_1 to s_n. The article [21], however, does not contain any clear information on how path keys
are computed or distributed. It only states that they do not need to be generated by the sensor nodes.
Furthermore, it is mentioned that the design of the DSN ensures that, after the shared key discovery phase
is finished, a number of keys on any ring are left unassigned to any link. However, it does not become
clear from Reference 21 how two nodes can make use of these unused keys for establishing a path key.
If a node is detected to be compromised, all keys on its ring need to be revoked. For this, the controller
node generates a signature key K_e and sends it individually to every sensor node si, encrypted with the
key K_ci,si:

ci → si: E(K_ci,si, K_e)

Afterwards, it broadcasts a signed list of all identifiers of keys that have to be revoked:

ci → *: id_1 | id_2 | ... | id_k | MAC(K_e, id_1 | id_2 | ... | id_k)

Every node receiving this list has to delete all listed keys from its key ring. This removes all links to the
compromised node, plus some more links, from the random graph. Every node that had to remove some
of its links tries to reestablish as many of them as possible by starting a shared key discovery and a path
key establishment phase.
Chan et al. [19] proposed a modification to the basic random predistribution scheme described so
far by requiring the combination of multiple shared keys. In this variant, two nodes are required to share at least
q keys on their rings in order to establish a link. So, if K_1, ..., K_q are the common keys of nodes u and v
(with q ≤ k), the link key is computed as a hash over all q shared keys. A further technique described in
Reference 19, called multi-path key reinforcement, strengthens an already established link key K_u,v by
sending j random values v_1, ..., v_j over disjoint paths and computing the reinforced key as:

K'_u,v = K_u,v ⊕ v_1 ⊕ ... ⊕ v_j
Clearly, the more paths are used, the harder it gets for an attacker to eavesdrop on all of them. However,
the probability of an attacker being able to eavesdrop on a path increases with the length of the path,
so that utilizing more but longer paths does not necessarily increase the overall security attained
by the scheme. In Reference 19, the special case of 2-hop multi-path key reinforcement is analyzed
probabilistically. Furthermore, the paper also describes a third approach, called the random pairwise key
scheme, that hands out keys to pairs of nodes which also store the identity of the respective peer node
holding the same key. The motivation behind this approach is to allow for node-to-node authentication
(see Reference 19 for details).
Concerning security, the following remarks on probabilistic key management should be noted. The
nice property of having a rather high probability that any two given nodes share at least one key (e.g.,
p = 0.5 if 75 keys out of 10,000 keys are given to every node) also plays into the hands of an attacker
who compromises a node, and an attacker that has compromised more than one node has an even
higher probability of holding at least one key in common with any given node. This problem also exists with the
q-composite scheme, as the key pool size is reduced in order to ensure a high enough probability that any
two nodes share at least q keys. This especially concerns the attacker's ability to perform active attacks;
eavesdropping attacks are less probable, because the probability that the attacker holds exactly the key that
two other nodes are using is rather small (and even a lot smaller in the q-composite scheme). Furthermore,
keys of compromised nodes are supposed to be revoked, but as how to detect compromised nodes is
still an open question, how is one to know in a sensor network which nodes and keys should be revoked? Finally,
the presented probabilistic schemes do not support node-to-node authentication (with the exception of
the random pairwise key scheme).

FIGURE 39.8 Aggregating data in a sensor network.
39.6 Secure Data Aggregation
As already mentioned in the introduction, data from different sensors is supposed to be aggregated on its
way toward the base station (see also Figure 39.8). This raises the question of how to ensure authenticity and
integrity of aggregated data. If every sensor added a MAC to its answer in order to ensure data origin
authentication, all (answer, MAC) tuples would have to be sent to the base station in order to enable
checking of their authenticity. This shows that individual MACs are not suitable for data aggregation.
However, if only the aggregating node added one MAC, a subverted node could send arbitrary data
regardless of the data sent by the sensors.
At GlobeCom 2003, Du et al. [22] proposed a scheme that allows a base station to check the integrity
of an aggregated value based on endorsements provided by so-called witness nodes. The basic idea of
this scheme is that multiple nodes perform data aggregation and compute a MAC over their result. This
requires individual keys shared between each node and the base station. In order to allow for aggregated sending
of data, some nodes act as so-called data fusion nodes, aggregating sensor data and sending it toward the
base station. As a data fusion node could be a subverted or malicious node, its result needs to be endorsed
by witness nodes. For this, neighboring nodes receiving the same sensor readings compute their own
aggregated result, compute a MAC over this result, and send it to the data fusion node. The data fusion
node computes a MAC over its own result and sends it together with all received MACs to the base station.
Figure 39.9 illustrates this approach.
In more detail, the scheme described in Reference 22 is as follows:
1. The sensor nodes S_1, S_2, ..., S_n collect data from their environment and make binary decisions
b_1, b_2, ..., b_n (e.g., fire detected) based on some detection rules.
2. Every sensor node sends its decision to the data fusion node F, which computes an aggregated
decision SF.
3. Neighboring witness nodes w_1, w_2, ..., w_m also receive the sensor readings and compute their own
fusion results s_1, s_2, ..., s_m. Every w_i computes a message authentication code MAC_i with the key k_i it
shares with the base station, MAC_i := h(s_i, w_i, k_i), and sends it to the data fusion node.
FIGURE 39.9 Overview of the witness-based approach [22].
4. Concerning the verification at the base station, Du et al. proposed two variants. The first one is an
m + 1 out of m + 1 voting scheme and works as follows:
The data fusion node F computes its message authentication code:

MAC_F := h(SF, F, k_F, MAC_1 ⊕ MAC_2 ⊕ ... ⊕ MAC_m)

F sends to the base station: (SF, F, w_1, ..., w_m, MAC_F).
The base station computes all MAC'_i = h(SF, w_i, k_i) and the authentication code to be expected
from F:

MAC'_F := h(SF, F, k_F, MAC'_1 ⊕ MAC'_2 ⊕ ... ⊕ MAC'_m)

The base station then checks if MAC_F = MAC'_F, and otherwise discards the message.
If the set (w_1, ..., w_m) remains unchanged, the identifiers of the w_i need only be transmitted with
the first MAC_F in order to save transmission bandwidth. There is, however, one major drawback
with this scheme: if one witness deliberately sends a wrong MAC_i, the aggregated data gets refused
by the base station (representing a DoS vulnerability).
5. In order to overcome the DoS vulnerability of the first scheme, Du et al. [22] also proposed an n
out of m + 1 voting scheme:
F sends to the base station: (SF, F, MAC_F, w_1, MAC_1, ..., w_m, MAC_m).
The base station checks if at least n out of m + 1 MACs match, that is, at least n − 1 MAC_i
match MAC_F.
This scheme is more robust against erroneous or malicious witness nodes, but requires a higher
communication overhead, as m MACs must be sent to the base station.
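The m + 1 out of m + 1 verification can be sketched as follows. As assumptions of this sketch, h is instantiated as HMAC-SHA-256 over the concatenated fields, and the witness MACs are chained into the fusion node's MAC by XOR:

```python
import hmac
import hashlib

def h(key: bytes, *fields: bytes) -> bytes:
    return hmac.new(key, b"|".join(fields), hashlib.sha256).digest()

def xor_all(macs) -> bytes:
    out = bytes(32)
    for m in macs:
        out = bytes(a ^ b for a, b in zip(out, m))
    return out

# Keys each node shares with the base station (placeholder values):
keys = {b"F": b"kF", b"w1": b"k1", b"w2": b"k2"}
SF = b"1"   # aggregated binary decision ("fire detected")

# Witnesses endorse their own fusion results s_i (here all equal to SF):
witness_macs = [h(keys[w], SF, w) for w in (b"w1", b"w2")]
mac_F = h(keys[b"F"], SF, b"F", xor_all(witness_macs))

# The base station recomputes everything from SF and the witness ids:
expected = h(keys[b"F"], SF, b"F",
             xor_all(h(keys[w], SF, w) for w in (b"w1", b"w2")))
accepted = hmac.compare_digest(mac_F, expected)
```

Note how the all-or-nothing character of the XOR chain makes the DoS weakness visible: flipping a single witness MAC changes mac_F entirely, so one misbehaving witness suffices to get the result refused.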
Du et al. [22] analyzed the minimum length of the MACs in order to ensure a certain tolerance probability
2^−δ that an invalid result is accepted by the base station. For this, they assume that each MAC has the length
k, there are m witnesses, no witness colludes with F, and F needs to guess the endorsements MAC_i for at
least n − 1 witnesses. As the probability of correctly guessing one MAC_i is p = 1/2^k, the authors compute
the chance of correctly guessing at least n − 1 values as:

P_S = Σ_{i=n−1..m} C(m, i) p^i (1 − p)^(m−i)

After some computation they yield:

δ ≤ m(k/2 − 1)

From this, Du et al. conclude that it is sufficient if mk ≥ 2(δ + m), and give an example of how to apply this.
If δ = 10, so that the probability of accepting an invalid result is 1/1024, and there are m = 4 witnesses,
k should be chosen so that k ≥ 7. This observation is supposed to enable economizing on transmission
effort.
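The bound mk ≥ 2(δ + m) directly yields the minimum MAC length for given parameters; a small sketch (the helper name is illustrative) reproduces the example from the text:

```python
from math import ceil

def min_mac_length(delta: int, m: int) -> int:
    """Smallest k satisfying m*k >= 2*(delta + m)."""
    return ceil(2 * (delta + m) / m)

# delta = 10 (acceptance probability 2**-10 = 1/1024) and m = 4 witnesses:
k4 = min_mac_length(10, 4)   # 7-bit MACs suffice under the bound
k2 = min_mac_length(10, 2)   # fewer witnesses force longer MACs
```

The second call illustrates the trade-off: halving the number of witnesses roughly doubles the MAC length required for the same tolerance.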
In case a data fusion node is corrupted, Du et al. propose to obtain a result as follows: if the verification
at the base station fails, the base station is supposed to poll witness nodes as data fusion nodes, and to
continue trying until the n out of m + 1 scheme described above succeeds. Furthermore, the expected
number of polling messages T(m + 1, n) to be transmitted before the base station receives a valid result
is computed.
Regarding the security of the proposed scheme, however, it has to be considered whether an attacker actually
needs to guess MACs at all in order to send an invalid result. As all messages are transmitted in the clear, an
eavesdropper E could easily obtain valid message authentication codes MAC_i = h(s_i, w_i, k_i). If E later on
wants to act as a bogus data fusion node sending an (at this time) incorrect result s_i, he can replay MAC_i to
support this value. As Reference 22 assumes a binary decision result, an attacker only needs to eavesdrop
until he has received enough MAC_i supporting either value of s_i. Thus, the scheme completely fails to
provide adequate protection against attackers forging witness endorsements.
The main reason for this vulnerability is the missing verification of the freshness of a MAC_i at the base
station. One could imagine, as a quick fix, letting the base station regularly send out random numbers r_B
that have to be included in the MAC computations. In such a scheme, every r_B should only be accepted
for one result, requiring the generation and exchange of large random numbers. A potential alternative
could make use of timestamps, which would require synchronized clocks.
However, there are more open issues with this scheme. For example, it is not clear what should happen
if some witness nodes do not receive enough readings. Also, it is not clear why the MAC_i are not sent
directly from the witness nodes to the base station; this would at least allow for a direct n out of m + 1
voting scheme, avoiding the polling procedure described earlier in case of a compromised data fusion node.
Furthermore, the suffix-mode MAC construction h(message, key) selected by the authors is considered to
be vulnerable [2, note 9.65].
A further issue is how to defend against an attacker flooding the network with "forged" MAC_i ("forged"
meaning arbitrary garbage that looks like a MAC). This would allow an attacker to launch a DoS attack, as an
honest fusion node could not know which values to choose. One more hot fix for this could be using a local
MAC among neighbors to authenticate the MAC_i. Nevertheless, this would imply further requirements
(e.g., shared keys among neighbors, replay protection), and the improved scheme would still
not appear to be mature enough to rely on it.
Some more general conclusions that can be drawn from this are that, first, optimization (e.g., economizing
on MAC size or message length) can be considered one of an attacker's best friends, and, second,
in security we often learn (more) from failures. Nevertheless, the article of Du et al. allows one to discuss the
need for, and the difficulties of, constructing a secure data aggregation scheme that does not consume too
many resources and is efficient enough to be deployed in sensor networks. As such, it can be considered
a valuable contribution despite its security deficiencies.
39.7 Summary
Wireless sensor networks are an upcoming technology with a wide range of promising applications. As
in other networks, however, security is crucial for any serious application. Prevalent security objectives
in wireless sensor networks are confidentiality and integrity of data, as well as availability of sensor
network services, which is threatened by DoS attacks, attacks on routing, etc. Severe resource constraints
in terms of memory, time, and energy, and an unfair power balance between attackers and sensor
nodes make attaining these security objectives particularly challenging. Approaches proposed for wireless
ad hoc networks which are based on asymmetric cryptography are generally considered to be too
resource consuming. This chapter has reviewed basic considerations on protection against DoS and
attacks on routing, and given an overview of the first approaches proposed so far. For ensuring confidentiality
and integrity of data, the SNEP and µTESLA protocols were discussed, and concerning key
management, the LEAP protocol and probabilistic key management were reviewed. At present there
are only a few works on how to design security functions suitable for the specific communication patterns
in sensor networks (especially with respect to data aggregation). The witness-based approach
described in Reference 22, with its flaws, reveals the difficulties in designing an appropriate protocol
for this.
References
[1] Schäfer, G. Security in Fixed and Wireless Networks. John Wiley & Sons, New York, 2003.
[2] Menezes, A., van Oorschot, P., and Vanstone, S. Handbook of Applied Cryptography. CRC Press
LLC, Boca Raton, FL, 1997.
[3] Karl, H. and Willig, A. A Short Survey of Wireless Sensor Networks. TKN Technical report series,
TKN-03-018, Technical University, Berlin, Germany, 2003.
[4] Wood, A. and Stankovic, J. Denial of Service in Sensor Networks. IEEE Computer, 35, 54–62, 2002.
[5] Aura, T., Nikander, P., and Leiwo, J. DOS-Resistant Authentication with Client Puzzles.
In Proceedings of the Security Protocols Workshop 2000, Vol. 2001 of Lecture Notes in Computer
Science. Springer, Cambridge, UK, April 2000.
[6] Karlof, C. and Wagner, D. Secure Routing in Wireless Sensor Networks: Attacks and Countermeasures.
Ad Hoc Networks Journal, 1, 293–315, 2003.
[7] Wood, A. Security in Sensor Networks. Sensor Networks Seminar, University of Virginia,
USA, 2001.
[8] Otway, D. and Rees, O. Efficient and Timely Mutual Authentication. ACM Operating Systems
Review, 21(1), 8–10, 1987.
[9] Hu, Y., Perrig, A., and Johnson, D. Wormhole Detection in Wireless Ad Hoc Networks. Technical
report TR01-384, Rice University, USA, June 2002.
[10] Perrig, A., Szewczyk, R., Tygar, J., Wen, V., and Culler, D. SPINS: Security Protocols for Sensor
Networks. Wireless Networks, 8, 521–534, 2002.
[11] Diffie, W. and Hellman, M.E. New Directions in Cryptography. IEEE Transactions on Information
Theory, IT-22, 644–654, 1976.
[12] Rivest, R.L., Shamir, A., and Adleman, L.A. A Method for Obtaining Digital Signatures and Public
Key Cryptosystems. Communications of the ACM, 21(2), 120–126, 1978.
[13] ElGamal, T. A Public Key Cryptosystem and a Signature Scheme Based on Discrete Logarithms.
IEEE Transactions on Information Theory, 31, 469–472, 1985.
[14] Baldwin, R. and Rivest, R. The RC5, RC5-CBC, RC5-CBC-Pad, and RC5-CTS Algorithms. RFC
2040, IETF, Status: Informational, October 1996. ftp://ftp.internic.net/rfc/rfc2040.txt
[15] Kaliski, B.S. and Yin, Y.L. On the Security of the RC5 Encryption Algorithm. RSA Laboratories
Technical report, TR-602, Version 1.0, 1998.
[16] Gong, L., Needham, R.M., and Yahalom, R. Reasoning About Belief in Cryptographic Protocols.
In Symposium on Research in Security and Privacy. IEEE Computer Society, IEEE Computer Society
Press, Washington, May 1990, pp. 234248.
[17] Haller, N., Metz, C., Nesser, P., and Straw, M. A One-Time Password System. RFC 2289, IETF,
Status: Draft Standard, February 1998. ftp://ftp.internic.net/rfc/rfc2289.txt
[18] Perrig, A. and Tygar, J.D. Secure Broadcast Communication in Wired and Wireless Networks. Kluwer
Academic Publishers, Dordrecht, 2003.
2006 by Taylor & Francis Group, LLC
Sensor Network Security 39-23
[19] Chan, H., Perrig, A., and Song, D. Random Key Predistribution Schemes for Sensor Networks.
In Proceedings of the IEEE Symposium on Security and Privacy. Berkeley, California, 2003,
pp. 197213.
[20] Zhu, S., Setia, S., and Jajodia, S. LEAP: Effcient Security Mechanisms for Large-Scale Distributed
Sensor Networks. In Proceedings of the 10th ACM Conference on Computer and Communication
Security. Washington, DC, USA, 2003, pp. 6272.
[21] Eschenauer, L. and Gligor, V.D. A Key Management Scheme for Distributed Sensor Networks.
In Proceedings of the 9th ACM Conference on Computer and Communication Security. Washington,
DC, USA, 2002, pp. 4147.
[22] Du, W., Deng, J., Han, Y., and Varshney, P.A. Witness-Based Approach for Data Fusion Assurance
in Wireless Sensor Networks. In Proceedings of the IEEE 2003 Global Communications Conference
(Globecom2003). San Francisco, CA, USA, 2003, pp. 14351439.
2006 by Taylor & Francis Group, LLC
40
Software Development for Large-Scale Wireless Sensor Networks
Jan Blumenthal, Frank Golatowski, Marc Haase, and Matthias Handy
University of Rostock
40.1 Introduction .......................................... 40-1
40.2 Preliminaries ......................................... 40-2
     Architectural Layer Model • Middleware and Services for Sensor Networks • Programming Aspect versus Behavioral Aspect
40.3 Current Software Solutions ............................ 40-5
     TinyOS • Maté • TinyDB • SensorWare • MiLAN • EnviroTrack • SeNeTs
40.4 Simulation, Emulation, and Test of Large-Scale Sensor Networks ... 40-16
     TOSSIM: A TinyOS SIMulator • EmStar • Sensor Network Applications (SNA) Test and Validation Environment
40.5 Summary ............................................... 40-25
References ................................................. 40-25
40.1 Introduction
The increasing miniaturization of electronic components and advances in modern communication technologies enable the development of high-performance, spontaneously networked, and mobile systems. Wireless microsensor networks promise novel applications in several domains. Forest fire detection, battlefield surveillance, or telemonitoring of human physiological data are only the vanguard of the many improvements enabled by the deployment of microsensor networks. Hundreds or thousands of collaborating sensor nodes form a microsensor network. Sensor data is collected from the observed area, locally processed or aggregated, and transmitted to one or more base stations.
Sensor nodes can be spread out in dangerous or remote environments, opening up new application fields. A sensor node combines the abilities to compute, communicate, and sense. Figure 40.1 shows the structure of a typical sensor node, consisting of a processing unit, a communication module (radio interface), and sensing and actuator devices.
Figure 40.2 shows a scenario taken from the environmental application domain: leakage detection of dykes. During floods, sandbags are used to reinforce dykes. Piled along hundreds of kilometers around
FIGURE 40.1 Structure of a sensor node: a central unit (processor, memory) connected to sensors, an actuator, a communication module, and a battery.
FIGURE 40.2 Example of sensor network application: leakage detection (sandbags with sensors along a river reporting to a base station).
lakes or rivers, sandbag dykes keep waters at bay and bring relief to residents. Sandbags are stacked against sluice gates and parts of broken dams to block off the tide. To locate spots of leakage, each sandbag is equipped with a moisture sensor and transmits sensor data to a base station next to the dyke. Thus, leakages can be detected earlier and reinforcement actions can be coordinated more efficiently.
Well-known research activities in the field of sensor networks are UCLA's WINS [1], Berkeley's Smart Dust [2], WEBS [3], and PicoRadio [4]. An example of European research activities is the EYES project [5]. Detailed surveys on sensor networks can be found in [6] and [7]. This chapter focuses on innovative architectures and basic concepts of current software development solutions for wireless sensor networks.
40.2 Preliminaries
The central unit of a sensor node is a low-power microcontroller that controls all functional parts. Software for such a microcontroller has to be resource aware on the one hand. On the other hand, several Quality-of-Service (QoS) aspects have to be met by sensor node software, such as latency, processing time for data fusion or compression, or flexibility regarding routing algorithms or MAC techniques.
Conventional software development for microcontrollers usually covers the hardware abstraction layer (HAL), operating system and protocols, and application layer. Often, software for microcontrollers is limited to an application-specific monolithic software block that is optimized for performance and resource usage. Abstracting layers, such as the HAL or operating system, are often omitted due to resource constraints and low-power aspects.
Microcontrollers are often developed and programmed for a specific, well-defined task. This limitation of the application domain leads to high-performance embedded systems even under strict resource constraints. Development and programming of such systems, however, require considerable effort. Furthermore, an application developed for one microcontroller is in most cases not portable to any other one, so that it has to be reimplemented from scratch. Microcontroller and application form an inseparable unit. If the application domain of an embedded system changes, often the whole microcontroller is replaced instead of writing and downloading a new program.
For sensor nodes, application-specific microcontrollers are preferred over general-purpose microprocessors because of the small size and the low energy consumption of those controllers. However, the requirements concerning a sensor node exceed the main characteristics of a conventional microcontroller
and its software. The main reason for this is the dynamic character of a sensor node's task. Sensor nodes can adopt different tasks, such as sensor data acquisition, data forwarding, or information processing. The task assigned to a node at its deployment is not fixed until the end of its life-cycle. Depending on, for instance, the location, energy level, or neighborhood of a sensor node, a task change can become advantageous or even necessary.
Additionally, software for sensor nodes should be reusable. An application running on a certain sensor node should not be tied to a specific microcontroller but should, to some extent, be portable onto different platforms to enhance the interoperability of sensor nodes with different hardware platforms. Not limited to software development for wireless sensor networks is the general requirement for straightforward programmability and, as a consequence, a short development time.
It is quite hard or even impossible to meet the requirements mentioned above with a monolithic application. Hence, at present there is much research effort in the areas of middleware and service architectures for wireless sensor networks. A middleware for wireless sensor networks should encapsulate required functionality in a layer between operating system and application. Incorporating a middleware layer has the advantage that applications become smaller and are not tied to a specific microcontroller. At the same time, the development effort for sensor node applications (SNAs) is reduced, since a significant part of the functionality moves from the application to the middleware. Another research domain addresses service architectures for wireless sensor networks. A service layer is based on the mechanisms of a middleware layer and makes its functionality more usable.
40.2.1 Architectural Layer Model
As in other networked systems, the architecture of a sensor network can be divided into different layers (see Figure 40.3). The lower layers are the hardware and the HAL. The operating system layer and protocols are above the hardware-related layers. The operating system provides basic primitives, such as multithreading, resource management, and resource allocation, that are needed by higher layers. Access to the radio interface and input/output operations to sensing devices are also supported by basic operating system primitives. Usually, in node-level operating systems these primitives are rudimentary and there is no separation between user and kernel mode. On top of the operating system layer reside the middleware, service, and application layers.
In recent years, much work has been done to develop sensor network node devices (e.g., Berkeley motes [8]), operating systems, and algorithms, for example, for location awareness, power reduction, data aggregation, and routing. Today, researchers are working on extended software solutions including middleware and service issues for sensor networks. The main focus of these activities is to simplify the application development process and to support dynamic programming of sensor networks.
The overall development process of sensor node software usually ends with a manual download of an executable image over a direct wired connection or an over-the-air interface to the target node.
FIGURE 40.3 Layered software model: hardware, hardware abstraction layer, operating systems and protocols, middleware, services, and applications.
After deployment of the nodes, it is nearly impossible to improve programs on or adapt new programs to the target nodes. But this feature is necessary in future wireless sensor networks to adapt the behavior of the sensor network dynamically through newly injected programs or capsules, a possibility that exists in Maté [24].
40.2.2 Middleware and Services for Sensor Networks
In sensor networks, the design and development of solutions for higher-level middleware functionality and the creation of service architectures are open research issues. Middleware for sensor networks has two primary goals:
Support of acceptable middleware application programming interfaces (APIs), which abstract and simplify low-level APIs to ease application software development and to increase portability.
Distributed resource management and allocation.
Besides the native network functions, such as routing and packet forwarding, future software architectures are required to enable the location and utilization of services. A service is a program that can be accessed through standardized functions over a network. Services allow a cascading without previous knowledge of each other, and thus enable the solution of complex tasks. A typical service used during the initialization of a node is the localization of a data sink for sensor data. Gateways or neighboring nodes can provide this service. To find services, nodes use a service discovery protocol.
40.2.3 Programming Aspect versus Behavioral Aspect
Wireless sensor networks do not have to consist of homogeneous nodes. In reality, a network composed of several groups of different sensor nodes is imaginable. This fact changes the software development approach and points out new challenges, as they are well known from the distributed systems domain. In an inhomogeneous wireless sensor network, nodes contain different low-level system APIs, however, with similar functions. From a developer's point of view, it is hard to create programs, since the APIs are mostly incompatible. To overcome the mentioned problems of heterogeneity and complexity, new software programming techniques are required. One attempt to accomplish this is the definition of an additional API or an additional class library on top of each system API. But they are all limited by some means or other, for example, in platform independence, flexibility, quantity, or programming language. All approaches to achieve an identical API on different systems are covered by the programming aspect (Figure 40.4).
FIGURE 40.4 Two aspects of software for wireless sensor networks: the programming aspect (system-wide API, splitting the complexity of APIs, hiding the heterogeneity of distributed systems, separation of interface and implementation, optimization of interfaces) and the behavioral aspect (access to remote resources without previous knowledge, adaptation of software to dynamic changes, task change, evolution of the network over time).
The programming aspect enables the developer to easily create programs on different hardware and software platforms. But an identical API on all platforms does not necessarily take the dynamics of the distributed system into account. Ideally, the application does not notice any dynamic system changes. This decoupling is termed the behavioral aspect and covers:
Access to remote resources without previous knowledge, for example, remote procedure calls (RPCs) and discovered services.
Adaptations within the middleware layer to dynamic changes in the behavior of a distributed system, caused by incoming or leaving resources, mobility of nodes, or changes of the environment.
The ability of the network to evolve over time, including modifications of the system's task, exchange or adaptation of running software parts, and mobile agents.
40.3 Current Software Solutions
This section presents five important software solutions for sensor networks. It starts with the most mature development, TinyOS, and its dependent software packages. It continues with SensorWare, followed by two promising concepts, MiLAN and EnviroTrack. The section concludes with an introduction to SeNeTs, which features interface optimization.
40.3.1 TinyOS
TinyOS is a component-based operating system for sensor networks developed at UC Berkeley. TinyOS can be seen as an advanced software framework [8] that has a large user community due to its open-source character and its promising design. The framework contains numerous prebuilt sensor applications and algorithms, for example, multihop ad hoc routing, and supports different sensor node platforms. Originally, it was developed for Berkeley's Mica motes. Programmers experienced with the C programming language can easily develop TinyOS applications written in a proprietary language called nesC [9].
The design of TinyOS is based on the specific sensor network characteristics: small physical size, low-power consumption, concurrency-intensive operation, multiple flows, limited physical parallelism and controller hierarchy, diversity in design and usage, and robust operation to facilitate the development of reliable distributed applications. The main intention of the TinyOS developers was to respect the energy, computational, and storage constraints of sensor nodes by managing the hardware capabilities effectively, while supporting concurrency-intensive operation in a manner that achieves efficient modularity and robustness [10]. Therefore, TinyOS is optimized in terms of memory usage and energy efficiency. It provides defined interfaces between the components that reside in neighboring layers. A layered model is shown in Figure 40.5.
40.3.1.1 Elemental Properties
TinyOS utilizes an event model instead of a stack-based threaded approach, which would require more stack space and multitasking support for context switching, to handle high levels of concurrency in a very small amount of memory. Event-based approaches are the favored solution to achieve high
FIGURE 40.5 Software architecture of TinyOS: Main (includes scheduler), application (user components), acting, sensing, and communication components, and the hardware abstraction.
performance in concurrency-intensive applications. Additionally, the event-based approach uses CPU resources more efficiently and therefore takes care of the most precious resource, energy.
An event is serviced by an event handler. More complex event handling can be done by a task. The event handler is responsible for posting the task to the task scheduler. Event and task scheduling is performed by a two-level scheduling structure. This kind of scheduling ensures that events, associated with a small amount of processing, can be performed immediately, while longer-running tasks can be interrupted by events. Tasks are handled rapidly; however, no blocking or polling is permitted.
The TinyOS system is designed to scale with technology trends, supporting both smaller designs and the crossover of software components into hardware. The latter provides a straightforward integration of software components into hardware.
40.3.1.2 TinyOS Design
The architecture of a TinyOS system configuration is shown in Figure 40.6. It consists of the tiny scheduler and a graph of components. Components satisfy the demand for modular software architectures. Every component consists of four interrelated parts: a command handler, an event handler, an encapsulated fixed-size and statically allocated frame, and a bundle of simple tasks. The frame represents the internal state of the component. Tasks, commands, and handlers execute in the context of the frame and operate on its state. In addition, the component declares the commands it uses and the events it signals. Through this declaration, modular component graphs can be composed. The composition process creates layers of components. Higher-layer components issue commands to lower-level components, and these signal events to higher-level components. To provide an abstract definition of the interaction of two components via commands and events, the bidirectional interface is introduced in TinyOS.
FIGURE 40.6 TinyOS architecture in detail.
Commands are nonblocking requests made to lower-layer components. A command provides feedback to its caller by returning status information. Typically, the command handler puts the command parameters into the frame and posts a task into the task queue for execution. Whether the command was successful can be signaled by an event. Event handlers are invoked by events of lower-layer components or, when directly connected to the hardware, by interrupts. Similar to commands, the frame will be modified and tasks are posted. Both commands and event handlers perform a small, fixed amount of work, similar to interrupt service routines. Tasks perform the primary work. They are atomic, run to completion, and can only be preempted by events. Tasks are queued in a First In First Out (FIFO) task scheduler, so that event or command handling routines can return immediately. Due to the FIFO scheduling, tasks are executed sequentially and should be short. As an alternative to the FIFO task scheduler, priority-based or deadline-based schedulers can be implemented in the TinyOS framework.
TinyOS distinguishes three categories of components. Hardware abstraction components map physical hardware into the component model. Mostly, these components export commands to the underlying hardware and handle hardware interrupts. Synthetic hardware components extend the functionality of hardware abstraction components by simulating the behavior of advanced hardware functions, for example, bit-to-byte transformation functions. In future hardware releases, these components can be cast directly into hardware. High-level software components perform application-specific tasks, for example, control, routing, data transmission, calculation on data, and data aggregation.
An interesting aspect of the TinyOS framework is the similarity of the component description to the description of hardware modules in hardware description languages, for example, VHDL or Verilog. A hardware module, for example, in VHDL, is defined by an entity with input and output declarations, status registers to hold the internal state, and a finite state machine controlling the behavior of the module. In comparison, a TinyOS component contains commands and events, the frame, and a behavioral description. These similarities simplify the cast of TinyOS components into hardware modules. Future sensor node generations can benefit from this similarity in describing hardware and software components.
40.3.1.3 TinyOS Application
A TinyOS application consists of one or more components. These components are separated into modules and configurations. Modules implement application-specific code, whereas configurations wire different components together. By using a top-level configuration, wired components can be compiled and linked to form an executable. The interfaces between the components declare a set of commands and events, which provide an abstract description of the components. The application developer has to implement the appropriate handling routine in the component.
Figure 40.7 shows the component graph of a simple TinyOS application that turns an LED on and off depending on the clock. The top-level configuration contains the application-specific components (ClockC, LedsC, BlinkM) and an operating-system-specific component providing the tiny task scheduler and initialization functions. The Main component encapsulates the TinyOS-specific components from the application. StdControl, Clock, and Leds are the interfaces used in this application. While BlinkM contains the application code, ClockC and LedsC are again configurations encapsulating further component graphs controlling the hardware clock and the LEDs connected to the controller. TinyOS provides a variety of additional extensions, such as the virtual machine (VM) Maté and the database TinyDB for cooperative data acquisition.
40.3.2 Maté
Maté [24] is a byte-code interpreter for TinyOS. It is a tiny communication-centric VM designed as a component for the system architecture of TinyOS. Maté is located in the component graph on top of several system components, represented by sensor components, a network component, a timer component, and a nonvolatile storage component.
The developers' motivation for Maté was to solve novel problems in sensor network management and programming in response to changing tasks, for example, the exchange of the data aggregation function
BlinkM
ClockC LedsC
Main
StdControl
Clock Leds
Clock LED
Hardware
TinyOS
component graph
FIGURE 40.7 Simple TinyOS application.
or the routing algorithm. However, the associated, inevitable reprogramming of hundreds or thousands of nodes is constrained by the energy and storage resources of the sensor nodes. Furthermore, the network is limited in bandwidth, and network activity is a large energy draw. Maté attempts to overcome these problems by propagating so-called code capsules through the sensor network. The Maté VM provides the possibility to compose a wide range of sensor network applications by the use of a small set of higher-level primitives. In Maté, these primitives are one-byte instructions, and they are stored in capsules of 24 instructions together with identifying and versioning information.
40.3.2.1 Maté Architecture
Maté is a stack-based architecture that allows a concise instruction set. The use of instructions hides the asynchronous character of native TinyOS programming, because instructions are executed successively as several TinyOS tasks.
The Maté VM shown in Figure 40.8 has three execution contexts: Clock, Send, and Receive, which can run concurrently at instruction granularity. Clock corresponds to timer events and Receive to message receive events, signaled from the underlying TinyOS components. Send can only be invoked from the Clock or Receive context. Each context holds an operand stack for handling data and a return stack for subroutine calls. Subroutines allow programs to be more complex than a single capsule can provide. Therefore, Maté has four spaces for subroutine code.
The code for the contexts and the subroutines is installed dynamically at runtime by code capsules. One capsule fits into the code space of a context or subroutine. The capsule installation process supports self-forwarding of capsules to reprogram a whole sensor network with new capsules. It is the task of the sensor network operator to inject code capsules in order to change the behavior of the network.
Program execution in Maté starts with a timer event or a packet receive event. The program counter jumps to the first instruction of the corresponding context (Clock or Receive) and executes until it reaches the Halt instruction. Each context can call subroutines for expanded functionality. The Send context is invoked from the other contexts to send a message in response to a sensor reading or to route an incoming message.
FIGURE 40.8 Maté architecture: three execution contexts (Clock, Send, Receive), each with its own code space, operand stack, return stack, and program counter; four subroutine code spaces; and a single shared variable, all on top of the TinyOS framework (network, timer, sensor, and logger components).
The Maté architecture provides separation of contexts. One context cannot access the state of another context. There is only one single shared variable among the three contexts, which can be accessed by special instructions. The context separation qualifies Maté to fulfill the traditional role of an operating system. Compared to native TinyOS applications, the source code of Maté applications is much shorter.
40.3.3 TinyDB
TinyDB is a query processing system for extracting information from a network of TinyOS sensor nodes [11]. TinyDB provides a simple, SQL-like interface to specify the kind of data to be extracted from the network along with additional parameters, for example, the data refresh rate. The primary goal of TinyDB is to free the user from writing embedded C programs for sensor nodes or composing capsules of instructions as in Maté. The TinyDB framework allows data-driven applications to be developed and deployed much more quickly than developing, compiling, and deploying a TinyOS application.
Given a query specifying the data interests, TinyDB collects the data from sensor nodes in the environment, filters and aggregates the data, and routes it to the user autonomously. The network topology in TinyDB is a routing tree. Query messages flood down the tree, and data messages flow back up the tree, participating in more complex data query processing algorithms.
The TinyDB system is divided into two subsystems: the sensor node software and a Java-based client interface on a PC. The sensor node software is the heart of TinyDB, running on each sensor node. It consists of:
A sensor catalog and schema manager, responsible for tracking the set of attributes, or types of readings, and the properties available on each sensor.
A query processor, utilizing the catalog to fetch the values of local attributes, to receive sensor readings from neighboring nodes, to combine and aggregate the values together, to filter, and to output the values to parents.
A small, handle-based dynamic memory manager.
A network topology manager to deal with the connectivity of nodes and to effectively route data and query subresults through the network.
The sensor node part of TinyDB is installed on top of TinyOS on each sensor node as an application. The Java-based client interface is used to access the network of TinyDB nodes from a PC physically connected to a bridging sensor node. It provides a simple graphical query builder and a result display. The Java API simplifies writing PC applications that query and extract data from the network.
40.3.4 SensorWare
SensorWare is a software framework for wireless sensor networks that provides querying, dissemination, and fusion of sensor data as well as coordination of actuators [12]. A SensorWare platform has less stringent resource restrictions; the initial implementation runs on iPAQ handhelds (1 MB ROM/128 KB RAM). The authors intended to develop a software framework regardless of present sensor node limitations.
SensorWare, developed at the University of California, Los Angeles, aims at the programmability of an existing sensor network after its deployment. The functionality of sensor nodes can be dynamically modified through autonomous mobile agent scripts. SensorWare scripts can be injected into the network nodes as queries and tasks. After injection, scripts can replicate and migrate within the network. The motivation for the SensorWare development was the observation that the distribution of updates and the download of complete images to sensor nodes are impractical for the following reasons. First, in a sensor network, a particular sensor node may not be addressable because of missing node identifiers. Second, the distribution of complete images through a sensor network is highly energy consuming. Besides that, other nodes are affected by a download when multihop connections are necessary.
Updating complete images does not correspond to the low-power requirements of sensor networks. As a consequence, it is more practicable to distribute only small scripts. In the following section, the basic architecture and concepts of SensorWare are described in detail.
40.3.4.1 Basic Architecture and Concepts
SensorWare consists of a scripting language and a runtime environment. The language contains various basic commands that control and execute specific tasks of sensor nodes. These tasks include, for example, communication with other nodes, collaboration on sensor data, sensor data filtering, and moving scripts to other nodes. The language comprises the necessary constructs to generate appropriate control flows.
SensorWare utilizes Tcl as its scripting language. However, SensorWare extends Tcl's core commands. These core extension commands are joined in several API groups, such as the Networking API, Sensor API, and Mobility API (see Figure 40.9).
SensorWare is event based. Events are connected to special event handlers. If an event is signaled, an event handler serves the event according to its inherent state. Furthermore, an event handler is able to generate new events and to alter its current state by itself.
The runtime environment shown in Figure 40.10 contains fixed and platform-specific tasks. Fixed tasks are part of each SensorWare application. It is possible to add platform-specific tasks depending on specific application needs. The script manager task receives new scripts and forwards requests to the admission
FIGURE 40.9 SensorWare scripting language.
FIGURE 40.10 SensorWare runtime environment: the script manager (e.g., state tracking, creating new scripts), admission control and policing of resource usage, and resource handling for radio/networking, CPU and time services, and sensing.
Applications
and
services
Injection of
scripts by user
Message
exchanging
Applications
and
services
HW abstraction layer
RTOS
Scripts Scripts
SensorWare
Code
migration
SensorWare
Hardware
Sensor node1
HW abstraction layer
RTOS
Hardware
Sensor node2
FIGURE 40.11 Sensor node architecture.
control task. The admission control task is responsible for script admission decisions and checks the
overall energy consumption. Resource handlers manage different resources of the network.
Figure 40.11 shows the architecture of sensor nodes with the SensorWare software included. The
SensorWare layer uses operating system functions to provide the runtime environment and to control
scripts. Static node applications coexist with mobile scripts. To realize dynamic programmability of a
deployed sensor network, a transient user can inject scripts into it. After injection, scripts are replicated
within the network and the script code migrates between different nodes. SensorWare ensures that no
script is loaded twice onto a node during the migration process.
40.3.5 MiLAN
Middleware Linking Applications and Networks (MiLAN) is a middleware concept introduced by Mark
Perillo and Wendi B. Heinzelman from the University of Rochester [13,14]. The main idea is to exploit the
redundancy of information provided by sensor nodes. The performance of a cooperative algorithm in a
distributed sensor network application depends on the number of involved nodes. Because of the inherent
redundancy of a sensor network, where several sensor nodes provide similar or even equal information,
evaluating all possible sensor nodes leads to high energy and network costs. Therefore, a sensor network
application has to choose an appropriate set of sensor nodes to fulfill application demands.
Each application should have the ability to adapt its behavior to the available set of components
and bandwidth within the network. This can be achieved by a parameterized sensor node selection
process with different cost values. These cost values are described by the following cost equations:

Application performance: The minimum requirements for network performance are calculated from
the needed reliability of monitored data: F_R = {S_i : ∀j ∈ J, R(S_i, j) ≥ r_j}, where F_R stands for the
allowable set of possible sensor node combinations, S_i represents the available sensor nodes, and
R(S_i, j) is their reliability with respect to the monitored variable j.

Network costs: Defines a subset of sensor nodes that meets the network constraints. The network
feasible set is F_N = {S_i : N(S_i) ≤ n_0}, where N(S_i) represents the total cost and n_0 the maximal data
rate the network can support.

Application performance and network costs are combined into the overall feasible set F = F_R ∩ F_N.

Energy: Describes the energy dissipation of the network: C_P(S_i) = Σ_{s_j ∈ S_i} C_P(s_j), where C_P(s_j) is
the power cost of node s_j.
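A minimal sketch of how such a selection could be computed (a hypothetical C illustration; MiLAN itself hides this process from the application): a candidate node combination is acceptable if it lies in both F_R and F_N, and its energy cost is the sum of the per-node power costs.

```c
/* Hypothetical sketch of MiLAN-style node-set selection (illustration only;
 * MiLAN hides this process from the application). A candidate combination
 * S_i is feasible if it meets the reliability requirement r_j for every
 * monitored variable j (membership in F_R) and its total network cost stays
 * within the supported data rate n_0 (membership in F_N). */
#define NUM_VARS 2

struct candidate {
    double reliability[NUM_VARS]; /* R(S_i, j) for each variable j      */
    double net_cost;              /* N(S_i), data rate the set demands  */
};

/* Membership test for the overall feasible set F = F_R ∩ F_N. */
static int feasible(const struct candidate *c,
                    const double r_min[NUM_VARS], double n0)
{
    for (int j = 0; j < NUM_VARS; j++)
        if (c->reliability[j] < r_min[j])
            return 0;                 /* fails F_R for variable j */
    return c->net_cost <= n0;         /* must also lie in F_N     */
}

/* C_P(S_i): energy dissipation, summed over the nodes s_j in the set. */
static double energy_cost(const double node_cost[], int n)
{
    double sum = 0.0;
    for (int j = 0; j < n; j++)
        sum += node_cost[j];
    return sum;
}
```

An application would weight `feasible` candidates by their energy cost; in MiLAN this trade-off is resolved inside the middleware rather than by the application itself.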
It is up to the application to decide how these equations are weighted. This decision-making process is
completely hidden from the application; thus, the development process is simplified significantly. MiLAN
uses two strategies to balance QoS and energy costs:

Turning off nodes with redundant information
Using energy-efficient routing

The MiLAN middleware is located between the network and application layers. It can interface with a great
variety of underlying network protocols, such as Bluetooth and 802.11. MiLAN uses an API to abstract
from the network layer but gives the application access to low-level network components. A set of
commands identifies and configures the network layer.
40.3.6 EnviroTrack
EnviroTrack is a TinyOS-based application developed at the University of Virginia that solves a
fundamental distributed computing problem: environmental tracking of mobile entities [25].
EnviroTrack provides a convenient way to program sensor network applications that track activities in
their physical environment. The programming model of EnviroTrack integrates objects living in physical
time and space into the computational environment of the application through virtual objects, called
tracking objects. A tracking object is represented by a group of sensor nodes in its vicinity and is addressed
by context labels. If an object moves in the physical environment, the corresponding virtual object moves
too, because it is not bound to a dedicated sensor node. Regarding the tracking of objects, EnviroTrack
does not assume cooperation from the tracked entity.
Before a physical object or phenomenon can be tracked, the programmer has to specify its activities
and corresponding actions. This specification enables the system to discover and tag those activities and
to instantiate tracking objects. For example, to track an object warmer than 100°C, the programmer
specifies a Boolean function, temperature > 100°C, and a critical number or mass of sensor nodes that must
fulfill the Boolean function within a certain time (a requirement often referred to as freshness of information).
These parameters of a tracking object are called its aggregate state. All sensor nodes matching this aggregate
state join a group. The network abstraction layer assigns a context label to this group. Using this label,
different groups can be addressed independently of the set of nodes currently assigned to them. If the tracked
object moves, nodes join or leave the group because of the changed aggregate state, but the label
persists. This group management enables context-specific computation.
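The aggregate-state test from the temperature example above can be sketched as follows (a hypothetical C illustration; EnviroTrack itself is built from TinyOS components, and the names here are invented): a group forms when a critical mass of nodes satisfies the Boolean condition within the freshness window.

```c
/* Hypothetical sketch of an EnviroTrack-style aggregate-state test: a group
 * forms when at least `critical_mass` nodes have reported the sensing
 * condition (here, temperature > 100) within the freshness window. */
struct report {
    double temperature;   /* sensed value from one node */
    long   timestamp_ms;  /* when the node reported it  */
};

static int aggregate_state_met(const struct report *r, int n,
                               long now_ms, long freshness_ms,
                               int critical_mass)
{
    int count = 0;
    for (int i = 0; i < n; i++)
        if (r[i].temperature > 100.0 &&                  /* Boolean condition */
            now_ms - r[i].timestamp_ms <= freshness_ms)  /* fresh enough      */
            count++;
    return count >= critical_mass;                       /* critical mass     */
}
```

When this predicate holds, the matching nodes would join a group and receive a context label from the network abstraction layer.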
The EnviroTrack programming system consists of:
EnviroTrack compiler. In EnviroTrack programs, a list of context declarations is defined. Each definition
includes an activation statement, an aggregate state definition, and a list of objects attached to the
definitions. The EnviroTrack compiler includes C program templates. The whole project is then built using
the TinyOS development tools.
Group management protocol. All sensors associated with a group are maintained by this protocol. A group
leader is selected from the group members when the critical mass of nodes and the freshness of the
approximate aggregate state are reached. The group management protocol ensures that only a single group
leader per group exists. The leader sends a periodic heartbeat to inform its members that it is alive.
Additionally, the heartbeat signal is used to synchronize the nodes and to inform nodes that are not part
of the group but fulfill the sensing condition.
Object naming and directory services. These services maintain all active objects and their locations. The
directory service provides a way to retrieve all objects of a given context type. It also assigns names to
groups so they can be accessed easily. It also handles the dynamic joining and leaving of group members.
Communication and transport services. The Migration Transport Protocol (MTP) is responsible for the
transportation of data packets between nodes. All messages are routed via group leader nodes. Group
leader nodes identify the context group of the target node and the position of its leader using the directory
service. The packet is then forwarded to the leader of the destination group. All leadership information
provided by MTP packets is stored in the leaders on a least recently used basis to keep the leader up-to-date
and to reduce directory lookups.
EnviroTrack enables the construction of an information infrastructure for tracking environmental
conditions. It manages dynamic groups of redundant sensor nodes and attaches computation to external
events in the environment. Furthermore, EnviroTrack implements uninterrupted communication
between dynamically changing physical locales defined by environmental events.
40.3.7 SeNeTs
SeNeTs is a middleware architecture for wireless sensor networks developed at the University of
Rostock [15]. The SeNeTs middleware is primarily designed to support the developer of a wireless sensor
network during the predeployment phase (programming aspect). SeNeTs supports the creation of small
and energy-saving programs for heterogeneous networks. One of the key features of SeNeTs is the
optimization of APIs. The required configuration, optimization, and compilation of software components is
processed by a development environment. Besides the programming aspect, the middleware also supports
the behavioral aspect, such as task changes or evolution over time.
40.3.7.1 SeNeTs Architecture
SeNeTs is based on the software layer model introduced in Chapter 2. To increase flexibility and enhance
scalability of sensor node software, it is separated into small functional blocks as shown in Figure 40.12. In
addition, the operating system layer is separated into a node-specific operating system and a driver layer,
which contains at least one sensor driver and several hardware drivers, such as a timer driver and an RF driver.
The node-specific operating system handles device-specific tasks, for example, boot-up, initialization of
hardware, memory management, and process management as well as scheduling. The host middleware is the
superior software layer. Its main task is to organize the cooperation of distributed nodes in the network.
The middleware core handles four optional components, which can be implemented and exchanged according
to the node's task. Modules are additional components that increase the functionality of the middleware.
Typical modules are routing modules or security modules. Algorithms describe the behavior of modules.
FIGURE 40.12 Structure of a node application.

FIGURE 40.13 Structure of a sensor network.
For example, the behavior of a security module can vary in case the encryption algorithm changes.
The services component contains the required software to perform local and cooperative services. This
component usually cooperates with the service components of other nodes to fulfill its task. VMs enable the
execution of platform-independent programs installed at runtime.
Figure 40.13 shows the expansion of the proposed architecture to a whole sensor network from the
logical point of view. Nodes can only be contacted through services of the middleware layers. The
distributed middleware coordinates the cooperation of services within the network. It is logically located
in the network layer but physically exists in the nodes. All layers together, in conjunction with their
configuration, compose the sensor network application. Thus, nodes do not perform any individual tasks.
The administration terminal is an external entity used to configure the network and evaluate results. It can
be connected to the network at any location.
All functional blocks of the described architecture are represented by components containing real source
code and an XML description of their dependencies, interfaces, and parameters. One functional block can
be realized by alternative components. All components are predefined in libraries.
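As an illustration, such a component description might look as follows (the element and attribute names are hypothetical; the text does not specify the actual SeNeTs XML schema):

```xml
<!-- Hypothetical SeNeTs-style component description; schema names invented -->
<component name="routing_flooding">
  <dependencies>
    <requires component="radio_driver"/>
  </dependencies>
  <interfaces>
    <function name="route_send" params="dest,msg"/>
  </interfaces>
  <parameters>
    <param name="max_hops" type="uint8" default="4"/>
  </parameters>
</component>
```

The development environment would match such descriptions against each other to check dependencies and to configure and optimize the selected components.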
40.3.7.2 Interface Optimization
One of the key features in SeNeTs is interface optimization. Interfaces are the descriptions of functions
between two software parts. As illustrated in Figure 40.14, higher-level applications using services and
middleware technologies require abstract software interfaces. The degree of hardware-dependent
interfaces increases in lower software layers. Hardware-dependent interfaces are characterized by parameters
that configure hardware components directly, in contrast to abstract software interfaces, whose parameters
describe abstractions of the underlying system.

FIGURE 40.14 Interfaces within the software-layer model.

FIGURE 40.15 Interface optimization.
Software components require a static software interface to the application in order to minimize
customization effort for other applications and to support compatibility. The use of identical components
in different applications leads to a higher number of complex interfaces in these components. This is
caused by component programming that aims to support as many use cases of all possible applications
as possible, whereby each application uses only a subset of the functionality of a component. Reducing the
resulting overhead is the objective of generic software and can be done by interface optimization at
compile time.
Interface optimizations result in proprietary interfaces within a node (Figure 40.15). Parts of the
software can then no longer be exchanged without considerable effort. In a sensor node, the software is
mostly static, except for programs running in VMs. Accordingly, static linking is preferred. Statically linked
software in conjunction with interface optimization leads to faster and smaller programs.

In SeNeTs, interfaces are customized to the application, in contrast to common approaches used in
desktop computer systems, which are characterized by huge adaptation layers. The interface optimization
can be propagated through all software layers and, therefore, saves resources.
As an example of an optimization, a function OpenSocket(int name, int mode) identifies the network
interface with its first parameter and the opening mode with its second parameter. However, a node that has
only one interface, opened with a constant mode once or twice, does not need these parameters. Consequently,
knowledge of this information at compile time can be used for optimization, for example, by:

Inlining the function
Eliminating both parameters from the delivery process
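The effect can be sketched in C (only the OpenSocket signature comes from the text; the toy body and the specialized function are invented for illustration):

```c
/* Toy stand-in for the generic interface from the text: every caller
 * delivers both parameters at run time. */
static int open_count[4];

static int OpenSocket(int name, int mode)
{
    open_count[name] = mode;      /* pretend to configure the interface */
    return name;
}

/* After interface optimization for a node that has exactly one interface
 * (name 0) always opened in the same mode (1): both parameters become
 * compile-time constants, their delivery is eliminated, and the specialized
 * function can be inlined at every call site. */
static inline int OpenSocket_node(void)
{
    return OpenSocket(0, 1);
}
```

With the constants visible at compile time, the compiler can fold the specialized call away entirely, which is exactly the saving the two bullet points describe.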
TABLE 40.1 Types of Interface Optimization

Parameter elimination: Parameters that are not used in any of the called subfunctions can be removed.

Static parameters: If a function is always called with the same parameters, these parameters can be
defined as constants or static variables in the global namespace. Thus, the parameter delivery to the
function can be removed.

Parameter ordering: The sequence order of parameters is optimized in order to pass parameters through
cascading functions with the same or similar parameters. This is particularly favorable in systems using
processor registers instead of the system stack to deliver parameters to subfunctions.

Parameter aggregation: In embedded systems, many data types are not byte-aligned, for example, bits
that configure hardware settings. If a function has several non-byte-aligned parameters, these
parameters may be combined.
Another possibility is to change the semantics of data types. A potential use case is the definition of
the accuracy of addresses, which results in changing the width of data types. In SeNeTs, several types of
interface optimization are proposed, as given in Table 40.1.
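Parameter aggregation, for instance, can be sketched in C (the field widths and names are hypothetical): three non-byte-aligned hardware settings are packed into a single byte instead of being passed as three separate parameters.

```c
#include <stdint.h>

/* Hypothetical sketch of parameter aggregation: a 3-bit channel, a 2-bit
 * gain, and a 1-bit power flag are combined into one uint8_t, so a
 * configuration function needs a single parameter instead of three. */
#define PACK_CFG(chan, gain, on) \
    ((uint8_t)(((chan) & 0x7) | (((gain) & 0x3) << 3) | (((on) & 0x1) << 5)))

/* Accessors that unpack the aggregated parameter again. */
static unsigned cfg_channel(uint8_t cfg) { return cfg & 0x7; }
static unsigned cfg_gain(uint8_t cfg)    { return (cfg >> 3) & 0x3; }
static unsigned cfg_on(uint8_t cfg)      { return (cfg >> 5) & 0x1; }
```

On register-based calling conventions this reduces three parameter transfers to one, which is the saving the table entry aims at.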
Some optimizations, such as static parameters, are sometimes counterproductive, in particular if
register-oriented parameter delivery is used. This is caused by the use of offset addresses at parameter
delivery instead of absolute addresses embedded in the optimized function. Consequently, the
introduced optimizations strongly depend on:

Processor and processor architecture
Type of parameter delivery (stack or register oriented)
Memory management (small, huge, size of pointers)
Objective of optimization (memory consumption, energy consumption, compact code, etc.)
Sensor network application
40.3.7.3 Development Process
Figure 40.16 shows the development process of sensor node software in SeNeTs. First, for each functional
block the components have to be identified and included in the project. During the design phase, the chosen
components are interconnected and configured depending on the developer's settings. Then, interface as well
as parameter optimization is performed. The final source code is generated, and logging components
can be included to monitor runtime behavior. The generated source code is compiled and the executable
is linked. During the evaluation phase, the created node application can be downloaded to the node
and executed. Considering the monitoring results, a new design cycle can be started to improve the project
settings. As a result of the design flow, optimized node application software is generated. The node
application now consists only of specially tailored parts needed by the specific application of the node.

Software components in a node can be linked together either statically or dynamically. Static linking
facilitates an optimization of interfaces between several components within a node. A dynamic
link process is used for components exchanged during runtime, for example, algorithms downloaded
from other nodes. This procedure results in system-wide interfaces with significant overhead and prevents
interface optimization.
40.4 Simulation, Emulation, and Test of Large-Scale
Sensor Networks
Applications and protocols for wireless sensor networks require novel programming techniques and new
approaches for validation and test of sensor network software. In practice, sensor nodes have to operate
FIGURE 40.16 Development process of node software.
in an unattended manner. A key factor of this operation is to separate unnecessary information from
important information as early as possible in order to avoid communication overhead. In contrast, during
the implementation and test phases, developers need to obtain as much information as possible from the
network. A test and validation environment for sensor network applications has to ensure this.
Consider a sensor network with thousands of sensor nodes. Furthermore, consider developing a data
fusion and aggregation algorithm that collects sensor information from nodes and transmits it to a few
base stations. During validation and test, developers often have to change application code, recompile,
and upload a new image onto the nodes. These updates often result in flooding of the network over the
wireless channel, which dissipates a lot of time and energy. Moreover, how could we ensure that
every node runs the most recent version of our application?
Pure simulation produces important insights. However, modeling the wireless channel is difficult.
Simulation tools often employ simplified propagation models in order to reduce the computational effort
for large-scale networks. Widely used simulation tools, such as NS2 [16], use simplified network protocol
stacks and do not simulate at bit level. Furthermore, code used in simulations often cannot be reused on
real sensor node hardware; why should developers implement applications and protocols twice?
In contrast to simulation, implementation on a target platform is often complicated. The targeted
hardware itself may still be in the development stage. Perhaps there are a few prototypes, but developers need
hundreds of them for realistic test conditions. Moreover, prototype hardware is very expensive and far
from the targeted 1 cent/node. Consequently, a software environment is required that combines the
scaling power of simulations with real application behavior. Moreover, the administration of the network
must not affect sensor network applications. In the following, three current software approaches are presented.
40.4.1 TOSSIM: A TinyOS Simulator
Fault analysis of distributed sensor networks or their particular components is quite expensive and time
consuming, especially when sensor networks consist of hundreds of nodes. For that purpose, a simulator
providing examination of several layers (e.g., communication layer, routing layer) is an efficient tool for
sensor application development.
TinyOS SIMulator (TOSSIM) is a simulator for wireless sensor networks based on the TinyOS
framework. As described in References 17 and 18, the objectives of TOSSIM are scalability, completeness,
fidelity, and bridging. Scalability means TOSSIM's ability to handle large sensor networks with many
nodes in a wide range of configurations. The reactive nature of sensor networks requires not only the
simulation of algorithms but also the simulation of complete sensor network applications. Therefore,
TOSSIM achieves completeness by covering as many system interactions as possible. TOSSIM is able to
simulate thousands of nodes running entire applications. The simulator's fidelity becomes important for
capturing subtle timing interactions on a sensor node and between nodes. A significant attribute is the
revealing of unanticipated events or interactions. Therefore, TOSSIM simulates the TinyOS network stack
down to bit level. Finally, TOSSIM bridges the gap between an academic algorithm simulation and a real
sensor network implementation by providing testing and verification of application code
that will run on real sensor node hardware. This avoids programming algorithms and applications twice,
once for simulation and once for deployment. The TOSSIM components are integrated into the standard
TinyOS compilation tool chain, which supports the direct compilation of unchanged TinyOS applications
into the TOSSIM framework.
Figure 40.17 shows a TinyOS application divided into hardware-independent and hardware-dependent
components. Depending on the target platform, the appropriate hardware-dependent modules are selected
in the compilation step. This permits an easy extension to new sensor node platforms. At the same
time, this is the interface to the TOSSIM framework. Compared with a native sensor node platform,
TOSSIM is a sensor node emulation platform supporting multiple sensor node instances running on
standard PC hardware. Additionally, the TOSSIM framework includes a discrete event queue, a small
number of reimplemented TinyOS hardware abstraction components, mechanisms for extensible radio
and Analog-to-Digital Converter (ADC) models, and communication services for external programs to
interact with a simulation.
The core of the simulator is the event queue. Because TinyOS utilizes an event-based scheduling
approach, the simulator is event driven too. TOSSIM translates hardware interrupts into discrete simulator
events. The simulator event queue emits all events that drive the execution of a TinyOS application. In
contrast to real hardware interrupts, events cannot be preempted by other events and therefore are not
nested.
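A discrete event queue of this kind can be sketched as follows (a hypothetical C illustration, not TOSSIM's actual implementation): events fire strictly in time order, and each handler runs to completion before the next event is emitted.

```c
/* Hypothetical sketch of a TOSSIM-style discrete event queue: events are
 * emitted in time order and run to completion (nonpreemptive), unlike
 * real hardware interrupts. */
#define MAX_EVENTS 64

struct sim_event {
    long time;                    /* simulated firing time            */
    void (*fire)(void *ctx);      /* translated "interrupt" handler   */
    void *ctx;
};

static struct sim_event queue[MAX_EVENTS];
static int n_events;

/* Sample handler: records the firing order for inspection. */
static long fired[MAX_EVENTS];
static int  n_fired;
static void record_fire(void *ctx) { fired[n_fired++] = *(long *)ctx; }

static void post_event(struct sim_event e)
{
    queue[n_events++] = e;        /* no bounds check in this sketch */
}

static void run_simulation(void)
{
    while (n_events > 0) {
        int next = 0;             /* pick the earliest pending event */
        for (int i = 1; i < n_events; i++)
            if (queue[i].time < queue[next].time)
                next = i;
        struct sim_event e = queue[next];
        queue[next] = queue[--n_events];
        e.fire(e.ctx);            /* runs to completion, never nested */
    }
}
```

A handler may post further events from within `fire`, which is how the simulated application keeps driving itself forward.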
The hardware emulation of sensor node components is performed by replacing a small number of
TinyOS hardware components. These include the ADC, the clock, the transmit strength variable
potentiometer, the EEPROM, the boot sequence component, and several components of the radio stack. This
enables simulations of a large number of sensor node configurations.
The communication services are the interface to PC applications driving, monitoring, and actuating
simulations by communicating with TOSSIM over TCP/IP. The communication protocol was designed
at an abstract level and enables developers to write their own systems that hook into TOSSIM. TinyViz is
an example of a TOSSIM visualization tool that illustrates the possibilities of TOSSIM's communication
services. It is a Java-based graphical user interface providing visual feedback on the simulation state and
control of running simulations, for example, modifying ADC readings and radio loss properties. A plug-in
interface for TinyViz allows developers to implement their own application-specific visualization and
control code.
TOSSIM does not model radio propagation, power draw, or energy consumption. TOSSIM's fidelity is
also limited in that interrupts are timed by the event queue and are nonpreemptive.
In conclusion, TOSSIM is an event-based simulation framework for TinyOS-based sensor networks.
The open-source framework and the communication services permit an easy adaptation or integration of
simulation models and the connection to application-specific simulation tools.
40.4.2 EmStar
EmStar is a software environment for developing and deploying applications for sensor networks consisting
of 32-bit embedded Microserver platforms [19,20]. EmStar consists of libraries, tools, and services.
FIGURE 40.17 Comparison of TinyOS and TOSSIM system architecture.
Libraries implement primitives for interprocess communication. Tools support simulation, emulation,
and visualization of sensor network applications. Services provide network functionality, sensing, and
synchronization. EmStar's target platforms are so-called Microservers, typically iPAQ or Crossbow Stargate
devices. EmStar does not support Berkeley Motes as a platform but can easily interoperate with
Motes. EmStar consists of various components. Table 40.2 gives the name and a short description of
each component. The last row of the table contains hypothetical, application-specific components; all
others are EmStar core components. Figure 40.18 illustrates the cooperation of EmStar components in a
sample application for environmental monitoring. The dark-gray boxes represent EmStar core modules.
Hypothetical application-specific modules are filled light-gray. The sample application collects data from
an audio sensor and tries to detect the position of an animal in collaboration with neighboring
sensor nodes.
40.4.2.1 EmStar Tools and Services
EmStar provides tools for the simulation, emulation, and visualization of a sensor network and its
operation. EmSim runs virtual sensor nodes in a pure simulation environment, modeling both
radio and sensor channels. EmCee runs the EmSim core but uses real radios instead of modeled
channels. Both EmSim and EmCee use the same EmStar source code and associated configuration files as
TABLE 40.2 EmStar Components

Emrun: Management and watchdog process (responsible for start-up, monitoring, and shut-down of
EmStar modules).
Emproxy: Gateway to a debugging and visualization system.
udpd, linkstats, neighbors, MicroDiffusion: Network protocol stack for wireless connections.
timehist, syncd, audiod: Audio sampling service.
FFT, detect, collab_detect: Hypothetical modules, responsible for Fast Fourier Transformation and
(collaborative) event detection.

FIGURE 40.18 EmStar sample application consisting of EmStar core modules (dark-gray) and hypothetical
application-specific modules (light-gray).
a real deployed EmStar system. This eases the development and debugging of sensor network applications.
EmView is a visualization tool for EmStar systems that uses a UDP protocol to request status updates
from sensor nodes. In order to obtain sensor node or network information, EmView queries an EmProxy
server that runs as part of a simulation or on a real node. EmRun starts, stops, and manages an EmStar
system. EmRun supports process respawning, in-memory logging, fast startup, and graceful shutdown.
EmStar services comprise link and neighborhood estimation, time synchronization, and routing. The
Neighborhood service monitors links and maintains lists of active, reliable nodes. EmStar applications can
use these lists to be informed about topology changes. The LinkStats service provides applications with
more detailed information about link reliability than the Neighborhood service, at the cost of more
packet overhead. Multipath-routing algorithms can benefit from the LinkStats service by weighting their
path choices with LinkStats information. The TimeSync service is used to convert timestamps between
different nodes. Additionally, EmStar supports several routing protocols, but allows the integration of
new routing protocols as well.
40.4.2.2 EmStar IPC Mechanism
Communication between EmStar modules is managed by so-called FUSD-driven devices (FUSD:
Framework for User-Space Devices), a microkernel-style extension to Linux. FUSD allows device-file callbacks
to be proxied into user space and implemented by user-space programs instead of kernel code. Besides
intermodule communication, FUSD allows interaction between EmStar modules and users. FUSD drivers are
implemented in user space but can create device files with the same semantics as kernel-implemented
device files. Applications can use FUSD-driven devices to transport data or expose state.
Several device patterns exist for EmStar systems that are frequently needed in sensor network
applications. Example device patterns comprise a status device pattern exposing the current state of a module,
a packet device pattern providing a queued multiclient packet interface, a command device pattern that
modifies configuration files and triggers actions, and a query device pattern implementing a transactional
RPC mechanism.
40.4.3 Sensor Network Application (SNA) Test and Validation Environment
In SeNeTs, sensor network applications (SNAs) run distributed on independent hosts such as PCs, PDAs,
or evaluation boards of embedded devices [21]. The parallel execution decouples the applications from the
simulation environment. The quasi-parallel, sequential processing of concurrently triggered events in
simulations is disadvantageous compared with real-world programs: it results in sequenced execution of
SNAs that actually work in parallel and thus in corrupted simulation output. SeNeTs prevents this effect.
To summarize, realistic simulations of sensor networks are complicated.
40.4.3.1 System Architecture
The development and particularly the validation of distributed applications are hard to realize. In
particular, systems with additional logging and controlling facilities affect the primary behavior of
applications. Suppose a logging message is transmitted; an application message may then be delayed.
Especially in wireless applications with limited channel capacity, the increased communication leads to
a modified timing behavior and, as a consequence, to different results. The channel capacity per node
scales as 1/n, where n is the number of nodes. Due to this degrading channel capacity in large sensor
networks, the transport medium acts as a bottleneck [22]. Thus, in wireless sensor networks with
thousands of nodes, the bottleneck effect becomes dominant.
To eliminate the bottleneck effect, SeNeTs contains two independent communication channels, as
illustrated in Figure 40.19. The primary communication channel is defined by the sensor network application.
FIGURE 40.19 Communication channels in SeNeTs.
2006 by Taylor & Francis Group, LLC
40-22 Embedded Systems Handbook
FIGURE 40.20 SeNeTs components using the secondary transmission channel.
It uses the communication method required by SNAs, for example, Bluetooth or ZigBee. The secondary
communication channel is an administration channel only used by SeNeTs components. This channel
transmits controlling and logging messages. It is independent of the primary communication channel and
uses a different communication method, for example, Ethernet or ultrasound. The separation into two
communication channels simplifies the decoupling of application modules and administration modules
after testing.
The parallel execution of applications on different host systems requires a cascaded infrastructure
to administer the network. Figure 40.20 displays the important modules in SeNeTs: node applications,
application servers (ASs), a network server (NS), and optional evaluation or visualization modules. All of
these modules are connected via the secondary transmission channel.
40.4.3.2 Network Server
The NS administers sensor networks and their associated sensor nodes. The NS starts, stops, or queries SNAs.
In an SeNeTs network, exactly one NS exists; however, this NS is able to manage several sensor networks
simultaneously. Usually, the NS runs as a service of the operating system.
An NS opens additional communication ports. External programs, such as scripts, websites, or telnet
clients, can connect to these ports to send commands. These commands may be addressed and forwarded
to groups or stand-alone components. Furthermore, the NS receives logging messages from applications
containing their current state. Optional components, such as graphical user interfaces, can install callbacks
to receive this information.
40.4.3.3 Application Server
The AS manages instances of node applications on one host (Figure 40.20). It acts as a bridge between node
applications and the NS. Usually, at least one AS exists within the SeNeTs network. Ideally, only one node
application should be installed on an AS to prevent quasi-parallel effects during runtime.
The AS runs independently of the NS. It connects to the NS via a pipe to receive commands. Each
command is multiplexed to one of the connected node applications. Moreover, if the pipe to the NS
breaks, node applications are not affected beyond losing the logging and controlling facilities. Later, the
NS can establish the pipe again.
Generally, an AS starts as a service together with the host's operating system. At startup, it requires
configuration parameters of the node's hardware. With these parameters, the AS assigns hardware to node
applications. Suppose a host system comprises two devices representing sensor nodes, as shown
schematically in Figure 40.20. Then the AS requires the device number, the physical position of the node,
etc., to configure the dynamically installed node applications at runtime.
40.4.3.4 SeNeTs Application
Applications for wireless sensor nodes are usually designed based on a layered software model as depicted
in Figure 40.21(a) [15]. On top of the node's hardware, a specialized operating system is set up, such as
Software Development 40-23
FIGURE 40.21 (a) Software layer model of a sensor node application, (b) software layer model of a SeNeTs
application.
TinyOS [23]. A sensor driver contains software to initialize the measurement process and to obtain sensor
data. Above the operating system and the sensor driver, middleware components are located, containing
services to aggregate data or to determine the node's position. This modular design allows:
Abstraction of hardware, for example, sensors, communication devices, memory, etc.
Adaptation of the node's operating system
Addition of optional components, for example, logging and configuration
The SeNeTs Adaptation is a set of components which are added or exchanged to wrap the SNA.
Figure 40.21(b) represents the SeNeTs Adaptation layer consisting of at least a logging component, a con-
trolling unit, a HAL, and an optional environment encapsulation module. These additional components
provide substantial and realistic test and controlling facilities.
An application composed of an SNA and SeNeTs Adaptation components is called a SeNeTs Application
(SeA). The SNA is not changed by the added components. Generally, it is not necessary to adapt the SNA to
SeNeTs interfaces; however, supplementary macros can be added to interact with the linked components.
An SeA runs as a process of the host system. Because an SNA comes with its own operating system,
the SeA runs autonomously, without interacting with other processes of the host system. At startup,
the SeA opens a pipe to communicate with the AS. After the test phase, all SeNeTs components can be
removed easily by recompiling all node applications: SeNeTs-specific components and logging calls
are automatically deactivated by compiler switches.
40.4.3.5 Environment Management
Sensor network applications require valid environment data, such as temperature or air pressure. Under
laboratory conditions, this information is not available, or only partly so. Therefore, environment data must
be emulated. SeNeTs provides these environment data to the node application through the environment
emulation module (Figure 40.21[b]). All environment emulation modules are controlled by the environment
management of the NS, which contains all predefined or configured data (Figure 40.22). These data comprise
positions of other nodes, distances to neighboring nodes, etc. If required, other data types may be added.
In the AS, the environment data cache module stores all environment information required by each node
application in order to reduce network traffic.
Optionally, position-based filtering is provided by the environment emulation component of SeNeTs.
This filtering approach is essential especially when large topologies of sensor nodes have to be emulated
under small-sized laboratory conditions. If the real and virtual positions of the nodes are known, a mapping
from physical address to virtual address is feasible. A node application then only receives messages from nodes that
FIGURE 40.22 Environment management in SeNeTs.
FIGURE 40.23 (a) Physically arranged sensor nodes (black dots). All nodes are in transmission range (surrounding
circles) of each other. (b) Virtually arranged nodes with appropriate transmission ranges. Nodes are no longer able to
communicate without routing.
are virtually in transmission range. All other messages are rejected by the SeNeTs Adaptation components.
This is accomplished by setting up a filter in the primary communication channel.
One application scenario that illustrates position-based filtering is flood prevention. Here, sensor
nodes are deployed in sandbags piled along a dyke of hundreds of meters or even kilometers. These nodes
measure the humidity and detect potential leakages. Testing this scenario under real-world conditions
is impractical and very expensive. Nevertheless, evaluating the software under realistic conditions with
regard to communication effort, self-organization of the network, routing, and data aggregation is most
important.
Figure 40.23 illustrates the difference between the laboratory and the real world. Figure 40.23(a) represents
laboratory conditions, where all nodes are in transmission range of each other. Figure 40.23(b) sketches the
flood prevention scenario under real conditions. In Figure 40.23(a), the nodes A to D are in transmission
range of each other; therefore, in contrast to the real-world scenario, no routing is required. Furthermore,
data aggregation yields wrong results, because the nodes are not grouped as they would be in reality. Thus,
if the physically arranged nodes in the test environment do not meet the requirements of the real world,
the results are questionable.
Assume node A sends a message to node D; then all nodes receive the message due to the physical
vicinity in the test environment (Figure 40.23[a]). Nodes C and D receive the message, but they are not in
the virtual transmission range of node A, so the environment emulation module rejects these messages.
As a result, SeNeTs prevents a direct transmission from node A to node D; messages can be transmitted
only via the routing nodes B and C (Figure 40.23[b]). In short, the emulation of the sensor network
software becomes more realistic.
40.5 Summary
At the present time, TinyOS is the most mature operating system framework for sensor nodes. The
component-based architecture of TinyOS allows an easy composition of SNAs. New components can be
added easily to TinyOS to support novel sensing or transmission technologies or upcoming sensor node
platforms. Maté addresses the requirement to change a sensor node's behavior at runtime by introducing
a VM on top of TinyOS. By transmitting capsules containing high-level instructions, a wide range of SNAs
can be installed dynamically into a deployed sensor network. TinyDB was developed to simplify data
querying from sensor networks. On top of TinyOS, it provides an easy-to-use SQL interface to express
data queries and addresses users not experienced in writing embedded C code for sensor nodes. TOSSIM
is a simulator for wireless sensor networks based on the TinyOS framework.
EnviroTrack is an object-based programming model to develop sensor network applications for tracking
activities in the physical environment. Its main feature is the dynamic grouping of nodes depending on
environmental changes, described by predefined aggregate functions, critical mass, and freshness horizon.
SensorWare is a software framework for sensor networks employing lightweight and mobile control scripts
that allow the dynamic deployment of distributed algorithms into a sensor network. In comparison with
the Maté framework, the SensorWare runtime environment supports multiple applications running
concurrently on one SensorWare node. The MiLAN middleware provides a framework to optimize network
performance by weighing the needed sensing probability against energy costs on the basis of equations;
it is the programmer's decision how to weight these equations. EmStar is a software environment for
developing and deploying applications for sensor networks consisting of 32-bit embedded Microserver
platforms. SeNeTs is a new approach to optimize the interfaces of sensor network middleware. SeNeTs
aims at the development of energy-saving applications and the resolution of component dependencies at
compile time.
References
[1] G.J. Pottie and W.J. Kaiser, Wireless integrated network sensors, Communications of the ACM, 43,
51–58, 2000.
[2] J.M. Kahn, R.H. Katz, and K.S.J. Pister, Next century challenges: mobile networking for smart dust,
in Proceedings of the ACM MobiCom'99, Washington, USA, 1999, pp. 271–278.
[3] D. Culler, E. Brewer, and D. Wagner, A platform for WEbS (wireless embedded sensor actuator
systems), Technical report, University of California, Berkeley, 2001.
[4] J. Rabaey et al., PicoRadio supports ad hoc ultra-low power wireless networking, IEEE Computer,
33(7), 42–48, 2000.
[5] EYES: Energy-efficient sensor networks, URL: http://eyes.eu.org
[6] I.F. Akyildiz, W. Su, Y. Sankarasubramaniam, and E. Cayirci, A survey on sensor networks, IEEE
Communications Magazine, 40(8), 102–114, 2002.
[7] P. Rentala, R. Musunuri, S. Gandham, and U. Saxena, Survey on sensor networks, Technical report
UTDCS-10-03, University of Texas, 2003.
[8] J. Hill et al., System architecture directions for networked sensors, in Proceedings of the Ninth Inter-
national Conference on Architectural Support for Programming Languages and Operating Systems,
Cambridge, MA, USA, November 2000.
[9] D. Gay, P. Levis, R.V. Behren, M. Welsh, E. Brewer, and D. Culler, The nesC language: a holistic
approach to networked embedded systems, in Proceedings of the Conference on Programming
Language Design and Implementation (PLDI), San Diego, CA, June 2003.
[10] D. Culler, TinyOS: a component-based OS for the networked sensor regime, URL:
http://webs.cs.berkeley.edu/tos/, 2003.
[11] S. Madden, J. Hellerstein, and W. Hong, TinyDB: in-network query processing in TinyOS, Intel
Research, IRB-TR-02-014, October 2002.
[12] A. Boulis and M.B. Srivastava, A framework for efficient and programmable sensor networks,
in Proceedings of the Fifth IEEE Conference on Open Architectures and Network Programming
(OPENARCH 2002), New York, June 2002.
[13] A. Murphy and W. Heinzelman, MiLAN: middleware linking applications and networks,
Technical report, University of Rochester, Computer Science Department, URL: http://
hdl.handle.net/1802/305, January 2003.
[14] M. Perillo and W. Heinzelman, Providing application QoS through intelligent sensor management,
in Proceedings of the First IEEE International Workshop on Sensor Network Protocols and Applications
(SNPA'03), Anchorage, AK, USA, May 2003.
[15] J. Blumenthal, M. Handy, F. Golatowski, M. Haase, and D. Timmermann, Wireless sensor
networks: new challenges in software engineering, in Proceedings of the Ninth IEEE Inter-
national Conference on Emerging Technologies and Factory Automation (ETFA), Lisbon, Portugal,
September 2003.
[16] The Network Simulator ns-2, http://www.isi.edu/nsnam/ns
[17] P. Levis et al., TOSSIM: accurate and scalable simulation of entire TinyOS applications, in
Proceedings of the First ACM Conference on Embedded Networked Sensor Systems (SenSys 2003),
Los Angeles, November 2003.
[18] TOSSIM: A Simulator for TinyOS Networks, User's Manual, in TinyOS documentation.
[19] L. Girod, J. Elson, A. Cerpa, T. Stathopoulos, N. Ramanathan, and D. Estrin, EmStar: a software
environment for developing and deploying wireless sensor networks, in Proceedings of USENIX '04,
Boston, June 2004.
[20] EmStar: software for wireless sensor networks, URL: http://cvs.cens.ucla.edu/emstar/, 2004.
[21] J. Blumenthal, M. Handy, and D. Timmermann, SeNeTs: test and validation environment for
applications in large-scale wireless sensor networks, in Proceedings of the Second IEEE International
Conference on Industrial Informatics (INDIN'04), Berlin, June 2004.
[22] J. Li, C. Blake, D.S.J. De Couto, H.I. Lee, and R. Morris, Capacity of ad hoc wireless networks, in
Proceedings of MobiCom, Rome, July 2001.
[23] Berkeley WEBS: TinyOS, http://today.cs.berkeley.edu/tos/, 2004.
[24] P. Levis and D. Culler, Maté: a tiny virtual machine for sensor networks, in Proceedings of the ACM
Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS),
San Jose, California, USA, October 2002.
[25] T. Abdelzaher, B. Blum et al., EnviroTrack: an environmental programming model for tracking
applications in distributed sensor networks, Technical report CS-2003-02, University of Virginia,
2003.
VI
Embedded Applications
Automotive Networks
41 Design and Validation Process of In-Vehicle Embedded Electronic Systems
Françoise Simonot-Lion and YeQiong Song
42 Fault-Tolerant Services for Safe In-Car Embedded Systems
Nicolas Navet and Françoise Simonot-Lion
43 Volcano: Enabling Correctness by Design
Antal Rajnák
41
Design and Validation
Process of In-Vehicle
Embedded Electronic
Systems
Françoise Simonot-Lion
Institut National Polytechnique de
Lorraine
YeQiong Song
Université Henri Poincaré
41.1 In-Vehicle Embedded Applications: Characteristics
and Specific Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41-1
Economic and Social Context • Several Domains and Specific
Problems • Automotive Technological Standards •
A Cooperative Development Process
41.2 Abstraction Levels for In-Vehicle Embedded System
Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41-8
Architecture Description Languages • EAST-ADL for
In-Vehicle Embedded System Modeling
41.3 Validation and Verification Techniques . . . . . . . . . . . . . . . . 41-10
General View of Validation Techniques • Validation by
Performance Evaluation
41.4 Conclusions and Future Trends . . . . . . . . . . . . . . . . . . . . . . . . . 41-20
41.5 Appendix: In-Vehicle Electronic System
Development Projects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41-21
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41-22
41.1 In-Vehicle Embedded Applications: Characteristics and Specific Constraints
41.1.1 Economic and Social Context
While automobile production is likely to increase only slowly in the coming years (42 million cars produced
in 1999 and 60 million planned for 2010), the share of embedded electronics, and more precisely
embedded software, is growing. The cost of electronic systems was $37 billion in 1995 and $60 billion in
2000, with an annual growth rate of 10%. In 2006, the embedded electronic system will represent at least
25% of the total cost of a car, and more than 35% for a high-end model [1].
The reasons for this evolution are technological as well as economic. On the one hand, the cost of
hardware components is decreasing while their performance and reliability are increasing. The emergence
of automotive embedded networks such as LIN, CAN, TTP/C, FlexRay, MOST, and IDB-1394 leads to
a significant reduction of the wiring cost as well. On the other hand, software technology facilitates the
introduction of new functions whose development would be costly, or even not feasible, using only
mechanical or hydraulic technology, and therefore allows satisfying the end-user requirements in terms
of safety and comfort. Well-known examples are electronic engine control, ABS, ESP, active suspension,
etc. In short, thanks to these technologies, customers can buy a safe, efficient, and personalized
vehicle, while the carmakers are able to master the differentiation of product variants and the innovation
(analysts state that more than 80% of innovation, and therefore of added value, will be obtained thanks
to electronic systems [2]). Another new factor is emerging. A vehicle already includes electronic
equipment such as hands-free phones, audio/radio devices, and navigation systems. For the passengers,
many entertainment devices, such as video equipment, and communication with the outside world will be
available in the very near future. Even if these kinds of applications have little to do with the vehicle's
operation itself, they increase significantly the amount of software embedded in a car.
Who is concerned by this evolution? First, the vehicle customer, whose requirements are, on the one hand,
increased performance, comfort, and assistance for mobility efficiency (navigation) and, on the other hand,
reduced vehicle fuel consumption and cost; furthermore, the customer requires a reliable embedded
electronic system that ensures safety properties. Second, the stakeholders, carmakers, and suppliers,
who are interested in the reduction of time-to-market, development, production, and maintenance costs.
Finally, this evolution has a strong impact on society: legal restrictions on exhaust emissions and
protection of natural resources and the environment.
The electronic systems presented above do not all have to meet the same level of dependability, so their
designs call for different techniques. Nevertheless, common characteristics are their distributed nature
and the fact that they have to provide a level of quality of service fixed by the market and by the safety
and cost requirements. Therefore, their development and production have to be based on a suitable
methodology, including modeling, validation, optimization, and test.
41.1.2 Several Domains and Specific Problems
In-vehicle embedded systems are usually classified into four domains that correspond to different
functionalities, constraints, and models [3,4]. Two of them are concerned specifically with safety: the
power train and the chassis domains. The third one, body, is emerging and presently integrated in a
majority of cars. Finally, the telematic, multimedia, and Human Machine Interface domain benefits from
continuous progress in the fields of multimedia, wireless communications, and the Internet.
41.1.2.1 Power Train
This domain represents the system that controls the motor according to, on the one hand, requests of
the driver, which can be explicit orders (speeding up, slowing down, etc.) or implicit constraints (driving
facilities, driving comfort, fuel consumption, etc.) and, on the other hand, environmental constraints
(exhaust pollution, noise, etc.). Moreover, this control has to take into account requirements from other
parts of the embedded system, such as climate control or ESP (Electronic Stability Program).
In this domain, the main characteristics are:
From a functional point of view: the power train control takes into account different working modes
of the motor (slow running, partial load, full load, etc.); this corresponds to different and complex
control laws (multivariable) with different sampling periods (classical sampling periods for signals
provided by other systems are 1, 2, or 5 msec, while the sampling of signals on the motor itself is in
phase with the motor times).
From a hardware point of view: this domain requires sensors whose specification has to consider
the minimization of the cost/resolution criterion, and microcontrollers providing high computation
power, thanks to their multiprocessor architectures, dedicated coprocessors (floating-point
computations), and high storage capacities.
Design and Validation Process 41-3
From an implementation point of view: the specified functions are implemented as several tasks with
different activation rules according to the sampling rules, stringent time constraints imposed on task
scheduling, and the mastering of safe communications with other systems and with local sensors/actuators.
In this domain, the systems involve continuous, sampled, and discrete subsystems. Traditional tools
for their functional design and modeling are, for example, Matlab/Simulink and Matlab/Stateflow.
Currently, the validation of these systems is mainly done by simulation and, for their integration, by
emulation methods and/or tests. Last, as illustrated above, the power train domain includes hard
real-time systems, so performance evaluation and timing analysis activities have to be carried out on
their implementation models.
41.1.2.2 Chassis
The chassis domain gathers all the systems that control the interaction of the vehicle with the road and the
chassis components (wheels, suspension, etc.) according to the requests of the driver (steering, braking, or
speed-up orders), the road profile, and the environmental conditions (wind, etc.). These systems have to
ensure the comfort of the driver and passengers (suspension) as well as their safety. This domain includes
systems such as ABS (Anti-lock Braking System), ESP (Electronic Stability Program), ASC (Automatic
Stability Control), and 4WD (4 Wheel Drive). Note that chassis is the critical domain contributing to the
safety of the passengers and of the vehicle itself. Furthermore, X-by-Wire technology, currently applied in
avionic systems, is emerging in the automotive industry. X-by-Wire is a generic term used when mechanical
and/or hydraulic systems are replaced by electronic ones (intelligent devices, networks, computers supporting
software components that implement filtering, control, and diagnosis functionalities). Examples are
brake-by-wire and steer-by-wire, which will shortly be integrated in cars for the implementation of critical
and safety-relevant functions. The characteristics of the chassis domain and the underlying models are
similar to those presented for the power train domain, that is, multivariable control laws, different sampling
periods, and stringent time constraints. Compared with the power train domain, systems controlling chassis
components are fully distributed. Therefore, the development of such systems must define a feasible
system, that is, one satisfying performance, dependability, and safety constraints. Conventional mechanical
and hydraulic systems have stood the test of time and have proved to be reliable; the same cannot yet be
said of critical software-based systems. In the aerospace/avionic industries, X-by-Wire technology is
currently employed; but, for ensuring safety properties, specific hardware and software components,
specific fault-tolerant solutions (heavy and costly redundancies of networks, sensors, and computers),
and certified design and validation methods are used. The challenge now is to adapt these solutions to
the automotive industry, which imposes stringent constraints on component cost, electronic architecture
cost (minimization of redundancies), and development time.
41.1.2.3 Body
Wipers, lights, doors, windows, seats, and mirrors are increasingly controlled by software-based systems.
These kinds of functions make up the body domain. They are not subject to stringent performance
constraints but globally involve many communications between them and, consequently, a complex
distributed architecture. The notion of a subsystem or subcluster is emerging, based on low-cost
sensor-actuator level networks such as LIN, which connect modules realized as integrated mechatronic
systems. On the other side, the body domain integrates a central subsystem, termed the central body
electronic, whose main functionality is to ensure message transfers between different systems or domains.
This system is recognized to be a central critical entity.
The body domain mainly involves discrete-event applications. Their design and validation rely on state
transition models (such as SDL, Statecharts, UML state transition diagrams, and synchronous models).
These models allow, mainly by simulation, the validation of a functional specification. Their implementation
implies a distribution over a complex hierarchical hardware architecture. High computation power for the
central body electronic entity, fault tolerance, and reliability properties are imposed on body domain
systems. A challenge in this context is, first, to be able to develop exhaustive analysis of state transition
diagrams and, second, to ensure that the implementation respects the fault tolerance and safety constraints.
The problem here is to achieve a good balance between a time-triggered approach and flexibility.
41.1.2.4 Telematic and Human Machine Interface
The next generation of telematic devices provides new, sophisticated Human Machine Interfaces (HMI)
to the driver and the other occupants of a vehicle. They enable the occupants not only to communicate
with other systems inside the vehicle but also to exchange information with the external world. Such
devices will be upgradeable in the future, and for this domain a plug-and-play approach has to be favored.
These applications have to be portable, and the services furnished by the platform (operating system
and/or middleware) have to offer generic interfaces and downloading facilities. The main challenge here
is to preserve the security of the information from, to, or inside the vehicle. Sizing and validation do not
rely on the same methods as for the other domains: here we shift from considering messages, tasks, and
deadline constraints to fluid data streams, bandwidth sharing, and multimedia quality of service, and
from safety and hard real-time constraints to security of information and soft real-time constraints.
Note that, even if this domain is more related to entertainment activities, some interactions exist with
other domains. For example, the telematic framework offers a support for future remote diagnostic
services. In particular, the standard OBD-3, currently under development, extends OBD-2 (Enhanced
On-Board Diagnosis) by adding telemetry. Like its predecessor, it defines the protocol for collecting
measures on the power train physical equipment and alerting the driver if necessary, as well as a protocol
for the exchanges with a scan tool. Thanks to a technology similar to that already used for automatic
electronic toll collection systems, an OBD-3-equipped vehicle would be able to report the vehicle
identification number and any emission problems directly to a regulatory agency.
41.1.3 Automotive Technological Standards
A way of ensuring some level of interoperability between components developed by different partners is
brought, first, by the standardization of services sharing the hardware resources between the application
processes. For this reason, in the current section, we provide an outline of the main standards used in the
automotive industry, in particular the networks and their protocols and the operating systems. Then, we
introduce some work in progress on the definition of a middleware that will be a solution for portability
and flexibility purposes.
41.1.3.1 Networks and Protocols
Due to the stringent cost, real-time, and reliability constraints, specific communication protocols and
networks have been developed to fulfill the needs of ECU (Electronic Control Unit) multiplexing.
SAE has defined three distinct protocol classes, named class A, B, and C. A class A protocol is defined for
interconnecting actuators and sensors with a low bit rate (about 10 Kbps); an example is LIN. A class B
protocol supports a data rate as high as 100 Kbps and is designed for supporting non-real-time control
and inter-ECU communication; J1850 and low-speed CAN are examples of SAE class B protocols.
A class C protocol is designed for supporting real-time and critical applications; networks like high-speed
CAN and TTP/C belong to class C, and support data rates as high as one or several megabits per second.
This section outlines the best known of them.
41.1.3.1.1 Controller Area Network
Controller Area Network (CAN) [5,6] is without any doubt the most widely used in-vehicle network. CAN
was initially designed by the Robert Bosch company at the beginning of the 1980s for multiplexing the
increasing number of ECUs in a car. It became an ISO standard in 1994 and is now a de facto standard for
data transmission in automotive applications due to its low cost, robustness, and bounded communication
delay. CAN is mainly used in the power train, chassis, and body domains. Further information on
CAN-related protocols and developments, including TTCAN, can be found at http://can-cia.org/.
Controller Area Network is a priority-based bus that provides a bounded communication delay
for each message priority. The MAC (Medium Access Control) protocol of CAN uses CSMA with bit-by-bit
TABLE 41.1 CAN and VAN Frame Format
Bit
SOF
1
ID
11/29 1 2
RTR
(Reserved)
4
DLC
064
Data
16
CRC
2
ACK
7
EOF
3
IFS
10 15 2 2 8 4 Time slot
12 Bit
SOF ID EOD ACK EOF IFG VAN
CAN
(0 28)10
(0 28)8
Data
5
4
Command
15 +3
15
CRC
nondestructive arbitration over the ID field (Identifier). The identifier is coded using 11 bits (CAN 2.0A) or
29 bits (CAN 2.0B), and it also serves as the priority. Up to 8 bytes of data can be carried by one CAN frame,
and a CRC of 16 bits is used for transmission error detection. CAN uses an NRZ bit encoding scheme to make
the bit-by-bit arbitration feasible with a logical AND operation. However, the use of the bit-wise arbitration
scheme intrinsically limits the bit rate of CAN, as the bit time must be long enough to cover the propagation
delay on the whole network. A maximum of 1 Mbps is specified for a CAN bus not exceeding 40 m.
The maximum message transmission time should include the worst-case number of stuff bits
(CAN 2.0A). This length is given by:

C_i = (44 + 8·DLC + ⌊(34 + 8·DLC)/4⌋) · τ_bit    (41.1)

where DLC is the data length in bytes and τ_bit the bit time; the floor term represents the overhead due
to bit stuffing, a technique implemented by CAN for bit synchronization, which consists in inserting
an opposite bit every time five consecutive bits of the same polarity are encountered.
The frame format is given in Table 41.1. We will not detail the field meanings here; note, however, that the
Inter Frame Space (IFS) has to be considered when calculating the bus occupation time of a CAN message.
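As a check on Equation 41.1, the worst-case frame length and the 3-bit IFS that must be added for bus occupation can be sketched as follows (the function names are ours, not part of the CAN specification):

```python
def can_frame_bits(dlc: int) -> int:
    """Worst-case length in bits of a CAN 2.0A frame carrying `dlc` data
    bytes (Equation 41.1): 44 fixed bits, 8*DLC data bits, and up to
    (34 + 8*DLC) // 4 stuff bits."""
    return 44 + 8 * dlc + (34 + 8 * dlc) // 4

def can_bus_occupation_s(dlc: int, bit_rate_bps: float) -> float:
    """Bus occupation time in seconds: the frame itself plus the 3-bit
    Inter Frame Space separating consecutive frames."""
    return (can_frame_bits(dlc) + 3) / bit_rate_bps
```

For an 8-byte frame this gives 132 frame bits, hence 135 bits of bus occupation, that is, 540 µs at the 250 Kbps rate used in the case study of Section 41.3.2.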
41.1.3.1.2 Vehicle Area Network
Vehicle Area Network (VAN) [7,8] is quite similar to CAN. It was used by the French carmaker PSA
Peugeot-Citroën for the body domain. Although VAN has some technical features that are more interesting than
CAN's, it was not widely adopted by the market and has now been abandoned in favor of CAN. Its MAC
protocol is also CSMA with bit-by-bit nondestructive arbitration over the ID field (Identifier), coded with
12 bits. Up to 28 bytes of data can be carried by one VAN frame and a CRC of 15 bits is used. The bit rate
can reach 1 Mbps. One of the main differences between CAN and VAN is that CAN uses NRZ code while
VAN uses a so-called E-Manchester (Enhanced Manchester) code: a binary sequence is divided into blocks
of 4 bits and the first three bits are encoded using NRZ code (whose duration is defined as one Time Slot
per bit) while the fourth one is encoded using Manchester code (two Time Slots per bit). This means that
4 bits of data are encoded using 5 Time Slots (TS). Thanks to E-Manchester coding, VAN, unlike CAN,
does not need bit stuffing for bit synchronization. This coding is sometimes denoted 4B/5B.
The format of a VAN frame is given in Table 41.1. The transmission duration
(or equivalent frame length) of a VAN frame is given by:

C_i = (60 + 10·DLC) · TS    (41.2)
Note, however, that the Inter Frame Gap (IFG), fixed at 4 TS, has to be considered when calculating the
total bus occupation time of a VAN message. Finally, VAN has one feature that is not present in CAN:
the in-frame response capability. The same single frame can include the remote message request of the
consumer (identifier and command fields) and the immediate response of the producer (data and CRC
fields).
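Equation 41.2 and the 4-TS IFG translate directly into code (a sketch with names of our choosing; at the 62.5 kTS/s rate used later in the chapter, one time slot lasts 16 µs):

```python
def van_frame_ts(dlc: int) -> int:
    """Length of a VAN frame in time slots (Equation 41.2): 60 fixed
    time slots plus 10 per data byte (8 data bits occupy 10 TS under the
    E-Manchester 4B/5B coding)."""
    return 60 + 10 * dlc

def van_bus_occupation_s(dlc: int, ts_rate: float) -> float:
    """Total bus occupation in seconds: the frame plus the 4-TS
    Inter Frame Gap."""
    return (van_frame_ts(dlc) + 4) / ts_rate
```

A maximum-length frame (28 bytes) thus occupies 340 TS, or 344 TS once the IFG is counted.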
41.1.3.1.3 J1850
SAE J1850 [9] was developed in North America and has been used by carmakers such as Ford, GM, and
DaimlerChrysler. The MAC protocol follows the same principle as CAN and VAN, that is, it uses CSMA
with bit-by-bit arbitration for collision resolution. J1850 supports two data rates: 41.6 Kbps for PWM
(Pulse Width Modulation) and 10.4 Kbps for VPW (Variable Pulse Width). The maximum data length
is 11 bytes. The typical applications are SAE class B ones such as instrumentation/diagnostics and data
sharing among the engine, transmission, and ABS.
41.1.3.1.4 TTP/C
The Time-Triggered Protocol (TTP/C) [10] has been developed at the Vienna University of Technology.
Hardware implementations of the TTP/C protocol, as well as software tools for the design of
applications, are commercialized by TTTech (www.tttech.com).
At the MAC layer, the TTP/C protocol implements a synchronous TDMA scheme: the stations
(or nodes) access the bus in a strict deterministic sequential order. Each station possesses the bus
for a constant duration, called a slot, during which it has to transmit one frame. The sequence of
slots in which every station accesses the bus once is called a TDMA round.
TTP/C is suitable for SAE class C applications, with a strong emphasis on fault tolerance and deterministic
real-time behavior. It is now one of the two candidates for X-by-Wire applications. The bit rate is not
limited by the TTP/C specification. Today's available controllers (TTP/C C2 chips) support data rates as high
as 5 Mbps in asynchronous mode and 5 to 25 Mbps in synchronous mode.
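The TDMA round described above can be sketched as a static schedule; the station names and slot length below are illustrative, not values from the TTP/C specification:

```python
def tdma_round(stations, slot_ms):
    """Build one TDMA round: each station owns exactly one slot of
    constant length, in a strict deterministic sequential order.
    Returns (slot start time in ms, station) pairs."""
    return [(i * slot_ms, station) for i, station in enumerate(stations)]

# A four-node round with 2 ms slots: the round repeats every 8 ms.
round_ = tdma_round(["A", "B", "C", "D"], slot_ms=2)
round_length_ms = len(round_) * 2
```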
41.1.3.1.5 FlexRay
The FlexRay protocol (www.flexray.com) is currently being developed by a consortium of major companies
from the automotive field. The purpose of FlexRay is, like TTP/C, to provide X-by-Wire applications
with deterministic real-time and reliable communication. The specification of the FlexRay protocol
is, however, neither publicly available nor finalized at the time of writing of this chapter.
The FlexRay network is very flexible with regard to topology and transmission-support redundancy.
It can be configured as a bus, a star, or multiple stars, and it is not mandatory that each station possess
replicated channels, even though this should be the case for X-by-Wire applications.
At the MAC level, FlexRay defines a communication cycle as the concatenation of a time-triggered
(or static) window and an event-triggered (or dynamic) window. A different protocol applies to each
communication window, whose size is set at design time. The communication cycles are executed periodically.
The time-triggered window uses a TDMA protocol. In the event-triggered part of the communication
cycle, the protocol is FTDMA (Flexible Time Division Multiple Access): time is divided into so-called
mini-slots, each station possesses a given number of mini-slots (not necessarily consecutive), and it can
start the transmission of a frame inside each of its own mini-slots. A mini-slot remains idle if the station
has nothing to transmit.
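The FTDMA behavior of the dynamic window can be illustrated with a small sketch: a mini-slot either elapses unused (one mini-slot long) or stretches into a frame transmission. This representation is our simplification, not the FlexRay specification:

```python
def dynamic_window(minislot_owners, pending, frame_minislots):
    """Replay the event-triggered window: walk the mini-slots in their
    fixed order; an owner with a pending frame transmits (its slot
    stretches to the frame length), otherwise the mini-slot stays idle.
    Returns the timeline and the window length in mini-slots."""
    timeline, t = [], 0
    for owner in minislot_owners:
        if owner in pending:
            timeline.append((t, owner, "frame"))
            t += frame_minislots  # a transmission stretches the slot
        else:
            timeline.append((t, owner, "idle"))
            t += 1                # an unused mini-slot stays short
    return timeline, t
```

With owners A, B, and C, only B having a frame ready, and 3-mini-slot frames, the window lasts 5 mini-slots instead of the 9 a purely static scheme would reserve, which is precisely the bandwidth advantage of the dynamic window.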
41.1.3.1.6 Local Interconnect Network
Local Interconnect Network (LIN) (www.lin-subbus.org) is a low-cost serial communication system
intended for SAE class A applications, where the use of other automotive multiplex networks
such as CAN is too expensive. Typical applications are in the body domain, for controlling doors, windows,
seats, the roof, and the climate system.
Besides the cost consideration, LIN is also a subnetwork solution for reducing the total traffic load on the
main network (e.g., CAN) by building a hierarchical multiplex system. For this purpose, many gateways
exist, allowing, for example, a LIN subnet to be interconnected to CAN.
The LIN protocol is based on the master/slave model. A slave node must wait to be polled by the
master before transmitting data. The data length can be 1, 2, 4, or 8 bytes. A master can handle at most 15 slaves
(there are 16 identifiers per class of data length). LIN supports data rates up to 20 Kbps (limited for EMI reasons).
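The master/slave exchange can be sketched as a schedule-table loop; the identifiers, slave names, and table layout below are invented for illustration and are not taken from the LIN specification:

```python
def poll_cycle(schedule, slave_tables):
    """One communication cycle: the master issues a header for each
    identifier in its schedule table; the single slave publishing that
    identifier answers with its data (1, 2, 4, or 8 bytes). An entry
    with responder None models an identifier nobody answers."""
    bus_log = []
    for ident in schedule:
        responder = next(
            (s for s, table in slave_tables.items() if ident in table), None)
        data = slave_tables[responder][ident] if responder else None
        bus_log.append((ident, responder, data))
    return bus_log
```

The key point the sketch captures is that slaves never contend for the bus: all transmissions are triggered by the master's schedule, which is what makes LIN timing trivially predictable.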
41.1.3.1.7 Media Oriented System Transport
Media Oriented System Transport (MOST) (http://mostnet.de/) is a multimedia fiber-optic network
developed in 1998 by the MOST Cooperation (a kind of consortium composed of carmakers, set makers,
system architects, and key component suppliers). The basic application blocks supported by MOST are
audio and video transfer, on top of which end-user applications like radios, GPS navigation, video displays
and amplifiers, and entertainment systems can be built.
The MOST protocol defines data channels and control channels. The control channels are used to
set up the data channels that the sender and receiver use. Once the connection is established, data can
flow continuously, delivering streaming data (audio/video). The MOST network offers a data rate
of 24.8 Mbps.
41.1.3.1.8 IDB-1394
IDB-1394 is an automotive version of IEEE-1394 for in-vehicle multimedia and telematics applications,
jointly developed by the IDB Forum (www.idbforum.org) and the 1394 Trade Association
(www.1394ta.org). IDB-1394 defines a system architecture/topology that permits existing IEEE-1394
consumer electronics devices to interoperate with embedded automotive-grade devices. The system topology
consists of an automotive-grade embedded plastic optical fiber network including cable and connectors,
embedded network devices, one or more consumer convenience port interfaces, and the ability to attach
hot-pluggable portable devices.
The IDB-1394 embedded network supports data rates of 100, 200, and 400 Mbps.
The maximum number of embedded devices is limited to 63 nodes.
From the point of view of both data rate and interoperability with existing IEEE-1394 consumer electronics
devices, IDB-1394 is a serious competitor of the MOST technology.
41.1.3.2 Operating Systems
OSEK/VDX (Offene Systeme und deren Schnittstellen für die Elektronik im Kraftfahrzeug) [11] is a
multitask operating system that has become a standard in the European automotive industry. Two types of tasks are
supported by OSEK/VDX: basic tasks, without blocking points, and extended tasks, which can include blocking
points. This operating system does not allow the dynamic creation/destruction of tasks. It implements a
Fixed Priority (FP) scheduling policy combined with the Priority Ceiling Protocol (PCP) [12] to avoid priority
inversion or deadlock due to exclusive resource access. OSEK/VDX offers a synchronization mechanism
through private events and alarms. A task can be preemptive or nonpreemptive. An implementation of
OSEK/VDX has to be compliant with one of the four conformance classes BCC1, BCC2, ECC1, ECC2,
defined according to the supported tasks (basic only, or basic and extended), the number of tasks on
each priority level (only one, or possibly several), and the limit of the reactivation counter (only one, or
possibly several). The MODISTARC project (Methods and tools for the validation of OSEK/VDX-based
DISTributed ARChitectures) [13] aims to provide the relevant test methods and tools to assess the
conformance of OSEK/VDX implementations. OSEK/VDX Com and OSEK/VDX NM are complementary to
OSEK/VDX for communication and network management services. Furthermore, a language, OSEK/OIL
(OSEK Implementation Language), is a basis both for the configuration of an application and for the tuning
of the required operating system. In order to ensure dependability and fault tolerance for critical applications,
the time-triggered operating system OSEKtime [11] was proposed. It supports static scheduling and
offers interrupt handling, dispatching, system time and clock synchronization, local message handling,
and error detection mechanisms, and it provides predictability and dependability through fault detection and
fault tolerance mechanisms. It is compatible with OSEK/VDX and is complemented by the FTCom (Fault
Tolerant Communication) layer for communication services.
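PCP's key ingredient is the priority ceiling of each shared resource, the priority of the highest-priority task that may lock it; while a task holds the resource it is raised to that ceiling. A sketch of the ceiling computation (the task and resource names are hypothetical, and we follow this chapter's task-table convention that a smaller number denotes a higher priority):

```python
def priority_ceilings(resource_users, priority):
    """Ceiling of a resource = priority of the highest-priority task
    (smallest number here) among the tasks that may lock it. Running a
    lock holder at this ceiling prevents priority inversion and deadlock
    on exclusive resources."""
    return {res: min(priority[t] for t in users)
            for res, users in resource_users.items()}

# Hypothetical configuration: two tasks share an ADC driver, one owns the bus.
ceilings = priority_ceilings(
    {"ADC": ["T_sensor", "T_log"], "BUS": ["T_tx"]},
    {"T_sensor": 1, "T_log": 5, "T_tx": 3})
```

In OSEK/OIL this computation is done offline from the RESOURCE declarations, so the kernel only performs constant-time priority changes at run time.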
Rubus is another operating system tailored for the automotive industry. It is developed by Arcticus
Systems [14], with support from the research community, and is, for example, used by Volvo Construction
Equipment. The Rubus OS consists of three parts: the Red Kernel, which
manages the execution of offline-scheduled time-triggered tasks; the Blue Kernel, dedicated to the execution of
event-triggered tasks; and the Green Kernel, in charge of external interrupts.
These three operating systems are well suited to the power train, chassis, and body domains because the
number of tasks integrated in these applications is known offline. On the other hand, they do not fit
the requirements of telematics applications. For this last domain, Windows CE for
Automotive, for example, is available; it extends the classical operating system Windows CE with telematics-oriented features.
Finally, an important issue for multipartner development and the flexibility requirement is the
portability of software components. For this purpose, several projects aim to specify an embedded
middleware that has to hide the specific communication system (portability) and to support fault
tolerance (see the Titus project [15], the ITEA EAST-EEA project [24], the DECOS project [16], or Volcano [17]).
Note that these projects, as well as the Rubus Concept [14], provide not only a middleware or an operating
system but also a way toward a component-based approach for designing real-time distributed embedded
applications.
41.1.4 A Cooperative Development Process
Strong cooperation between suppliers and carmakers in the design process implies the development of
a specific concurrent engineering approach. For example, in Europe or Japan, carmakers provide the
specification of subsystems to suppliers, which are then in charge of the design and realization of these
subsystems, including the software and hardware components and possibly the mechanical or hydraulic
parts. The results are furnished to the carmakers, who have to integrate them into the car and test them.
The last step consists of calibration activities, that is, tuning certain control and regulation parameters
to meet the required performance of the controlled systems. This activity is closely related to testing
activities. In the United States, the process is slightly different, as the suppliers cannot really be considered
independent of carmakers. Nevertheless, the subsystem integration and calibration activities always
have to be done and, obviously, any error detected during this integration leads to a costly feedback on the
specification or design steps. Therefore, in order to improve the quality of the development process, new
design methodologies are emerging. In particular, the different actors of a system development increasingly
apply methods and techniques ensuring the correctness of subsystems as early as possible in
the design stages, and a new trend is to consider the integration of subsystems at a virtual level [18]. This
means that carmakers as well as suppliers will be able to design, prove, and validate the models of each
subsystem, and of their integration, at each level of the development in a cooperative way. This new practice
will significantly reduce the cost of development and production of new electronic embedded systems
while increasing flexibility for the design of variants.
41.2 Abstraction Levels for In-Vehicle Embedded
System Description
As shown in Section 41.1.4, the way to improve the quality and the flexibility of an embedded electronic
system while decreasing the development and production cost is to design and validate this system at
a virtual level. Therefore, the problem is, first, to identify the abstraction level at which the components
and the whole system are to be represented. In order to apply validation and verification techniques on
the models, the second problem consists in specifying which validation and verification activities have to
be applied and, consequently, which formalisms support the identified models.
41.2.1 Architecture Description Languages
Two main keywords were introduced above: architectures, which refer to the concept of an Architecture
Description Language (ADL), well known in computer science, and components, which lead to modularity
principles and the object approach. An ADL is a formal approach for software and system architecture
specification [19]. In the avionics context, where the development of embedded systems raises the
same problems, MetaH [20] was developed at Honeywell and, in 2001, was chosen as the basis of
a standardization effort aiming to define an Avionics Architecture Description Language (AADL) standard
under the authority of SAE. This language can describe the standard control and data flow mechanisms
used in avionics systems, and important nonfunctional aspects such as timing requirements, fault and error
behaviors, time and space partitioning, and safety and certification properties. In the automotive industry,
some recent efforts brought a solution for mastering the design, modeling, and validation of in-vehicle
electronic embedded systems. The first result was obtained by the French project AEE (Architecture
Embedded Electronic) [21] and more specifically through the definition of AIL_Transport (Architecture
Implementation Language for Transport). This language, based on UML, allows specifying, in the
same framework, electronic embedded architectures from the highest level of abstraction, for the capture
of requirements and the functional views, to the lowest level, for the modeling of an implementation
taking into account the services and performance of hardware supports and the distribution of software
components [22,23].
41.2.2 EAST-ADL for In-Vehicle Embedded System Modeling
Taking AIL_Transport as one of the entry points of the European project ITEA EAST-EEA [24] (July 2001
to June 2004), a new language named EAST-ADL was defined. Like AIL_Transport, EAST-ADL offers
support for the unambiguous description of in-vehicle embedded electronic systems at each level
of their development. It provides a framework for the modeling of such systems through seven views
(see Figure 41.1) [25]:
The vehicle view describes user-visible features such as anti-lock braking or windscreen wipers.
The functional analysis architecture level represents the functions realizing the features, their behavior,
and their exchanges. There is an n-to-n mapping between vehicle view entities and functional
analysis architecture entities, that is, one or several functions may realize one or several features.
The functional design architecture level models a decomposition or refinement of the functions described
at the functional analysis architecture level in order to meet constraints regarding allocation,
efficiency, reuse, supplier concerns, and so on. Again, there is an n-to-n mapping between entities
of the functional design architecture and of the functional analysis architecture.
The logical architecture level is where the class representation of the functional design architecture has
been instantiated into a flat software structure suitable for allocation. This level provides an abstraction
of the software components to be implemented on the hardware architecture. The logical architecture
contains the leaf functions of the functional design architecture. From the logical architecture point
of view, the code could in many cases be generated automatically.
[Figure 41.1 shows the five abstraction layers (vehicle view, functional analysis architecture, functional design architecture, logical architecture, and operational architecture) alongside the hardware architecture and technical architecture views.]
FIGURE 41.1 The abstraction layers of the EAST-ADL.
In parallel to the application functionality, the execution environment is modeled from three views:
1. The hardware architecture level includes the description of the ECUs and, more precisely, of the
microcontroller used, the sensors and actuators, the communication links (serial links, networks),
and their connections.
2. The technical architecture level gives the model of the operating system or middleware API and the
services provided (in particular, the behavior of the middleware services, schedulers, frame packing, and memory
management).
3. The operational architecture models the tasks, managed by the operating systems, and the frames,
managed by the protocols. At this lowest abstraction level, all implementation details are captured.
A system described at the functional analysis level may be loosely coupled to hardware based on intuition,
various known constraints, or as a back annotation from more detailed analysis on lower levels.
Furthermore, the structure of the functional design architecture and of the logical architecture is aware
of the technical architecture. Finally, EAST-ADL provides consistency within and between artifacts
belonging to the different levels, from a syntactic and semantic point of view. This makes an
EAST-ADL-based model a strong and unambiguous support for automatically building models suited to
optimal configuration and/or validation and verification activities. For each of the identified objectives
(simulation or formal analysis at the functional level, optimal distribution, frame packing, round building
for TDMA-based networks, formal test sequence generation, timing analysis, performance evaluation,
dependability evaluation, etc.), a piece of software specific to the activity, to the related formalism, and to
EAST-ADL extracts the relevant data from the EAST-ADL repository and translates it into the adequate
formalism. Then the concerned activity can run, thanks to the adequate tools.
41.3 Validation and Verication Techniques
In this section we briefly introduce, in Section 41.3.1, the validation issues in the automotive industry and
the place of these activities in the development process, and we detail, in Section 41.3.2, a specific validation
technique that aims to prove that an operational architecture meets its performance properties.
41.3.1 General View of Validation Techniques
The validation of an embedded system consists of proving, on the one hand, that this system implements
all the required functionalities and, on the other hand, that it ensures functional and extra-functional
properties such as performance and safety properties. From an industrial point of view, validation and
verification activities address two complementary objectives:
1. Validation and verification of all or parts of a system at a functional level, without taking into account
the implementation characteristics (e.g., hardware performance). For this purpose, simulation or
formal analysis techniques can be used.
2. Verification of properties of all or parts of a system at the operational level. These activities integrate
the performance of both the hardware and technical architectures and the load due to a
given allocation of the logical architecture. This objective can also be reached through simulation
and formal analysis techniques. Furthermore, according to the level of guarantee required for the
system under verification, a designer may need deterministic guarantees or simply probabilistic
ones, involving different approaches.
The expression formal analysis is employed when mathematical techniques can be applied to an abstraction
of the system, while simulation represents the possibility to execute a virtual abstraction of it.
Obviously, formal analysis leads to an exhaustive analysis of the system, or more precisely of the model
that abstracts it. It provides a precise and definitive verdict. Nevertheless, the level of abstraction or the
accuracy of a model is in inverse ratio to its capacity to be treated in a bounded time. So this technique
is generally not suitable for large systems at the fine-grained abstraction level required, for example,
for the verification of performance properties of a widely distributed operational architecture; in
this case, the system is modeled by timed automata or queuing systems whose complexity can
make their analysis impossible. To solve this problem, simulation techniques can be applied. They accept
models at almost any level of detail. However, the drawback is that it is practically impossible to guarantee
that all the feasible executions can be simulated. Therefore, the pertinence of the results is linked to the
scenarios and the simulation duration, and we can only ensure that a system is correct for a
set of scenarios; this does not imply that the system will stay correct for any scenario. In fact, in the
automotive industry, simulation techniques are much more widely used than formal analysis. An exception
can be found in the context of the verification of properties to be respected by frames sharing
a network: a well-known formal approach, usually named timing analysis, is available for this purpose.
Finally, note that some tools are of course of general interest for the design and validation of electronic
embedded systems, for example, Matlab/Simulink and Stateflow [26], Ascet [27], Statemate [28], and
SCADE [29]. In some cases, an interface encapsulates these tools in order to suit them to the
automotive context.
Moreover, these techniques, which work on virtual platforms, are complemented by test techniques in order
to ensure that a realization is correct: tests of software components, of logical architectures, and of an
implemented embedded system. Note that the test activities, like the simulation ones, consist of
providing a scenario of events and/or data that stimulate the system under test or an executable
model of the system; in both techniques we then have to observe which events and/or data are produced by
the system. The input scenario can be built manually or generated formally. In the latter case, the test or
simulation activity is closely linked to a formal analysis technique [30].
Finally, one of the main targets of validation and verification activities is the dependability of
electronic embedded systems. As seen in the first section, some of these systems are said to be
safety-critical. This concern is heightened in the chassis domain by the emergence of X-by-Wire applications.
In this case, a high dependability level is required: the system has to exhibit fewer than 10^-9 failures
per hour (this means that the system has to work about 115,000 years without a failure). For now,
this is a challenge, because it is impossible to ensure this property through the actual reliability of the
electronic devices alone. Moreover, as the application may be sensitive to electromagnetic perturbations, its
behavior cannot be entirely predictable. So the required safety properties can only be reached by introducing
fault-tolerance strategies.
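The 115,000-year figure follows directly from the 10^-9 failures-per-hour bound; a one-line check:

```python
failure_rate_per_hour = 1e-9          # dependability target for X-by-Wire systems
mtbf_hours = 1 / failure_rate_per_hour
mtbf_years = mtbf_hours / (24 * 365)  # roughly 1.14e5 years of failure-free operation
```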
41.3.2 Validation by Performance Evaluation
The validation of a distributed embedded system requires, at least, proving that all the timing properties
are respected. These properties are generally expressed as timing constraints applied to the occurrences
of specific events, for example, a bounded jitter on a frame emission, a deadline on a task, or a bounded
end-to-end response time between two events. The first way of doing this is analytically, but this means
one should be able to establish a model that captures the temporal behavior of the system and that
can be mathematically analyzed. Considering the complexity of an actual electronic embedded system,
such a model has to be strongly simplified and generally provides only oversized solutions. For instance,
the holistic scheduling approach introduced by Tindell and Clark [31] allows only the evaluation of the
worst-case end-to-end response time for the periodic activities of a distributed embedded application.
Using this holistic scheduling approach, Song et al. [32] studied the end-to-end task response times for
an architecture composed of several ECUs interconnected by CAN.
Faced with the complexity of this mathematical approach, the simulation of a distributed application
is a complementary technique. It allows taking into account a more detailed model, as well as
the unavoidable perturbations that may affect the foreseen behavior. For example, a simulation-based
analysis [33] of the system presented in [32] gave more realistic performance measures than those obtained
analytically.
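The core of such analytical approaches is the classical fixed-priority response-time iteration, R_i = C_i + sum over higher-priority tasks j of ceil(R_i/T_j)·C_j, which the holistic method of [31] extends with network delays. A minimal single-processor sketch (assuming processor utilization below 1 so the iteration converges; this is not the full holistic analysis):

```python
import math

def response_times(tasks):
    """tasks: list of (C, T) pairs ordered from highest to lowest
    priority. Returns each task's worst-case response time, found by
    iterating R = C_i + sum_{j in hp(i)} ceil(R / T_j) * C_j until a
    fixed point is reached."""
    results = []
    for i, (c_i, _) in enumerate(tasks):
        r = c_i
        while True:
            interference = sum(math.ceil(r / t_j) * c_j
                               for c_j, t_j in tasks[:i])
            if c_i + interference == r:
                break  # fixed point: no further interference fits before r
            r = c_i + interference
        results.append(r)
    return results
```

For the hypothetical task set (C, T) = (1, 4), (2, 6), (3, 10), the fixed points are 1, 3, and 10 ms.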
[Figure 41.2 shows the Engine controller, AGB, ABS/VDC, Suspension, and WAS/DHC nodes on the CAN bus, the X, Y, and Z nodes on the VAN bus, and the ISU gateway connecting the two.]
FIGURE 41.2 Hardware architecture.
An outline of these two approaches is given in Sections 41.3.2.3 and 41.3.2.2 by means of
a common case study presented in Section 41.3.2.1; then, in Section 41.3.2.4, the respective results
obtained are compared. Finally, we show how a formal architecture description language, as introduced
in Section 41.2, is a strong factor for promoting validation and verification on virtual platforms
in the automotive industry.
41.3.2.1 Case Study
Figure 41.2 shows the electronic embedded system [34] used in the two following sections as a basis for
both the mathematical and the simulation approaches. This system is in fact derived from an actual one presently
embedded in a vehicle manufactured by PSA Peugeot-Citroën [35]. It includes functions
related to the power train, chassis, and body domains.
41.3.2.1.1 Hardware Architecture Level (Figure 41.2)
We consider nine nodes (ECUs) interconnected by means of one CAN and one VAN network.
The names of these nodes recall the global functions that they support: Engine controller, AGB
(Automatic Gear Box), ABS/VDC (Anti-lock Brake System/Vehicle Dynamic Control), WAS/DHC
(Wheel Angle Sensor/Dynamic Headlamp Corrector), and Suspension controller refer to nodes connected
to CAN, while X, Y, and Z (named so for confidentiality reasons) refer to nodes connected to VAN.
Finally, the ISU (Intelligent Service Unit) node ensures the gateway function between CAN and VAN.
The communication is supported by two networks: CAN 2.0A (bit rate equal to 250 Kbps) and
VAN (time slot rate fixed at 62.5 kTS/s).
The different ECUs are connected to these networks by means of network controllers. For this case
study we consider the Intel 82527 CAN network controller (14 transmission buffers), the Philips
PCC1008T VAN network controller (one transmission buffer and one First In First Out (FIFO)
reception queue with two places), and the MHS 29C461 VAN network controller (handling up to
14 messages in parallel).
41.3.2.1.2 Technical Level
The operating system OSEK [11] runs on each ECU. The scheduling policy is the Fixed Priority protocol.
Each OS task is a basic task in the OSEK sense. In the actual embedded system, preemption is not permitted
for tasks. In the study presented here, the analytical method is applied strictly to this system, while simulations
are run for different configurations, two of which accept preemptible tasks.
41.3.2.1.3 Operational Level
The entities considered at this level are tasks and messages (frames); they are summarized in
Figure 41.3, Figure 41.4, Figure 41.5, and Figure 41.7. The mapping of the logical architecture (not
presented here) onto the technical and hardware ones produces 44 OSEK OS tasks (in short, tasks, in the
following) and 19 messages exchanged between these tasks. Furthermore, we assume that a task operating
ECU: Suspension
T_SUS1: P=4, T=20, output M9, D=20
T_SUS2: P=5, input M5, D=20
T_SUS3: P=1, input M1, D=10
T_SUS4: P=2, input M2, D=14
T_SUS5: P=3, input M7, D=15

ECU: Engine_Ctrl
T_Engine1: P=1, T=10, output M1, D=10
T_Engine2: P=4, T=20, output M3, D=20
T_Engine3: P=7, T=20, output M10, D=100
T_Engine4: P=3, input M4, D=15
T_Engine5: P=2, input M2, D=14
T_Engine6: P=6, input M8, D=50
T_Engine7: P=5, input M6, D=40

ECU: AGB
T_AGB1: P=2, T=15, output M4, D=15
T_AGB2: P=3, T=50, output M11, D=50
T_AGB3: P=4, input M8, D=50
T_AGB4: P=1, input M2, D=14

ECU: ABS/VDC
T_ABS1: P=2, T=20, output M5, D=20
T_ABS2: P=5, T=40, output M6, D=40
T_ABS3: P=1, T=15, output M7, D=15
T_ABS4: P=6, T=100, output M12, D=100
T_ABS5: P=3, input M3, D=20
T_ABS6: P=4, input M9, D=20

ECU: WAS/DHC
T_WAS1: P=1, T=14, C=2, output M2, D=14
T_WAS2: P=2, input M9, C=2, D=20

(P_i: priority; T_i: activation period in ms for time-triggered tasks; input: activating message for event-triggered tasks; output: produced message; C_i: WCET in ms; D_i: relative deadline in ms.)

FIGURE 41.3 Operating system tasks on nodes connected to CAN.
ECU: X
T_X1: P=2, T=150, output M16, D=150
T_X2: P=4, T=200, output M17, D=200
T_X3: P=1, input M15, D=50
T_X4: P=3, input M19, D=150

ECU: Z
T_Z1: P=2, T=100, output M18, D=100
T_Z2: P=3, T=150, output M19, D=150
T_Z3: P=4, input M17, D=200
T_Z4: P=1, input M15, D=50

ECU: Y
T_Y1: P=2, T=50, output M15, D=50
T_Y2: P=3, input M13, D=50
T_Y3: P=1, input M14, D=10
T_Y4: P=4, input M18, D=100
T_Y5: P=5, input M16, D=150

FIGURE 41.4 Operating system tasks on nodes connected to VAN.
ECU: ISU
T_ISU1: P=4, T=50, output M8, D=50
T_ISU2: P=5, input M11, output M13, D=50
T_ISU3: P=1, input M1, output M14, D=10
T_ISU4: P=6, input M10, D=100
T_ISU5: P=3, input M6, D=40
T_ISU6: P=2, input M9, D=20
T_ISU7: P=7, input M12, D=100

FIGURE 41.5 Operating system tasks distributed on the gateway ECU.
system consumes (respectively produces) a message, possibly simultaneously with the beginning (respectively the end) of its execution. In the case study, two kinds of tasks can be identified according to their activation law:
• Tasks activated by the occurrence of the event "reception of a message" (event-triggered tasks), as, for example, T_Engine6 and T_ISU2.
• Tasks that are activated periodically (time-triggered tasks), as T_AGB2.
Each task is characterized by its name and, on the ECU named k on which it is mapped, by (see Figure 41.3, Figure 41.4, and Figure 41.5):
• T_i^k: its activation period in ms (for time-triggered tasks) or the name M_n of the message whose reception activates it (for event-triggered tasks).
• C_i^k: its WCET (Worst-Case Execution Time) on this ECU (disregarding possible preemption); in the case study, we assume that this WCET is equal to 2 ms for each task.
• D_i^k: its relative deadline in ms.
• P_i^k: its priority.
• M_i: its possibly produced message (we assume, in this case study, that at most one message is produced by one task; note that the method can be applied even if a task produces more than one message).
41-14 Embedded Systems Handbook
FIGURE 41.6 Task response time: the interval on the time axis between task activation and task completion; preemption may occur within it (for preemptive tasks only).
(a) Messages exchanged over CAN
Message m_i  Producer task  DLC_i (bytes)  Inherited period T_i
M1           T_Engine1       8              10
M2           T_WAS1          3              14
M3           T_Engine2       3              20
M4           T_AGB1          2              15
M5           T_ABS1          5              20
M6           T_ABS2          5              40
M7           T_ABS3          4              15
M8           T_ISU1          5              50
M9           T_SUS1          4              20
M10          T_Engine3       7              100
M11          T_AGB2          5              50
M12          T_ABS4          1              100

(b) Messages exchanged over VAN
Message m_i  Producer task  DLC_i (bytes)  Inherited period T_i
M13          T_ISU2          8              50
M14          T_ISU3         10              10
M15          T_Y1           16              50
M16          T_X1            4              150
M17          T_X2            4              200
M18          T_Z1            2              100
M19          T_Z2           20              150

FIGURE 41.7 Messages exchanged over networks: (a) on CAN, (b) on VAN.
For notation convenience, we assume that, on each ECU named k, priority P_i^k is higher than priority P_{i+1}^k. In the following section, a task is simply denoted τ_i if its priority is P_i^k on an ECU named k.
The task response time is classically defined as the time interval between the activation of a given task and the end of its execution (Figure 41.6). We denote R_i^j the task response time of the instance j of a task τ_i.
Each message (frame) is characterized by its name and (Figure 41.7):
• DLC_i: its size, in bytes.
• C_i: its transmission duration; this duration is computed thanks to the formulae given in (41.1) and (41.2) (see Sections 41.1.3.1.1 and 41.1.3.1.2).
• Task_i: the name of the task that produces it.
• T_i: its inherited period (for time-triggered tasks), assumed in [31] and [32] to be equal to the activation period of its producer task.
• P_i: its priority.
A message will also be denoted by m_i if its priority is P_i.
The message response time is the time interval between the production of a specific message and its reception by a consumer task (Figure 41.8). We denote R_i^j the message response time of the instance j of a message m_i.
Finally, in this system, from Figure 41.3, Figure 41.4, Figure 41.5, and Figure 41.7, we identify some logical chains, that is, causal sequences of tasks and messages. In the case study, the most complex logical chains that can be identified are:
lc1: T_Engine1 - M1 - T_ISU3 - M14 - T_Y3
and
lc2: T_AGB2 - M11 - T_ISU2 - M13 - T_Y2
FIGURE 41.8 Message response time: the interval between message production (= producer task completion) and the end of message transmission (= consumer task activation).
FIGURE 41.9 Example of logical chain (lc2): T_AGB2 activation, T_AGB2 completion (= M11 production), T_ISU2 activation, T_ISU2 completion (= M13 production), T_Y2 activation, and T_Y2 completion; the logical chain response time for lc2 runs from T_AGB2 activation to T_Y2 completion.
Here, the task T_Y3 (respectively T_Y2), running on a VAN-connected node, depends on the message M14 (respectively M13) supported by the VAN bus; M14 (respectively M13) is produced by task T_ISU3 (respectively T_ISU2) running on the ISU node; T_ISU3 (respectively T_ISU2) is activated by the message M1 (respectively M11) that is produced by T_Engine1 (respectively T_AGB2) running on CAN-connected nodes.
The logical chain response time, more generally named End-to-End Response Time, is defined for lc1 (respectively lc2) as the time interval between the activation of T_Engine1 (respectively T_AGB2) and the completion of T_Y3 (respectively T_Y2) (Figure 41.9). We note R_lci^j the logical chain response time of the instance j of the logical chain lci.
41.3.2.1.4 Performance Properties
As presented in Figure 41.3, Figure 41.4, and Figure 41.5, relative deadline constraints are imposed on each task in this application. Furthermore, for the given application, some other performance properties were required. Among these properties, we focus on two specific ones:
1. Property A: No message transmitted on CAN or VAN is lost. This means that no message can be overwritten in network controller buffers or, more formally, that each message is transmitted before its inherited period T_i, considered as the worst case.
2. Property B: This property is expressed on the two logical chains lc1 and lc2 presented above. The logical chain response time for lc1 (respectively lc2) must be as regular as possible for each instance of lc1 (respectively lc2). More formally, if R1 is the set of logical chain response times obtained for each instance j of lc1 (and similarly R2 for lc2), the property is: for all j, |R_lc1^j - E[R1]| ≤ ε, for a given bound ε.
This kind of property is commonly required in embedded automatic control applications where the command elaborated through a logical chain has to be applied to an actuator as regularly as possible.
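On a set of measured logical chain response times, property B amounts to bounding each instance's deviation from the mean of the series; a minimal sketch, in which the tolerance eps and the sample values are hypothetical, not figures from the case study:

```python
def check_regularity(response_times, eps):
    """Return the instances (j, R_j) whose response time deviates from
    the mean of the series by more than eps (property B violations)."""
    mean = sum(response_times) / len(response_times)
    return [(j, r) for j, r in enumerate(response_times) if abs(r - mean) > eps]

# Illustrative series of chain response times in ms
samples = [9.09, 9.20, 9.45, 9.60, 12.62]
print(check_regularity(samples, eps=2.0))   # flags the 12.62 outlier
```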
An embedded system is correct if, at least, it meets the above-mentioned properties. However, the task scheduling policy on each node and the MAC protocols of VAN and CAN unavoidably lead to jitters on task terminations. So, a mathematical approach as well as a simulation approach were applied in order to prove that the proposed operational architecture meets all its constraints. Thanks to the mathematical approach, related to the general techniques named timing analysis, we find, for each entity (task or message) and for each logical chain, lower and upper bounds on their respective response times. These values represent the best and worst cases. In order to handle more detailed and more realistic models, we use a simulation method, which furnishes minimum, maximum, and mean values of the same response times. Furthermore, several simulations, with different parameter configurations, were performed in order to obtain an architecture meeting the constraints. In fact, we use the mathematical approach for validating the results obtained by simulation.
41.3.2.2 Simulation Approach
We model four different configurations of the presented operational architecture according to the formalism supported by the SES Workbench tool. For each configuration, we use this tool in order to run a simulation and obtain different results. Furthermore, as we want to analyze specific response times, we introduce adequate probes in the model. Thanks to this, the log file obtained throughout the simulation process can be easily analyzed by applying an elementary filter that furnishes the results in a readable way.
Three kinds of parameters are considered and can differ from one configuration to another:
• The network controllers, specifically the VAN ones
• The fact that tasks can be preempted or not
• The task priorities
Rather than describing a simulation campaign that would exhaustively cover each possible combination, we prefer to present it by following an intuitive reasoning: starting from a given configuration (configuration 1), modifying one kind of parameter at a time leads successively to better configurations (configuration 2, then configuration 3), finally reaching a correct configuration that verifies the required properties A and B.
41.3.2.2.1 Configuration 1
As a first simulation attempt:
• As given in the description of the actual embedded system (see Section 41.3.2.1), all the tasks are considered as being OSEK basic tasks and are characterized by their local priority. Moreover, their execution is done without preemption.
• We assign the Intel 82527 controller to each node connected to the CAN bus and the Philips PCC1008T controller to those connected to the VAN network. Note that the ISU ECU integrates these two network controllers.
In this case, a probe is introduced in the model; it observes the occurrences of message production and message transmission and detects the fact that a given instance of a message is stored in the buffer of a network controller before the previously produced instance was transmitted through the network. For each of these detected events, it writes a specific string in the log file. The filter then consists in extracting the lines containing this string from the log file. A screenshot is given in Figure 41.10, where it can be seen that some messages are overwritten in the single transmission buffer of the VAN controller that was chosen for this configuration. So, we conclude that property A is not verified.
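The elementary filter mentioned above simply extracts, from the simulation log, the lines carrying the probe's marker string; a sketch in which the marker text and the log lines are hypothetical:

```python
MARKER = "MESSAGE_OVERWRITTEN"   # hypothetical probe string

def filter_log(lines, marker=MARKER):
    """Keep only the lines written by the overwrite-detection probe."""
    return [line for line in lines if marker in line]

log = [
    "t=12.4 M13 produced by T_ISU2",
    "t=12.9 MESSAGE_OVERWRITTEN M13 in VAN controller buffer",
    "t=13.1 M13 transmitted",
]
print(filter_log(log))   # property A is violated if this list is non-empty
```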
FIGURE 41.10 Log file filtered for verification of property A.
Logical chain response times (ms):

                           Simulation results                    Analytic results
Configuration    Chain   minimum   mean   maximum   Std dev.   minimum   maximum
Configuration 2   lc1      9.09   11.03    16.67     1.775      8.992     22.116
                  lc2      9.82   12.82    16.67     1.458      8.576     41.172
Configuration 3   lc1      9.09    9.45    12.62     0.667      8.992     16.116
                  lc2     12.45   14.16    20        2.101      8.576     35.172
Configuration 4   lc1      9.09    9.45    12.62     0.667      8.992     16.116
                  lc2     12.45   12.61    14.11     0.490      8.576     27.172

FIGURE 41.11 Response time evaluation.
41.3.2.2.2 Configuration 2
One of the possible causes for the nonverification of property A by the previous configuration could be that the VAN controller PCC1008T, providing only one single buffer, is not suitable for the required performance property. So, we assign the full VAN controller MHS 29C461 to all nodes transmitting messages on the VAN bus (ISU computer, X, Y, and Z). We modify the SES Workbench model and relaunch the simulation. This time, the probes and filters proposed for configuration 1 provide an empty list. So we can conclude that messages are correctly transmitted and that property A is verified. Furthermore, SES Workbench gives additional results such as the network load; for this configuration, the load of the CAN bus is less than 21.5% and that of the VAN bus less than 41%.
On the same configuration, we study property B. For this purpose, probes are introduced for observing the occurrences of the first task activation (T_Engine1 or T_AGB2) and the occurrences of the last task completion (T_Y3 or T_Y2). A filter is developed for the evaluation of the minimum, mean, and maximum logical chain response times of lc1 and lc2 as well as their standard deviation. The obtained results are given in Figure 41.11. Under this configuration, none of the chains meets the required property.
41.3.2.2.3 Configuration 3
In the two last configurations, preemption was not allowed for any task. We change this characteristic and allow preemption; as T_Engine1 and T_AGB2 have the highest local priority, and considering that they are basic tasks, they will never wait for the processor. Once more, we model the operational architecture by modifying the scheduling policy on the nodes Engine_Ctrl and AGB without changing the other parameters. The same probes and filters are used; the results obtained by simulation of configuration 3 are shown in Figure 41.11. We can conclude that property B is verified only for the logical chain lc1. So, this configuration does not correspond to a correct operational architecture.
41.3.2.2.4 Configuration 4
Further log file analysis points out the problem: the priority of T_ISU2 is probably too low. After modifying the priority of this task (2 in place of 5), still using the same probes and filters and simulating the new model, we obtain the results presented in Figure 41.11. Property B is verified for lc1 and lc2.
41.3.2.3 Deterministic Timing Analysis
In order to validate these results, we apply the analytic formulas of [32] to the case study. The main purpose of this analysis is to obtain the lower (best-case) and the upper (worst-case) bounds on the response times. It is worth noting that in practice neither the best case nor the worst case can necessarily be achieved, but they provide deterministic bounds.
As a time-triggered design approach is adopted, both tasks and messages are expected to be periodic, although in practice jitter exists on events whose occurrences are supposed to be periodic. In the following, a nonpreemptive task τ_i or a message m_i whose priority is P_i can be indifferently characterized by (C_i, T_i) as defined previously.
As introduced earlier, we are interested in evaluating:
• The response time R_i of such a task or message of priority P_i.
• The logical chain response times of lc1 and lc2, obtained by summing these individual response times.
41.3.2.3.1 Best-Case Evaluation
The best case corresponds to the situation where a task τ_i (respectively a message m_i), whose priority is P_i, is executed (respectively transmitted) without any waiting time. In this case,

R_i = C_i    (41.3)

The best case of the logical chain response time is the sum of the best-case response times of all entities (tasks and messages) involved in the chain. Applying this to the two logical chains, we obtain (see Figure 41.11):

R_best_lcx = Σ_{y ∈ lcx} C_y    (41.4)
41.3.2.3.2 Worst-Case Evaluation
We distinguish the evaluation of the worst case for task and message response times.
Messages. For a message m_i of priority P_i, the worst-case response time can be calculated as:

R_i = C_i + I_i    (41.5)

where I_i is the interference period during which the transmission medium is occupied by other higher-priority messages and by one lower-priority message (because of nonpreemption). Take notice of the fact that the message response time is defined here in a way different from that specified by Tindell and Burns in [36]; so, the jitter J_i is not included in formula (41.5).
The following recurrence relation calculates the interference period I_i:

I_i^{n+1} = max_{i+1 ≤ j ≤ N} (C_j) + Σ_{j=1}^{i-1} ⌊(I_i^n + J_j)/T_j + 1⌋ C_j    (41.6)

where N is the number of messages and max_{i+1 ≤ j ≤ N} (C_j) is the blocking factor due to the nonpreemption. A suitable initial value could be I_i^0 = 0. Equation (41.6) converges to a value as long as the transmission medium's utilization is less than or equal to 100%. We also notice that the jitters should be taken into account for the calculation of the worst-case interference period, as the higher-priority messages are considered periodic with jitters.
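The fixed point of equation (41.6) can be computed by direct iteration; in the sketch below, messages are indexed by decreasing priority (index 0 highest) and the numeric values are illustrative, not the case-study figures.

```python
def worst_case_interference(i, C, T, J, max_iter=1000):
    """Fixed point of Eq. (41.6) for message i (index 0 = highest priority).

    Blocking term: the longest lower-priority frame (nonpreemptive bus);
    interference term: higher-priority frames j < i, periodic with jitter J[j].
    """
    blocking = max(C[i + 1:], default=0.0)
    I = 0.0                                   # I_i^0 = 0
    for _ in range(max_iter):
        new_I = blocking + sum(
            (int((I + J[j]) // T[j]) + 1) * C[j] for j in range(i))
        if new_I == I:
            return I
        I = new_I
    raise RuntimeError("no convergence: utilization too high")

def worst_case_response(i, C, T, J):
    """Eq. (41.5): R_i = C_i + I_i."""
    return C[i] + worst_case_interference(i, C, T, J)

# Illustrative three-message set (durations and periods in ms)
C = [0.5, 0.6, 0.4]
T = [10.0, 14.0, 20.0]
J = [0.0, 0.0, 0.0]
print(worst_case_response(1, C, T, J))
```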
Tasks. For a task τ_i whose priority is P_i, the same arguments lead to formulae similar to those obtained for messages. However, we must distinguish two cases for the task response time evaluation. For nonpreemptive fixed-priority scheduling, equations (41.5) and (41.6) are directly applicable while, if the basic tasks are scheduled by a preemptive fixed-priority policy, the factor max_{i+1 ≤ j ≤ N} (C_j) does not have to be considered (the possibility of preemption ensures that a task at a given priority level cannot be blocked by a task at a lower level). Therefore, the following recurrence relation allows one to calculate the response time of a basic preemptive task:

R_i^{n+1} = C_i + Σ_{j<i} ⌈(J_j + R_i^n)/T_j⌉ C_j    (41.7)

Again, equation (41.7) converges to a value as long as the processor utilization is less than or equal to 100%. In addition, a suitable initial value for the computation could be R_i^0 = 0.
Logical chains. Finally, we can apply equation (41.5) for the nonpreemptive case (respectively equation (41.7) for the preemptive case) to calculate the worst-case response time of the two logical chains:

R_worst_lcx = Σ_{y ∈ lcx} R_y    (41.8)

Figure 41.11 presents the bounds (minimum and maximum response times) obtained thanks to this mathematical timing analysis for both logical chains, according to equations (41.4) and (41.8). Note that the maximum response time in Figure 41.11 corresponds to the nonpreemptive case for configuration 2, while the other two configurations are based on the preemptive assumption.
41.3.2.4 Comments on Results
First, we notice that the simulation results remain within the bounds given by the analytical method of Section 41.3.2.3. However, it can be seen that the analytic bounds for the worst case are never reached during simulation. Maximum values obtained by simulation vary from 40 to 70% of the analytically calculated worst cases, while mean values vary from 30 to 60%. The importance of simulation for obtaining more realistic results thus becomes obvious when evaluating the performances of an embedded system. From these tables we can also see that, compared with nonpreemptive scheduling, preemptive scheduling logically results in shorter response times for high-priority tasks and longer response times for low-priority tasks. Note, however, that this fact seems to be in contrast with the analytic method results, where the worst-case bound gets better for preemptive policies than for nonpreemptive ones, irrespective of task priority. This is perfectly normal since the results from the two methods are not to be interpreted in the same way: analytic results can be used as bounds to validate simulation results, but they have different meanings and are rather complementary.
41.3.2.5 Automatic Generation of Models for Simulation Purpose
Usually, the direct use of a general-purpose simulation platform is not judged suitable by in-vehicle embedded system designers, since too much effort must be put into building the simulation model. Thanks to a nonambiguous description of embedded systems, as seen in Section 41.2, it is possible to generate automatically a model that can be run on a specific discrete-event simulation tool. For example, in [34], a modeling methodology is proposed, developed in collaboration with the French carmaker PSA Peugeot-Citroën and based on a component approach. This methodology has been implemented through the development of a simulation tool called Carosse-Perf, based on the SES Workbench simulation platform [37]. It is composed, on the one hand, of a library of prebuilt components modeled in the SES Workbench formalism and, on the other hand, of a constructor that uses these models and the description of the embedded distributed architecture in order to obtain the whole model that will be simulated. The constructor extracts the pertinent information from the static description of the system at the logical architecture level (tasks, data exchanged between tasks, behavior), from the technical and hardware architectures (policies for access to resources, such as the scheduler policy and the network protocols, and performances of the hardware components) and, finally, from the description of how the logical architecture is mapped onto the technical one. Technical and hardware architecture components are modeled once and for all in the SES Workbench formalism. The principle of the model building is presented in Figure 41.12(a).
FIGURE 41.12 Simulator generation and simulation process. (a) A library of predefined hardware component models and the hardware architecture description feed the hardware architecture modeling step; the resulting hardware model in SES Workbench language is compiled into a runnable simulation program. (b) The runnable simulation program, together with information extracted from the constraints description, the logical architecture description (LAI), and the environment scenario description (ESI), drives the simulation, producing a trace whose analysis yields the results.
As, at the simulation step, the behavior of the logical architecture entities (tasks and messages) and the environment signal occurrences animate the simulation, the constructor has to include in the model two generic modules that will be executed by the simulator: a logical architecture interpreter and an environment scenario interpreter (named LAI and ESI, respectively, in Figure 41.12), whose role is to extract, during the simulation, the current event, from the logical architecture entities or from the environment signals, that is to be managed by the discrete event simulator.
This kind of tool allows designers to easily build a simulation model of their new in-vehicle embedded systems (operational architecture) and then to simulate the model. More details about the underlying principles can be found in [34]. Carosse-Perf was used to automatically construct the models corresponding to the four configurations of the case study and to simulate them.
41.4 Conclusions and Future Trends
Embedded electronics, and especially embedded software, take on more and more importance within a car, in terms of both functionality and cost. Due to the cost, real-time, and dependability constraints of the automotive industry, many automotive-specific networks (e.g., CAN, LIN, FlexRay) and operating systems (e.g., OSEK/VDX) have been or are still being developed, most of them within the SAE standardization process.
Today's in-vehicle embedded system is a complex distributed system, mainly composed of four different domains: power train, chassis, body, and telematics. Functions of the different domains are under quite different constraints. SAE has classified the automotive applications into classes A, B, and C with increasing order of criticality on real-time and dependability constraints. For the design and validation of such a complex system, an integrated design methodology as well as validation tools are therefore necessary.
After introducing the specificity of the automotive application requirements in terms of time-to-market, design cost, variant handling, real-time and dependability constraints, and multipartner involvement (carmakers and suppliers) during the development phases, in this chapter we have described the approach proposed by
EAST-ADL, which is a promising design and development framework tailored to fit the specific needs of embedded automotive applications.
Concerning the validation that an implementation of the designed embedded system meets the application constraints, we have reviewed the possible ways and illustrated the use of simulation for validating the real-time performance. This illustration is done through a case study drawn from a PSA Peugeot-Citroën application. The obtained results have shown that the use of a simulation approach, combined with the timing analysis method (especially the holistic scheduling method), permits to efficiently validate the designed embedded architecture.
If we can consider that the power train and body domains begin to achieve maturity, the chassis domain, and especially the X-by-Wire systems, are however still in their early development phase. The finalization of the new protocol FlexRay as well as the development of the 42 V power supply will certainly push forward X-by-Wire system development. The main challenge for X-by-Wire systems is to prove that their dependability is at least as high as that of the traditional mechanical/hydraulic systems.
Portability of embedded software is another main preoccupation of automotive embedded application developers, and constitutes another main challenge. For this purpose, carmakers and suppliers established the AUTOSAR consortium (http://www.autosar.org/) to propose an open standard for automotive embedded electronic architecture. It will serve as a basic infrastructure for the management of functions within both future applications and standard software modules. The goals include the standardization of basic system functions and functional interfaces, the ability to integrate and transfer functions, and to substantially improve software updates and upgrades over the vehicle lifetime.
41.5 Appendix: In-Vehicle Electronic System
Development Projects
System Engineering of Time-Triggered Architectures (SETTA). This project (January 2000 to December 2001) was partly funded by the European Commission under the Information Society Technologies program. The overall goal of the SETTA project was to push time-triggered architecture, an innovative European-funded technology for safety-critical, distributed, real-time applications such as fly-by-wire or drive-by-wire, into future vehicles, aircraft, and train systems. The consortium was led by DaimlerChrysler AG. DaimlerChrysler and the partners Alcatel (A), EADS (D), Renault (F), and Siemens VDO Automotive (D) acted as the application providers and technology validators. The technology providers were Decomsys (A) and TTTech (A). The academic research component was provided by the University of York (GB) and the Vienna University of Technology (A). http://www.setta.org/.
Embedded Electronic Architecture (EAST-EEA), ITEA Project No. 00009. The major goal of EAST-EEA (July 2001 to June 2004) was to enable proper electronic integration through the definition of an open architecture. This would allow hardware and software interoperability and reuse for mostly distributed hardware. The partners were AUDI AG (D), BMW AG (D), DaimlerChrysler AG (D), Centro Ricerche Fiat (I), Opel Powertrain GmbH (D), PSA Peugeot Citroën (F), Renault (F), Volvo Technology AB (S), Finmek Magneti Marelli Sistemi Elettronici (I), Robert Bosch GmbH (D), Siemens VDO Automotive AG (D), Siemens VDO Automotive SAS (F), Valeo (F), ZF Friedrichshafen AG (D), ETAS GmbH (D), Siemens SBS C-LAB (D), VECTOR Informatik (D), CEA-LIST (F), IRCCyN (F), INRIA (F), Linköping University of Technology (S), LORIA (F), Mälardalen University (S), Paderborn University C-LAB (D), Royal Institute of Technology (S), Technical University of Darmstadt (D). www.east-eea.net/docs.
AEE Project (Embedded Electronic Architecture). This project (November 1999 to December 2001) was granted by the French Ministry for Industry. It involved French carmakers (PSA, RENAULT), OEM suppliers (SAGEM, SIEMENS, VALEO), the EADS LV company, and research centers (INRIA, IRCCyN, LORIA). It aimed to specify new solutions for in-vehicle embedded system development. The Architecture Implementation Language (AIL_Transport) had been defined to specify and describe precisely any vehicle electronic architecture. http://aee.inria.fr/en/index.html.
Electronic Architecture and System Engineering for Integrated Safety Systems (EASIS). The goal of the EASIS project (January 2004 to December 2006) is to define and develop a platform for software-based functionality in vehicle electronic systems providing common services upon which future applications can be built; a vehicle on-board electronic hardware infrastructure which supports the requirements of integrated safety systems in a cost-effective manner; a set of methods and techniques for handling critical dependability-related parts of the development lifecycle; and an engineering process enabling the application of integrated safety systems. This project is funded by the European Community (6th FWP). Partners are Kuratorium OFFIS e.V. (G), DAF Trucks N.V. (N), Centro Ricerche Fiat, Societa Consortile per Azioni (I), Universitaet Duisburg-Essen, Standort Essen (G), dSPACE GmbH (G), Valeo Électronique et Systèmes de Liaison (F), Motorola GmbH (G), Peugeot-Citroën Automobiles SA (F), Mira Limited (UK), Philips GmbH Forschungslaboratorien (G), ZF Friedrichshafen AG (G), Adam Opel Aktiengesellschaft (G), ETAS (G), Volvo Technology AB (S), Lear Automotive S.L. (S), Vector Informatik GmbH (G), Continental Teves AG & Co. OHG (G), Decomsys GmbH (A), Regienov (F), Robert Bosch GmbH (G).
Automotive Open System Architecture (AUTOSAR). The objective of the partnership involved in AUTOSAR (May 2003 to August 2006) is the establishment of an open standard for automotive E/E architecture. It will serve as a basic infrastructure for the management of functions within both future applications and standard software modules. The goals include the standardization of basic system functions and functional interfaces, the ability to integrate and transfer functions, and to substantially improve software updates and upgrades over the vehicle lifetime. The AUTOSAR scope includes all vehicle domains. A three-tier structure, proven in similar initiatives, is implemented for the development partnership. Appropriate rights and duties are allocated to the various tiers: Premium Members, Associate Members, Development Members, and Attendees. http://www.autosar.org/.
References
[1] Society of Automotive Engineers, www.sae.org.
[2] G. Leen, D. Heffernan, Expanding automotive electronic systems, Computer, 35, 88–93, 2002.
[3] A. Sangiovanni-Vincentelli, Automotive Electronics: Trends and Challenges, Convergence 2000,
Detroit MI, October 2000.
[4] F. Simonot-Lion, In-car embedded electronic architectures: how to ensure their safety, in Proceedings of the 4th IFAC Conference on Fieldbus Systems and their Applications, FET'03, Aveiro, Portugal, July 2003, pp. 1–8.
[5] ISO, Road Vehicles – Interchange of Digital Information – Controller Area Network for High-Speed Communication, ISO 11898, International Organization for Standardization (ISO), 1994.
[6] ISO, Road Vehicles – Low-Speed Serial Data Communication – Part 2: Low-Speed Controller Area Network, ISO 11519-2, International Organization for Standardization (ISO), 1994.
[7] ISO, Road Vehicles – Low-Speed Serial Data Communication – Part 3: Vehicle Area Network, ISO 11519-3, International Organization for Standardization (ISO), 1994.
[8] B. Abou, J. Malville, Le bus VAN (Vehicle Area Network): fondements du protocole, Dunod, Paris, 1997.
[9] SAE, Class B Data Communications Network Interface, J1850, Society of Automotive Engineers
(SAE), May 2001.
[10] TTTech, Specification of the TTP/C Protocol, Version 0.5, TTTech Computertechnik GmbH, July 1999.
[11] OSEK, OSEK/VDX Operating System, Version 2.2, 2001. http://www.osek-vdx.org.
[12] J.B. Goodenough, L. Sha, The priority ceiling protocol: a method for minimizing the blocking of high priority tasks, in Proceedings of the 2nd International Workshop on Real-Time Ada Issues, Ada Letters, 8, 1988, pp. 20–31.
[13] Modistarc Project, http://www.osek-vdx.org/whats_modistarc.htm.
[14] http://www.arcticus.se/.
2006 by Taylor & Francis Group, LLC
Design and Validation Process 41-23
[15] U. Freund, M. von der Beeck, P. Braun, and M. Rappl, Architecture centric modeling of automotive
control software, SAE Technical paper series 2003-01-0856.
[16] DECOS Project, http://www.decos.at/.
[17] A. Rajnak, K. Tindell, and L. Casparsson, Volcano Communications Concept, Volcano Communica-
tions Technologies AB, Gothenburg, Sweden, 1998. Available at http://www.vct.se.
[18] P. Giusto, J.-Y. Brunel, A. Ferrari, E. Fourgeau, L. Lavagno, and A. Sangiovanni-Vincentelli, Automotive virtual integration platforms: whys, whats, and hows, in Proceedings of the 2002 IEEE International Conference on Computer Design: VLSI in Computers and Processors (ICCD'02), Freiburg, Germany, 16–18 September, 2002, pp. 370–378.
[19] N. Medvidovic, R.N. Taylor, A framework for classifying and comparing architecture description languages, Technical report, Department of Information and Computer Science, University of California, Irvine, 1997.
[20] S. Vestal, MetaH User's Manual, Honeywell Technology Center, Carnegie-Mellon, 1995. http://www.htc.honeywell.com/metah/uguide.pdf.
[21] AEE, Architecture Electronique Embarque, 1999, http://aee.inria.fr.
[22] J.-P. Elloy, F. Simonot-Lion, An architecture description language for in-vehicle embedded system development, in Proceedings of the 15th IFAC World Congress, IFAC B'02, Barcelona, Spain, 21–26 July, 2002.
[23] J. Migge, J.-P. Elloy, Embedded electronic architecture, in Proceedings of the 3rd International Workshop on Open Systems in Automotive Networks, Bad Homburg, Germany, 2–3 February, 2000.
[24] ITEA EAST EEA Project, www.east-eea.net/docs.
[25] U. Freund, O. Gurrieri, J. Küster, H. Lonn, J. Migge, M.O. Reiser, T. Wierczoch, and M. Weber, An architecture description language for developing automotive ECU-software, in INCOSE 2004, Toulouse, France, 20–24 June, 2004, pp. 101–112.
[26] www.mathworks.com/.
[27] Ascet SupplyChain, www.ascet.com/.
[28] Ilogic Statemate, www.ilogix.com/.
[29] Esterel Technologies SCADE Suite
TM
for Safety-Critical Software, www.esterel-
technologies.com.
[30] C. Jard. Automatic Test Generation Methods for Reactive Systems. CIRM Summer School,
Marseille, 1998.
[31] Tindell Ken and Clark John, Holistic schedulability analysis for distributed hard real-time systems,
Microprocessor and Microprogramming, 40, 117134, 1994.
[32] Y.Q. Song, F. Simonot-Lion, and N. Navet, De lvaluation de performances du systme de
communication la validation de larchitecture oprationnelle cas du systme embarqu
dans lautomobile, Ecole dt temps rel 1999, Poitiers (France), C.N.R.S., Poitiers (France),
Ed. LISI-ENSMA, 1999.
[33] Y.Q. Song, F. Simonot-Lion, and B. Pierre, VACANS A tool for the validation of CAN-based
applications, in Proceedings of WFCS97, Barcelona, Spain, October 1997.
[34] C. Paolo, Y.Q. Song, F. Simonot-Lion, and A. Mondher, Analysis and simulation methods for
performance evaluation of a multiple networked embedded architecture, IEEE Transactions on
Industrial Electronics, 49, 12511264, 2002.
[35] C. Alain, The electrical electronic architecture of PSA Peugeot Citroen vehicles: current situation
and future trends, in Presentation at Networking and Communication in the Automobile, Munich,
Germany, March 2000.
[36] K. Tindell and A. Burns, Guaranteed message latencies on controller area network (CAN),
in Proceedings of the 1st International CAN Conference, ICC94, 1994.
[37] SES Workbench, HyPerformix Inc. http://www.hyperformix.com.
2006 by Taylor & Francis Group, LLC
42
Fault-Tolerant Services for Safe In-Car Embedded Systems
Nicolas Navet and Françoise Simonot-Lion
Institut National Polytechnique de Lorraine
42.1 Introduction
    The Issue of Safety-Critical Systems in the Automotive Industry • Generic Concepts of Dependability
42.2 Safety-Relevant Communication Services
    Reliable Communication • Higher-Level Services
42.3 Fault-Tolerant Communication Systems
    Dependability from Scratch: TTP/C • Scalable Dependability: FlexRay • Adding Missing Features to an Existing Protocol: CAN
42.4 Conclusion
Acknowledgment
References
42.1 Introduction
In the next decade, most features of a car will be supported by electronic embedded systems. This strategy is already used for functions such as light, window, and door management, as well as for the control of traditional functions such as braking and steering. Moreover, the planned deployment of X-by-Wire technologies is leading the automotive industry into the world of safety-critical applications. Such systems must, obviously, respect their functional requirements and meet performance and cost constraints but, furthermore, they must guarantee their dependability despite the faults (physical or design) that may occur. More precisely, the design of such systems must take into account two kinds of dependability requirements: on the one hand, safety, the absence of catastrophic consequences for the driver, the passengers, and the environment, has to be ensured; on the other hand, the system has to provide a reliable service and be available upon the requests of its users. This chapter introduces the emerging standards that are likely to influence the certification process for in-vehicle embedded systems and describes the general concepts of dependability and the means by which dependability can be attained. The communication system is a key point for an application: it is in charge of transmitting critical information or events between functions that are deployed on distant stations (Electronic Control Units, ECUs) and it is a means for the OEMs (car-makers) to integrate functions provided by different suppliers. So, in this chapter, we pay special attention to in-vehicle embedded networks and to the services that enhance the dependability of the exchanges and of the embedded applications. Note that a classical means, sometimes imposed by regulatory policies in domains close to the automotive one, consists of introducing mechanisms that enable a system to tolerate faults. The purpose of Section 42.2 is to present the main services, provided by a protocol, that allow an application to tolerate certain faults. These services generally provide fault detection and, for some of them, are able to mask fault occurrences from the upper layers and to prevent the propagation of faults. In Section 42.3, we compare some classes of protocols with respect to their ability to provide services for increasing the dependability of an application. For each class, we discuss the effort needed at the middleware or application level to reach the same quality of service.
42.1.1 The Issue of Safety-Critical Systems in the Automotive Industry
In some domains recognized as critical (e.g., nuclear plants, railways, avionics), the safety requirements placed on computer-based embedded systems are very rigorous, and the manner of specifying and managing dependability/safety requirements is an important issue. These systems have to obey regulatory policies that require these industries to follow a precise certification process. At the moment, nothing similar exists in the automotive industry for certifying electronic embedded systems. Nevertheless, the problem is crucial for car-makers as well as for suppliers and, so, several proposals are presently under study. Among the existing certification standards [1], RTCA/DO-178B [2], used in avionics, and EN50128 [3], applied in the railway industry, provide stringent guidelines for the development of a safety-critical embedded system. But these standards are hard to transpose to in-vehicle software-based systems because of the practices they require: partitioning of software (critical/noncritical), multiple versions of dissimilar software components, and the use of active redundancy and hardware redundancy. In the automotive sector, the Motor Industry Software Reliability Association (MISRA), a consortium of the major actors of the automotive industry in the UK, proposes a loose model for the safety-directed development of vehicles with on-board software [4]. Finally, the generic standard IEC 61508 [5], applied to Electrical/Electronic/Programmable Electronic systems, is a good candidate for supporting a certification process in the automotive industry. In Europe, in particular in the transport domain, the trend is to move from "rule-based" to "risk-based" regulation [6]. So, the certification process will certainly be based on the definition of safety performance levels that characterize a safety function according to the consequences of its failures, classified as catastrophic, severe, major, minor, or insignificant. The IEC 61508 standard proposes, in addition to other requirements on the design, validation, and testing processes, four integrity levels, termed Safety Integrity Levels (SILs), and a quantitative safety requirement for each (see Table 42.1). The challenge is therefore to prove that each function realized by a computer-based system reaches the requirements imposed by its SIL. "Dependability," "safety," "failure," etc., are terms used in standards documents, so, in the next section, we recall the definitions admitted in the context of dependability.
TABLE 42.1 Relationship between Integrity Levels and Quantitative Requirements for a System in Continuous Operation (IEC 61508)

Integrity level    Probability of dangerous failure occurrence/h
SIL 4              P ≤ 10⁻⁸
SIL 3              10⁻⁸ < P ≤ 10⁻⁷
SIL 2              10⁻⁷ < P ≤ 10⁻⁶
SIL 1              10⁻⁶ < P ≤ 10⁻⁵
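The thresholds of Table 42.1 can be encoded as a small lookup (illustrative only; IEC 61508 also defines a low-demand mode of operation with different target measures, not shown here, and the function name is ours):

```python
def sil_for_failure_rate(p_per_hour):
    """Map a dangerous-failure probability per hour (continuous operation)
    to the highest IEC 61508 Safety Integrity Level it satisfies."""
    if p_per_hour <= 1e-8:
        return 4
    if p_per_hour <= 1e-7:
        return 3
    if p_per_hour <= 1e-6:
        return 2
    if p_per_hour <= 1e-5:
        return 1
    return None  # does not even meet SIL 1

assert sil_for_failure_rate(5e-9) == 4   # within the SIL 4 band
assert sil_for_failure_rate(5e-7) == 2   # between 10^-7 and 10^-6
```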
42.1.2 Generic Concepts of Dependability
Dependability is defined in Reference 7 as the ability of a system to deliver a service that can justifiably be trusted. The service delivered by a system is its behavior as it is perceived by another system (human or physical) interacting with it.

A service can deviate from its desired functionality. The occurrence of such an event is termed a failure. An error is defined as the part of the system state that may cause a failure. A fault is the determined or hypothesized cause of an error. It is active when it produces an error and dormant otherwise. A system fails according to several failure modes. A failure mode characterizes a service that does not fit its desired functionality according to three parameters: the failure domain (value domain or time domain, see Section 42.2.2.2.1), the perception of the failure by the several users of the system (consistent or inconsistent), and the consequences of the failure (from insignificant to catastrophic). As we will see in Section 42.2, at the communication level, services are available to contend with the occurrence of failures in the value or time domain and to preserve the consistency, as well as the possibility, of the perception of a failure by several stations. The consequence of a failure at the communication level is the responsibility of the designer of the embedded system, and its assessment is a difficult issue.

Dependability is a concept that covers, in fact, several attributes. From a quality point of view, reliability, or the continuity of a correct service, and availability, expressing the readiness for a correct service, are important for automotive embedded systems. Note that the online detection of a low level of reliability or availability of a service supported by an embedded system can lead to the "nonavailability" of the vehicle and consequently affect the quality of the vehicle as perceived by the customer.

Safety is the reliability of the system regarding critical failure modes, that is, failure modes leading to catastrophic, severe, or major consequences [8]. This attribute characterizes the ability of a system to avoid the occurrence of catastrophic events that may be very costly in terms of monetary loss and human suffering.

One way to reach the safety objective is, first, to apply a safe development process in order to prevent and remove design faults. As presented in Reference 9, this method has to be complemented, at the design step, with an evaluation of the embedded system's behavior (fault forecasting). This can be achieved through a qualitative analysis (identification of the failure modes, component failures, and environmental conditions leading to a system failure) and a quantitative analysis (probability evaluation applied to some parameters for the verification of dependability properties). The last means for reaching dependability is to apply a fault-tolerant approach. This technique is mandatory for in-car embedded systems because the environment of the system is only partially known and the reliability of the hardware components cannot be fully guaranteed.

Note that the problem in the automotive industry is not only to comply with standards whose purpose mainly concerns the safety of the driver, the passengers, the vehicle, and its environment, but also to ensure a level of performance, comfort and, more generally, quality of the vehicle. The specification, in a quantitative way, of the properties required of an electronic embedded system, and the proof that the system meets these requirements, are the principal challenges in the automotive industry.
42.2 Safety-Relevant Communication Services
In this section, we discuss the main services and functionalities that the communication system should offer to ease the design of fault-tolerant automotive applications. In order to reduce the development time and increase quality through the reuse of validated components, these services should, as much as possible, be implemented in layers below the application-level software. More precisely, some services, such as the global time, are usually provided by the communication controller, while others, such as redundancy management, are implemented in the middleware software layer (e.g., the OSEK fault-tolerant layer [10] or the middleware described in Reference 11). As suggested in Reference 12, solutions where the middleware runs on a dedicated CPU will enhance the predictability of the system by reducing the interactions between the middleware layer and the application-level software. In particular, this will prevent conflicts in accessing the CPU, which may induce temporal faults such as missed deadlines.
42.2.1 Reliable Communication
The purpose of this section is to discuss the main services and features related to data exchange that one can expect for safety-critical automotive applications. On the one hand, some of these services serve to hide the occurrence of faults from higher levels. For example, a shielded transmission support will mask some EMIs (electromagnetic interferences), considered here as faults. On the other hand, other services are intended to detect the occurrence of errors and to avoid their propagation in the system (e.g., a Cyclic Redundancy Check [CRC] will prevent corrupted data from being used by an application process).
42.2.1.1 Robustness against EMIs
Embedded automotive systems suffer from environmental perturbations such as particles, temperature peaks, and EMIs. EMI perturbations have long been identified [13,14] as a serious threat to the correct behavior of an automotive system. EMIs can either be radiated by in-vehicle electrical devices (switches, relays, etc.) or come from a source outside the vehicle (radio, radar, flashes of lightning, etc.). EMIs can affect the correct functioning of all the electronic devices, but the transmission support is a particularly weak link. The whole problem is to ensure that the system will behave according to its specification, whatever the environment.

In general, the same Medium Access Control (MAC) protocol can be implemented on different types of physical layers (e.g., unshielded pair, shielded twisted pair, or plastic optical fiber), which exhibit significantly different behaviors with regard to EMIs (see Reference 15 for more details on the electromagnetic sensitivity of different types of transmission support). Unfortunately, the use of an all-optical network, which offers very high immunity to EMIs, is generally not feasible because of the low-cost requirement imposed by the automotive industry.
Besides using a resilient physical layer, another means to alleviate the EMI problem is to replicate the
transmission channels where each channel transports its own copy of the same frame. Although an EMI
is likely to affect both channels in quite a similar manner, the redundancy provides some resilience to
transmission errors.
The two previous approaches are classical means for hiding, as well as possible, the faults due to EMIs that can occur at the physical layer. Nevertheless, when a frame is corrupted during transmission (i.e., at least one bit has been inverted), it is crucial that the receiver be able to detect it in order to discard the frame. This is the role of the CRC, whose so-called Hamming distance indicates the number of inverted bits below which the CRC will detect the corruption. It is worth noting that, if the Hamming distance of the MAC protocol's CRC is too small with regard to the dependability objectives, a middleware layer can transparently insert an additional CRC in the data field of the MAC-level frame. This reinforces the ability of the system to detect errors occurring during transmission.
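The middleware-level CRC just described can be sketched as follows (a minimal illustration: the CRC-8 polynomial, frame layout, and function names are ours, not those of any particular automotive protocol):

```python
def crc8(data: bytes, poly: int = 0x1D, init: int = 0x00) -> int:
    """Bitwise CRC-8; the polynomial 0x1D is just an illustrative choice."""
    crc = init
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFF if crc & 0x80 else (crc << 1) & 0xFF
    return crc

def wrap_payload(signal_bytes: bytes) -> bytes:
    """Middleware (sender side): append an extra CRC inside the MAC data field."""
    return signal_bytes + bytes([crc8(signal_bytes)])

def check_payload(data_field: bytes) -> bool:
    """Middleware (receiver side): verify the extra CRC before delivering data."""
    payload, received = data_field[:-1], data_field[-1]
    return crc8(payload) == received

frame = wrap_payload(b"\x12\x34\x56")
assert check_payload(frame)                       # intact frame accepted
corrupted = bytes([frame[0] ^ 0x01]) + frame[1:]  # a single inverted bit
assert not check_payload(corrupted)               # corruption detected
```

Any nonzero CRC polynomial catches all single-bit inversions; the Hamming distance determines how many simultaneous inversions are still guaranteed to be caught.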
42.2.1.2 Time-Triggered Transmissions
One major design issue is to ensure that, at run-time, no errors will jeopardize the requirements imposed on the temporal behavior of the system; for data exchanges, these temporal requirements can be imposed on the response times of frames or on the jitter upon reception. Among communication networks, one distinguishes time-triggered (TT) protocols, where transmissions are driven by the progress of time (i.e., frames are transmitted at predefined points in time), and event-triggered (ET) protocols, where transmissions are driven by the occurrence of events. Major representatives of ET and TT protocols considered for use in safety-critical in-vehicle communications will be discussed in Section 42.3. Both types of communication have advantages and drawbacks, but it is now widely accepted that dependability is much easier to ensure using a TT bus (see, for instance, [9,16–18]), the main reasons being that:

• Access to the medium is deterministic (i.e., the order of the transmissions is defined statically at design time and organized in rounds that repeat in cycles), and thus the frame response times are bounded and there is no jitter at reception.
• It simplifies composability, which is the ability to add new nodes without affecting existing ones (this requires that some bandwidth has been reserved for their transmissions at design time; for instance, in TTP/C, some slots can be left free for future use), as well as partitioning, which is the property that ensures that a failure occurring in one subsystem cannot propagate to others.
• The behavior of a TT communication system is predictable, which makes it easier to understand its behavior and to verify that the temporal constraints are respected.
• Message transmissions can be used as heartbeats, which allows a very prompt detection of station failures.
• Finally, the medium access scheme does not limit the network bandwidth, as is the case with the arbitration on message priority used by Controller Area Network (CAN), and thus large amounts of data can be transferred between nodes.
These reasons explain why, currently, only TT communication systems are being considered for use in safety-critical applications such as steer-by-wire [19,20] or brake-by-wire.
42.2.1.3 Global Time
Some control functions need to know the order of occurrence of a set of events that happened in the system; some functions, such as diagnosis, even need to be able to date them precisely. This can be achieved by forming a global synchronized time base.
The second reason why a global time is needed comes from the TT communication scheme. In TT communications, as time drives the transmissions, all nodes of the network must have a coherent notion of time, and a clock synchronization algorithm is required. This clock synchronization algorithm is, in fact, a service that tolerates the faults that can affect local clocks. Since oscillators are not perfect, the local clocks tend to drift apart, which imposes periodic resynchronization. For instance, in TTP/C, each node periodically adjusts its clock according to the difference between its own clock and the average value of those of the other nodes (the clocks with the highest and lowest values are discarded).
A crucial performance metric for a clock synchronization algorithm is the maximum difference that can be observed among all local clocks. This value directly impacts the network's throughput in TT buses since the length of a transmission window has to include, in addition to the actual transmission time of the frame, some extra time to compensate for the skew between local clocks (i.e., a frame transmitted at the right point in time must not be rejected because the clock of a receiver diverges from the clock of the sender). Other criteria of major interest are the number and types of faults (e.g., wrong clock value or no value received) that can be tolerated by the algorithm. For example, the TTP/C algorithm can tolerate a single fault on a network composed of at least four nodes (see Reference 21 for a detailed analysis).
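The discard-the-extremes averaging scheme used by TTP/C can be sketched as follows (a deliberately simplified, single-correction view; the function name, units, and the list-of-deviations interface are ours, not the protocol's):

```python
def fta_correction(deviations):
    """Fault-tolerant average: discard the largest and smallest measured
    clock deviations (possibly produced by faulty clocks), then average
    the remaining values to obtain the local correction term."""
    if len(deviations) < 4:
        raise ValueError("at least 4 measurements needed to tolerate one fault")
    trimmed = sorted(deviations)[1:-1]  # drop the two extreme values
    return sum(trimmed) / len(trimmed)

# Deviations (in microseconds) of the local clock from frames observed on
# the bus; the +500 value comes from a node with a faulty clock and is
# discarded as an extreme, so it cannot corrupt the correction.
assert fta_correction([2.0, -1.0, 3.0, 500.0]) == 2.5
```

This is why at least four nodes are required to tolerate one clock fault: with fewer measurements, a single erroneous value can survive the trimming step.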
42.2.1.4 Atomic Broadcast and Acknowledgment
At some point in time, it is mandatory that some functions distributed over the network have the same understanding of the state of the system in order to interoperate in a satisfactory manner. This implies that the information on the state of the system must be consistent throughout the whole network (this property is termed "spatial consistency" or "exact agreement"). The requirement of spatial consistency is particularly important for active redundancy (a set of components realizing the same function in parallel, so that the system can continue to operate despite the loss of one or more units; in passive redundancy, additional components are only activated when the primary component fails), which is the basic strategy for ensuring fault tolerance, that is, the capacity of a system to deliver its service even in the presence of faults. To be able to compare the output results, it is crucial that all the replicated components process the same input data, which, in particular, implies that the values obtained from local sensors are exchanged over the network. All nonfaulty nodes must thus receive the messages in the same order and with the same content. This property, which is called "atomic broadcast" or "interactive consistent broadcast" (see References 22 and 16), enables distributed processes to reach common decisions, or "consensus," despite faults, for instance, using majority voting.
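As a toy illustration of the last point: once atomic broadcast guarantees that every nonfaulty node holds the same ordered list of replicated values, a deterministic vote yields the same decision everywhere (the function below and its inputs are illustrative):

```python
from collections import Counter

def majority_vote(replicas):
    """Return the value reported by a strict majority of replicas.
    Because atomic broadcast delivers the same list to every nonfaulty
    node, each of them computes the same decision."""
    value, count = Counter(replicas).most_common(1)[0]
    if count <= len(replicas) // 2:
        raise ValueError("no strict majority: cannot reach a decision")
    return value

# Three replicated wheel-speed samples from an FTU, one of them erroneous.
assert majority_vote([1200, 1200, 1187]) == 1200
```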
In practice, it may happen that all or a subset of the nodes do not receive a message, because of an incorrect signal shape due to EMIs or because some nodes are temporarily faulty. The communication system usually provides, through the use of a CRC for detecting corrupted frames, a weak form of atomic broadcast that ensures that all stations that successfully receive a frame get the same value. This alone is, however, not sufficient for constructing fault-tolerant applications; in addition, at least the acknowledgment of the reception of a message is needed, because the sender, and possibly other nodes, may have to adapt their behavior according to this information (e.g., reschedule the transmission of the information in a subsequent frame). This latter requirement is important, in the automotive context, for distributed functions such as steering, braking, or active suspension.
42.2.1.5 Avoiding Babbling-Idiots
As said before, it is crucial that the system does not deviate from the temporal behavior defined at design time. If a node does not behave in the specified manner, this has to be detected and masked at the communication system level in order to prevent the failure from propagating.

It may happen that a faulty ECU transmits outside its specification; for example, it may send at a wrong point in time or send a frame larger than planned at design time. When communications are multiplexed, this will perturb the correct functioning of the whole network, especially the temporal behavior of the data exchanges. One well-known manifestation is the so-called "babbling idiot" [23,24]: a node that transmits continuously (e.g., due to a defective oscillator). To avoid this situation, a component called the bus guardian restricts the controller's ability to transmit by allowing transmission only when the node exhibits the specified behavior. Ideally, the bus guardian should have its own copy of the communication schedule, should be physically separated from the controller, should possess its own power supply, and should be able to construct the global time itself. Due to the strong pressure from the automotive industry concerning costs, these assumptions are not fulfilled in general, which reduces the efficiency of the bus guardian strategy.
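The core gating rule of a bus guardian reduces to a one-line check (an intentionally minimal model; as noted above, a real guardian also needs an independent schedule copy, power supply, and time base):

```python
def guardian_allows(node_slot, current_slot, tx_request):
    """Enable the transceiver only during the node's own slot of the
    static schedule; a babbling-idiot controller requesting to transmit
    in any other slot is silenced on the spot."""
    return tx_request and current_slot == node_slot

# A controller babbling in every slot only gets through in its own slot 2.
assert guardian_allows(node_slot=2, current_slot=2, tx_request=True)
assert not guardian_allows(node_slot=2, current_slot=5, tx_request=True)
```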
If the network has a star topology, with a central interface called the "star" for interconnection, instead of the classical bus topology, then the star can act as a central bus guardian and protect against errors that cannot be avoided by a local bus guardian. For instance, a star topology is more resilient to spatial-proximity faults (e.g., temperature peaks) and to faults due to the desynchronization of an ECU (i.e., the star can disconnect a desynchronized station). To avoid a single point of failure, a dual-star topology should be used, with the drawback that the length of the wires is significantly increased.
42.2.2 Higher-Level Services
In this section, we identify services that provide fault-tolerant mechanisms belonging conceptually to
layers above the MAC in the OSI reference model.
42.2.2.1 Group Membership Service
As discussed in Section 42.2.1.4, atomic broadcast ensures that all nonfaulty stations possess the same variables describing the state of the system at a particular point in time. Another property that is required for implementing fault tolerance at a high level is that all nonfaulty stations know the set of stations that are operational (or nonfaulty). This service, which is basically a consensus on the set of operational nodes, is provided by the "group membership," and it is generally highly recommended for X-by-Wire applications. A classical example, detailed in Reference 12, is a brake-by-wire system where four ECUs, interconnected by a network, control the brakes located at the four wheels of the car. As soon as a wheel ECU is no longer functioning, the brake force applied to its wheel has to be redistributed among the remaining three wheels in such a way that the car can be safely parked. As pointed out in Reference 12, for a brake-by-wire application, the time interval between the failure of the wheel ECU and the knowledge of this event by all the other stations has an impact on the safety of the application, and thus it has to be bounded and taken into account at design time.
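A toy version of the brake-force redistribution just described (the even split is an invented policy for illustration; a real brake-by-wire system would derive the per-wheel shares from a vehicle-dynamics model):

```python
def redistribute(total_demand, operational_wheels):
    """Split a total brake-force demand over the wheel ECUs currently in
    the membership list; when an ECU drops out, its share is spread over
    the remaining wheels so the car can still be safely stopped."""
    if not operational_wheels:
        raise RuntimeError("no operational brake ECU left")
    share = total_demand / len(operational_wheels)
    return {wheel: share for wheel in operational_wheels}

# All four wheels operational, then the rear-right ECU leaves the membership.
assert redistribute(2000.0, ["FL", "FR", "RL", "RR"])["FL"] == 500.0
assert redistribute(2000.0, ["FL", "FR", "RL"])["FL"] == 2000.0 / 3
```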
A membership service implemented at the communication system level assumes that all nodes that are correctly participating in the communication protocol are nonfaulty. In TT systems, as transmissions are perfectly foreseeable, the decisions regarding membership can be taken at the points in time where frames should have been received. In a very simplified way, a missing or faulty frame indicates to the receivers that the sending node is not functioning properly. In addition, a node that is unable to transmit must consider itself as faulty and stop operating. Since it takes some time to detect faulty nodes, there can be faulty stations in the membership list of a node during some time intervals. The maximum number of such undetected faulty nodes, the maximum duration it takes to discover that a node is faulty, the maximum number of faulty stations, and the types of faults that can be detected are major performance criteria of a membership algorithm. Other criteria include the time needed for a repaired node to rejoin the membership list, how well the different nodes agree on the membership list at any point in time (are "cliques," i.e., sets of stations that disagree on the state of the system, possible, and how long can such cliques coexist?), and the implementation overheads, mainly in terms of CPU load and network bandwidth.
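The detection rule sketched above ("no valid frame in a node's slot implies the sender is suspected faulty") can be illustrated for one TT round as follows (a simplified local view with invented ECU names; real membership algorithms must additionally ensure that all receivers agree on the resulting list):

```python
def update_membership(members, received_ok):
    """One TT round: every member owns a slot; a missing or detectably
    faulty frame in that slot removes the sender from the local
    membership list of this receiver."""
    return {node for node in members if received_ok.get(node, False)}

members = {"ECU_FL", "ECU_FR", "ECU_RL", "ECU_RR"}
# Frames observed this round: the slot of ECU_RR stayed silent.
seen = {"ECU_FL": True, "ECU_FR": True, "ECU_RL": True, "ECU_RR": False}
assert update_membership(members, seen) == {"ECU_FL", "ECU_FR", "ECU_RL"}
```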
Group membership algorithms are complex distributed algorithms and formal methods are of great
help in analyzing and validating them; the reader can refer to [21,22,25,26] as good starting points on this
topic.
42.2.2.2 Management of Nodes Redundancy
A classical way of ensuring fault tolerance is to replicate critical components. We saw, in Section 42.2.1.1, that the redundancy of the bus can hide faults due to EMIs. To achieve fault tolerance, certain nodes are also replicated and clustered into so-called Fault-Tolerant Units (FTUs). An FTU is a set of several stations that perform the same function; each node of an FTU possesses its own slot in the round, so that the failure of one or more stations of the same FTU can be tolerated. Actually, the role of FTUs is twofold. First, they make the system resilient to transmission errors (some frames sent by nodes of the FTU may be correct while others are corrupted). Second, they provide a means to fight against measurement and computation errors occurring before transmission (some nodes may send correct values while others make errors).
42.2.2.2.1 Fail-Silence Property
In the fault-tolerance terminology, a node is said to be fail-silent if (1) it sends frames at the correct points in time (correctness in the time domain), and (2) the correct values are transmitted (correctness in the value domain), or (3) it sends detectably incorrect frames (e.g., with a wrong CRC) in its own slot, or no frame at all. A communication system such as TTP/C provides very good support for requirements (1) and (3) (whose fulfillment provides the so-called fail-silence in the temporal domain), especially through the bus guardian concept (see Section 42.2.1.5), while the value domain is the responsibility of higher-level layers. The use of fail-silent nodes greatly decreases the complexity of designing a critical application, since data produced by fail-silent nodes are always correct and thus can be safely consumed by the receivers. Tolerating one arbitrary failure can be achieved with FTUs made of two nodes, whereas three are necessary if the nodes are not fail-silent. However, in practice, it is difficult to ensure the fail-silence assumption, especially in the value domain. Basically, a fail-silent node has to implement redundancy plus error-detection mechanisms and stop functioning after a failure is detected. Self-checking mechanisms can be implemented in hardware or, more usually, in software on commercial off-the-shelf hardware [27]. An example of such a mechanism is the "double execution" strategy, which consists of running each task twice and comparing the outputs. However, both executions can be affected in the same way by a single error; a solution that provides some protection against such so-called common-mode faults is to perform a third execution with a set of reference input data and to compare its output with a precomputed result that is known to be correct. This strategy is known as "double execution with reference check."
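A minimal sketch of double execution with reference check (the brake-force task, the reference input, and the precomputed reference output below are invented for illustration):

```python
def fail_silent_run(task, inputs, ref_input, ref_output):
    """Run the task twice on the real inputs and compare the outputs;
    then run it once on a reference input whose correct output is known,
    to catch common-mode faults. Return the result, or None to stay silent."""
    first, second = task(inputs), task(inputs)
    if first != second:
        return None                    # divergent executions: stay silent
    if task(ref_input) != ref_output:  # reference check failed: stay silent
        return None
    return first

# Hypothetical task: compute a brake-force command from a pedal position.
def brake_force(pedal_percent):
    return int(pedal_percent * 30)

assert fail_silent_run(brake_force, 50, 100, 3000) == 1500
assert fail_silent_run(brake_force, 50, 100, 9999) is None  # fault ⇒ silence
```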
The reader is referred to References 11, 27, and 28 for good starting points on the problem of
implementing fail-silent nodes.
42.2.2.2.2 Message Agreement
From an implementation point of view, it is usually preferable to present only one copy of data to the
application in order to simplify the application code (considering possible divergences between replicated
message instances is not needed) and to keep it independent from the degree of redundancy (i.e., the
number of nodes composing an FTU).
The algorithm responsible for choosing the value that will be transmitted to the application is
termed the agreement algorithm. Many agreement strategies are possible: pick-any (replicated messages
are coming from an FTU made of fail-silent nodes), average-value, pick-a-particular-one (the selected
value has been produced by the best sensor), majority vote, etc. The OSEK/VDX consortium [10] has proposed
a software layer responsible for implementing the agreement strategy. Two other important services
of the OSEK FTCom (Fault-Tolerant Communication layer) are (1) to manage the packing of signals
(elementary pieces of information such as the speed of the vehicle) into frames according to a precomputed
configuration, which is needed if the use of network bandwidth is to be optimized (see, for instance,
References 29 and 30 for frame-packing algorithms), and (2) to provide message filtering mechanisms for
passing only significant data to the application. Another fault-tolerant layer offering the agreement
service, along with the set of associated tools, is described in Reference 11.
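To make the strategies concrete, the following Python sketch implements a minimal agreement layer. The names are illustrative, not FTCom's actual interface (which is defined by the OSEK/VDX specification [10]); `None` models a replica missing because its node stayed silent.

```python
from collections import Counter

def agree(replicas, strategy="pick_any"):
    """Choose the single value passed to the application from the replicated
    message instances produced by the nodes of one FTU."""
    values = [v for v in replicas if v is not None]  # drop replicas from silent nodes
    if not values:
        return None
    if strategy == "pick_any":
        # valid when the FTU is made of fail-silent nodes: any received copy is correct
        return values[0]
    if strategy == "average":
        # suited to replicated sensor readings subject to small deviations
        return sum(values) / len(values)
    if strategy == "majority":
        # tolerates value faults as long as a strict majority of replicas agrees
        value, count = Counter(values).most_common(1)[0]
        return value if count > len(values) / 2 else None
    raise ValueError("unknown strategy: %s" % strategy)
```

Note that the application code calling `agree` never sees the degree of redundancy: whether the FTU has two or four nodes only changes the length of `replicas`, which is exactly the independence property mentioned above.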
42.2.2.3 Support for Functioning Modes
A functioning mode is a specific operational phase of an application. Typically, several mutually exclusive
functioning modes are defined in a safety-critical application. For a vehicle, possible modes
include factory mode (e.g., download of calibration parameters), prerun mode (after the doors are unlocked
and before the engine is started, preheating is possible for some components), postrun mode (the engine
has been shut off but, e.g., cooling can still be necessary), park mode (most ECUs are powered off), and even
a show-room mode. Besides these normal functioning modes, the occurrence of a failure can trigger the
switch to a particular mode that aims to bring the system back to a safe state.
Particular functions correspond to each functioning mode, which means a different set of tasks and
messages as well as different schedules. While mode changes provide flexibility, great care must be taken
that changes happen at the right points in time and that all nodes agree on the current mode. The
communication system can provide some support in this area by ensuring that mode changes take place
only at predefined points in time, are triggered only by authorized nodes, and that the message schedule is
changed simultaneously for all nodes. For example, TTP/C [31,32] offers services for immediate mode
changes (i.e., the change is performed at the end of the transmission window where it was requested)
as well as deferred mode changes (i.e., the change is performed at the end of the current message schedule,
or cluster cycle in the TTP/C terminology).
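The difference between immediate and deferred mode changes can be pictured with a small Python model. This is a toy sketch, not the TTP/C services themselves: slots and cycles are simplified to counters, and the authorization check on the requesting node is reduced to a comment.

```python
class ModeManager:
    """Toy model of mode changes aligned on communication rounds: an immediate
    change applies at the end of the window where it was requested, a deferred
    one waits for the end of the current cluster cycle."""

    def __init__(self, slots_per_cycle):
        self.mode = "normal"
        self.slots_per_cycle = slots_per_cycle
        self.slot = 0
        self.pending = None  # (new_mode, deferred flag)

    def request(self, new_mode, deferred=False):
        # in a real system, only authorized nodes may issue this request
        self.pending = (new_mode, deferred)

    def end_of_window(self):
        self.slot = (self.slot + 1) % self.slots_per_cycle
        if self.pending is None:
            return
        new_mode, deferred = self.pending
        if not deferred or self.slot == 0:  # slot 0 marks a new cluster cycle
            self.mode, self.pending = new_mode, None
```

With four slots per cycle, an immediate request takes effect at the very next `end_of_window`, while a deferred request issued mid-cycle is only applied once the slot counter wraps around, so all nodes switch schedules on the same cycle boundary.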
42.3 Fault-Tolerant Communication Systems
Among the communication protocols considered for use in safety-critical automotive systems,
one can distinguish three main types:
Protocols that have been designed from scratch to provide all the main fault-tolerant services. The
prominent representative of this class is the TTP/C protocol [47].
Protocols that offer the basic functionalities for fault-tolerant systems, among which are a global
time and bus guardians. The idea is to allow scalable dependability on a per-network or
even per-node basis. Missing features are to be implemented in software layers above
the communication controllers. The representative of this class in the automotive context is
FlexRay [33].
Protocols not initially conceived with the objective of fault tolerance, to which missing features are
added. This is the case with CAN [34], the current de facto standard in production cars, which is being
considered for use in safety-critical applications (see, for instance, Reference 17) on the condition
that additional features are provided.
42.3.1 Dependability from Scratch: TTP/C
The TTP/C protocol, which is specified in Reference 32, was designed and extensively studied at the
Vienna University of Technology. TTP/C is a central part of the Time-Triggered Architecture (TTA; see
Reference 35), which is a complete framework for building fault-tolerant distributed applications according
to the TT paradigm. Hardware implementations of the TTP/C protocol, as well as software tools for the
design of the application, are commercialized by the TTTech company and are available today.
On a TTP/C network, the transmission support is replicated and each channel transports its own copy of the
same message. TTP/C can be implemented with a bus topology or a more resilient single-star or dual-star
topology. At the MAC level, the TTP/C protocol implements a synchronous TDMA scheme: the stations
(or nodes) have access to the bus in a strict deterministic sequential order and each station possesses the
bus for a constant period of time, called a slot, during which it has to transmit one frame. The sequence of
slots in which all stations have accessed the bus once is called a TDMA round. The size of the slot
is not necessarily identical for all stations in the TDMA round, but a slot belonging to one station is the
same size in each round. Consecutive TDMA rounds may differ in the data transmitted during
the slots, and the sequence of all TDMA rounds forms the cluster cycle, which repeats itself indefinitely.
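The TDMA access scheme just described can be illustrated with a short Python helper. The function is hypothetical and assumes a perfectly synchronized global time expressed in abstract time units; it simply shows how slot ownership follows deterministically from the round structure.

```python
def slot_owner(global_time, slot_lengths, owners):
    """Return the station allowed to transmit at `global_time` under a
    TTP/C-like TDMA scheme: slots of fixed (possibly unequal) lengths
    repeat in the same order every round."""
    t = global_time % sum(slot_lengths)  # position within the current TDMA round
    for length, owner in zip(slot_lengths, owners):
        if t < length:
            return owner
        t -= length
```

With `slot_lengths = [2, 3, 1]` and `owners = ["A", "B", "C"]`, the round is 6 time units long: times 0-1 belong to A, 2-4 to B, 5 to C, and time 6 wraps back to A's slot of the next round. Because every node can evaluate this function from the global time alone, no arbitration traffic is needed on the bus.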
TTP/C possesses numerous features and services related to dependability along with TT com-
munication. In particular, TTP/C implements a clique avoidance algorithm (the stations that belong to
a minority in their understanding of the state of the system will eventually be excluded) and a mem-
bership algorithm that also provides data acknowledgment (one knows after a bounded time whether a
station has received a message or not). A bus guardian, a global clock, and support for mode changes are also
part of the specification.
The algorithms used in TTP/C are by themselves intricate and interact in a very complex manner, but
most of them have been formally verified (see [21,25,36]). The fault hypothesis used for the design of
TTP/C is well specified, but also quite restrictive (two successive faults, such as transmission errors, must
occur at least two rounds apart). Situations outside the fault hypothesis are treated using never-give-up
(NGU) strategies that aim to continue operating in a degraded mode. From the point of view of the set of
available services, TTP/C is a mature solution. In our opinion, future research should investigate whether
the fault hypothesis considered in the TTP/C design is pertinent in the context of automotive embedded
systems, where the environment can be very harsh (e.g., bursts of transmission errors may happen). This
could start from measurements taken on board prototypes, which would help to estimate the
relevance of the fault hypothesis. Other research could study the behavior of the communication system
outside the fault hypothesis and its impact on the application; this could be undertaken using fault
injection.
42.3.2 Scalable Dependability: FlexRay
A consortium of major companies from the automotive field is currently developing the FlexRay
protocol. The core members are BMW, Bosch, Daimler-Chrysler, General Motors, Motorola, Philips,
and Volkswagen. The first publicly available specifications of the FlexRay protocol have already been
released [33].
The FlexRay network is very flexible with regard to topology and transmission support redundancy.
It can be configured as a bus, a star, or a multistar, and it is not mandatory that each station possess
replicated channels or a bus guardian, even though this should be the case for critical functions. At the
MAC level, FlexRay defines a communication cycle as the concatenation of a TT (or static) window and
an ET (or dynamic) window. In each communication window, whose size is set statically at design time,
a different protocol is applied. The communication cycles are executed periodically. The TT window uses
a TDMA MAC protocol; the main difference with TTP/C is that a station might possess several slots in
the TT window, but the size of all the slots is identical.
In the ET part of the communication cycle, the protocol is FTDMA (Flexible Time Division Multiple
Access): time is divided into so-called minislots, each station possesses a given number of minislots
FIGURE 42.1 Example of message scheduling in the dynamic segment of the FlexRay communication cycle (slot counter values and frame IDs n through n+7 on channels A and B, with idle minislots interleaved).
(not necessarily consecutive) and it can start the transmission of a frame inside each of its own minislots.
The bus guardian is not used in the dynamic window to control whether transmissions take place as
specified. A minislot remains idle if the station has nothing to transmit. An example of a dynamic window
is shown in Figure 42.1: on channel B, frame n is transmitted starting in minislot n while minislots n + 1
and n + 2 have not been used. It is noteworthy that frame n + 4 is not received simultaneously on channels
A and B since, in the dynamic window, transmissions are independent on the two channels.
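The minislot mechanism can be sketched in a few lines of Python. This is a simplified single-channel model with illustrative names; among other things, the real protocol also bounds frame length so that a transmission cannot overrun the end of the dynamic window.

```python
def dynamic_window(pending, budget):
    """Simulate one FlexRay-like dynamic window on one channel.
    `pending` maps each slot-counter value to the number of minislots its
    frame needs, or None if that station has nothing to send. Returns the
    slot numbers whose frames were actually transmitted within `budget`
    minislots."""
    sent, used = [], 0
    for slot, length in sorted(pending.items()):
        if used >= budget:
            break                      # window exhausted: remaining frames wait
        if length is None:
            used += 1                  # idle minislot: counter advances by one
        elif used + length <= budget:
            used += length             # a transmission spans several minislots
            sent.append(slot)
        else:
            used += 1                  # frame no longer fits in this cycle
    return sent
```

The model makes the priority property of FTDMA visible: an idle minislot costs only one tick, so stations with low slot numbers consume little of the window when silent, while a frame near the end of the window may be postponed to the next communication cycle.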
The FlexRay MAC protocol is more flexible than the TTP/C MAC since, in the static window, nodes are
assigned as many slots as necessary (up to 4095 for each node) and since frames are only transmitted if
necessary in the dynamic part of the communication cycle. As with TTP/C, the structure
of the communication cycle is statically stored in the nodes; however, unlike TTP/C, mode changes with
a different communication schedule for each mode are not possible.
From the dependability point of view, FlexRay specifies solely TT communication with a bus guardian
and a clock synchronization algorithm on dual wires (shielded or unshielded; see Reference 37 for the
specification of the physical layer). Considering the brake-by-wire example of Section 42.2.2.1,
the protocol offers no way for a node to know that one of the wheel ECUs is no longer operational,
which would be needed to take the appropriate decision (e.g., redistribution of the brake force). Features
that can be necessary for implementing fault-tolerant applications, such as membership and acknowledg-
ment services or mode management facilities, will have to be implemented in software or hardware layers
on top of FlexRay, with the drawback that efficient implementations might be more difficult to achieve
above the data-link layer. There are indeed individual solutions in the literature for each of the missing
services, but these protocols might have very complex interactions when used jointly, which requires that
the whole communication profile be carefully validated by tests, simulation, fault injection, and formal
proof under a well-defined fault hypothesis.
In automotive systems, critical and noncritical functions will increasingly coexist and interoperate.
In the FlexRay specification ([33], p. 8), it is argued that the protocol provides scalable dependability, that is,
the ability to operate in configurations that provide various degrees of fault tolerance. Indeed, the
protocol allows for mixing single and dual transmission supports (interconnected through a star) on the
same network, subnetworks of nodes without bus guardians or with different fault-tolerance capabilities
with regard to clock synchronization, nodes that do not send or receive TT messages, etc. This flexibility
can prove to be efficient in the automotive context in terms of cost and reuse of existing components if
missing fault-tolerance features are provided in a middleware layer such as OSEK FTCom (see the introduction
of Section 42.2 and Reference 10) or the one currently under development within the automotive industry
project AUTOSAR (see http://www.autosar.org).
42.3.3 Adding Missing Features to an Existing Protocol: CAN
Controller Area Network has proved to be a very cost- and performance-effective solution for data exchange
in automotive systems during the last 15 years. However, as specified by the ISO standards [34,38],
CAN lacks almost all the features and services identified in Section 42.2 as important for the
implementation of fault-tolerant systems: no redundant medium, no TT communication, no global time,
no atomic broadcast (even in the weak form described in Section 42.2.1.4, due to the well-known
inconsistent message omission [39]), no reliable acknowledgment, no bus guardian, no group membership, no
functioning-mode management services, etc.
Some authors advocate that CAN can be used as a base and that missing facilities can be added as needed
[17] and, over the last few years, there has in fact been a number of studies and proposals aimed at adding
fault-tolerant features to CAN (see, for instance, [9,40-48]). In the rest of this section, we discuss some such
proposals of possible interest for automotive systems.
42.3.3.1 TTCAN: TT Communications on Top of CAN
Two main protocols have been proposed to enable TT transmissions over CAN: TTCAN (Time-Triggered
Controller Area Network; see References 40 and 49) and FTT-CAN (Flexible Time-Triggered CAN; see
Reference 9). In the following, we consider TTCAN, which has received much attention in the automotive
field since it was proposed by Robert Bosch GmbH, a major actor in the automotive industry.
Time-Triggered CAN was developed on the basis of the CAN physical and data-link layers. The bus
topology of the network, the characteristics of the transmission support, the frame format, as well as the
maximum data rate of 1 Mbit/sec are imposed by the CAN protocol [49]. In addition to the standard
CAN features, TTCAN controllers must have the possibility to disable automatic retransmission and to
provide the application with the time at which the first bit of a frame was sent or received [49]. Channel
redundancy is possible, but not standardized, and no bus guardian is implemented in the nodes. The key
idea is to propose, as with FlexRay, a flexible TT/ET protocol. TTCAN defines a basic cycle (the equivalent
of the FlexRay communication cycle) as the concatenation of one or several TT (or exclusive) windows
and one ET (or arbitrating) window. Exclusive windows are devoted to TT transmissions (i.e., periodic
messages) while the arbitrating window is ruled by the standard CAN protocol: transmissions are dynamic
and bus access is granted according to the priority of the frames. Several basic cycles, which differ in their
organization (exclusive and arbitrating windows) and in the messages sent inside the exclusive windows, can
be defined. The list of successive basic cycles is called the system matrix, and the matrix is executed in
loops. Interestingly, the protocol enables the master node, the node that initiates the basic cycle through the
transmission of the reference message, to stop functioning in TTCAN mode and to resume in standard
CAN mode. Later, the master node can switch back to TTCAN mode by sending a reference message.
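A system matrix can be pictured as a small table that the schedule executes in loops. The sketch below uses illustrative message names (not taken from any real configuration); each row is one basic cycle beginning with the reference message, and "ARB" marks an arbitrating window ruled by standard CAN arbitration.

```python
# Each row is one basic cycle; each entry is either a time-triggered message
# name (exclusive window) or "ARB", an arbitrating window in which standard
# priority-based CAN arbitration applies. Column 0 is the reference message.
SYSTEM_MATRIX = [
    ["ref", "engine_speed", "ARB",       "wheel_speed"],
    ["ref", "engine_speed", "brake_cmd", "ARB"        ],
]

def window(cycle, column):
    """Content of the given window of the given basic cycle; the matrix
    wraps around because the list of basic cycles is executed in loops."""
    return SYSTEM_MATRIX[cycle % len(SYSTEM_MATRIX)][column]
```

So the same column can be exclusive in one basic cycle and arbitrating in the next, which is how TTCAN mixes periodic TT traffic with sporadic ET traffic on a single bus.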
42.3.3.2 Improving Error Confinement
The Controller Area Network protocol possesses fault-confinement mechanisms aimed at differentiating
between short disturbances caused by EMI and permanent failures due to hardware malfunction.
The scheme is based on error counters that are increased and decreased according to particular events
(e.g., successful reception of a frame, reception of a corrupted frame, etc.). The relevance of the algorithms
involved is questionable (see Reference 50), but the main drawback is that a node has to diagnose itself,
which can lead to the nondetection of some critical errors, such as the node transmitting a dominant
bit continuously (one manifestation of the babbling-idiot fault known as stuck-at-dominant;
see Section 42.2.1.5 and Reference 46). Furthermore, other faults, such as the partitioning of the network
into several subnetworks, may prevent all nodes from communicating due to bad signal reflection at the
extremities.
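The counter scheme can be sketched as follows. This is a deliberately simplified model: the real ISO rules distinguish many more cases (the increments differ for transmitters and receivers and for specific error conditions), but the state progression it shows is the standard one.

```python
class CanFaultConfinement:
    """Simplified CAN error counters: a node moves from error-active to
    error-passive and finally to bus-off as its own counters grow."""

    def __init__(self):
        self.tec = 0  # transmit error counter
        self.rec = 0  # receive error counter

    def transmit_error(self):
        self.tec += 8                      # typical increment on a transmit error

    def receive_error(self):
        self.rec += 1

    def successful_transmission(self):
        self.tec = max(0, self.tec - 1)    # counters decay on success

    def successful_reception(self):
        self.rec = max(0, self.rec - 1)

    @property
    def state(self):
        if self.tec >= 256:
            return "bus-off"               # node disconnects itself from the bus
        if self.tec >= 128 or self.rec >= 128:
            return "error-passive"         # may only signal passive error flags
        return "error-active"
```

The drawback discussed above is visible in the structure of the code: the counters are maintained by the node about itself, so a node whose transmitter is stuck dominant may never charge its own counters for the damage it does to the bus.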
To address these problems, several solutions have been proposed, among which are the variant of RedCAN
discussed in Reference 47 and CANcentrate, discussed in Reference 46. The latter proposal is an active star
that integrates fault-diagnosis and fault-confinement mechanisms that can, in particular, prevent
stuck-at-dominant behavior. The former proposal relies on a ring architecture where each node is
connected to the bus through a switch that possesses the ability to exclude a faulty node or a faulty
segment from the communication. These two proposals are promising, but developments are still needed
(e.g., test implementations, fault injection, formal proofs) before they can actually be used in safety-critical
applications. Furthermore, some faults, such as a node transmitting correct frames more often than
specified at design time, are not covered by these proposals.
Many other mechanisms have been proposed for increasing the dependability of CAN-based networks
[41-45,48] but, as pointed out in Reference 43, while each proposal solves a particular problem, they were
not designed to be combined. Furthermore, the fault hypotheses used in the designs are not necessarily the
same, and the interactions between the protocols remain to be studied in a formal way.
42.4 Conclusion
In the current state of practice, automotive embedded systems make wide use of fault-prevention
(e.g., shielded ECUs or transmission supports), fault-detection (e.g., a watchdog ECU that monitors the
functioning state of the engine controller, or checks whether data are obsolete or out of range), and fault-
confinement techniques (e.g., missing critical data are reconstituted on the basis of other data and, more
generally, the specification and implementation of several degraded functioning modes). Redundancy is used
at the sensor level (e.g., for the wheel angle) but seldom at the ECU level, because of cost pressure and
because the criticality of the functions does not absolutely impose it. Some future functions, such as
brake- and steer-by-wire, are likely to require active redundancy in order to comply with the acceptable
risk levels and the design guidelines that could be issued by certification bodies.
For critical functions that are distributed and replicated throughout the network, the communication
system will play a central role by providing the services that simplify the implementation of fault-
tolerant applications. The candidate networks are TTP/C, FlexRay, and CAN-based TT solutions.
TTP/C is a mature technology that provides the most important services for supporting fault-tolerant
applications. Moreover, TTP/C was designed under a well-specified fault hypothesis and the correctness
of most of its algorithms has been formally proven. In our opinion, future research should investigate the
relevance of the TTP/C fault hypothesis in the context of automotive embedded systems and the behavior
of the protocol outside the fault hypothesis. At the time of writing, FlexRay, which is developed by the
major actors of the European automotive industry, seems in a strong position to become a standard in
the industry. The main advantage of FlexRay is its flexibility; in particular, it provides both TT and ET
communications, and nodes with different fault-tolerance capabilities can coexist on the same network.
The services provided by FlexRay do not fulfill all the needs for fault tolerance, and higher-level protocols
will have to be developed and validated before FlexRay can be used in very demanding applications. The
major issue is that higher-level implementations tend to be less efficient (e.g., bandwidth overhead for
acknowledgment, maximum time needed for detecting faulty nodes). Finally, the solutions based on the
TTCAN protocol will require additional low-level mechanisms for fault confinement as well as higher-
level services such as atomic broadcast and membership. Many proposals exist for more dependability
on CAN-based networks, but much work remains to be done to come up with a coherent and validated
communication stack that includes all the necessary services.
Acknowledgment
We would like to thank Mr. Christophe Marchand, project leader in the field of diagnosis at PSA Peugeot
Citroën, for helpful comments on an earlier version of this chapter.
References
[1] Y. Papadopoulos and J.A. McDermid. The potential for a generic approach to certification of
safety-critical systems in the transportation sector. Journal of Reliability Engineering and System
Safety, 63: 47-66, 1999.
[2] Radio Technical Commission for Aeronautics. RTCA DO-178B, software considerations in
airborne systems and equipment certification, 1994.
[3] CENELEC. Railway applications: software for railway control and protection systems,
EN 50128, 2001.
[4] P.H. Jesty, K.M. Hobley, R. Evans, and I. Kendall. Safety analysis of vehicle-based systems.
In Proceedings of the 8th Safety-Critical Systems Symposium, Southampton, UK, 2000.
[5] IEC. IEC 61508-1, Functional Safety of Electrical/Electronic/Programmable Electronic Safety-Related
Systems, Part 1: General Requirements, IEC/SC65A, 1998.
[6] J.A. McDermid. Trends in system safety: a European view? In Proceedings of the 7th Australian
Workshop on Safety Critical Systems and Software, North Adelaide, Australia, 2002.
[7] A. Avizienis, J. Laprie, and B. Randell. Fundamental concepts of dependability. In Proceedings of
the 3rd Information Survivability Workshop, Boston, USA, 2000, pp. 7-12.
[8] ARTIST, Project IST-2001-34820. Selected topics in embedded systems design: roadmaps for
research, May 2004. Available at http://www.artist-embedded.org/Roadmaps/
ARTIST_Roadmaps_Y2.pdf.
[9] J. Ferreira, P. Pedreiras, L. Almeida, and J.A. Fonseca. The FTT-CAN protocol for flexibility in
safety-critical systems. IEEE Micro, Special Issue on Critical Embedded Automotive Networks, 22:
46-55, 2002.
[10] OSEK Consortium. OSEK/VDX Fault-Tolerant Communication, Version 1.0, July 2001. Available
at http://www.osek-vdx.org/.
[11] C. Tanzer, S. Poledna, E. Dilger, and T. Fuhrer. A fault-tolerance layer for distributed fault-tolerant
hard real-time systems. In Proceedings of the Annual IEEE Workshop on Fault-Tolerant Parallel and
Distributed Systems, San Juan, Puerto Rico, USA, 1999.
[12] H. Kopetz and G. Bauer. The time-triggered architecture. Proceedings of the IEEE, 91:
112-126, 2003.
[13] I.E. Noble. EMC and the automotive industry. Electronics and Communication Engineering Journal,
4(5): 263-271, 1992.
[14] E. Zanoni and P. Pavan. Improving the reliability and safety of automotive electronics. IEEE Micro,
13: 30-48, 1993.
[15] J. Barrenscheen and G. Otte. Analysis of the physical CAN bus layer. In Proceedings of the 4th
International CAN Conference, ICC'97, Berlin, Germany, October 1997, pp. 06.02-06.08.
[16] J. Rushby. A comparison of bus architectures for safety-critical embedded systems. Technical report,
NASA/CR, March 2003.
[17] L.-B. Fredriksson. CAN for critical embedded automotive networks. IEEE Micro, Special Issue on
Critical Embedded Automotive Networks, 22: 28-35, 2002.
[18] A. Albert. Comparison of event-triggered and time-triggered concepts with regards to distributed
control systems. In Proceedings of Embedded World 2004, Nürnberg, February 2004.
[19] X-by-Wire Project, Brite-EuRam III Program. X-By-Wire: safety related fault tolerant systems
in vehicles, final report, 1998.
[20] C. Wilwert, Y.Q. Song, F. Simonot-Lion, and T. Clément. Evaluating quality of service and
behavioral reliability of steer-by-wire systems. In Proceedings of the 9th IEEE International
Conference on Emerging Technologies and Factory Automation (ETFA), Lisbon, Portugal, 2003.
[21] J. Rushby. An overview of formal verification for the time-triggered architecture. In Proceedings
of Formal Techniques in Real-Time and Fault-Tolerant Systems, Oldenburg, Germany, 2002,
pp. 83-105.
[22] T.D. Chandra and S. Toueg. Unreliable failure detectors for reliable distributed systems. Journal
of the ACM, 43: 225-267, 1996.
[23] K. Tindell and H. Hansson. Babbling idiots, the dual-priority protocol, and smart CAN controllers.
In Proceedings of the 2nd International CAN Conference, London, UK, 1995, pp. 7.22-7.28.
[24] C. Temple. Avoiding the babbling-idiot failure in a time-triggered communication system. In Pro-
ceedings of the 28th International Symposium on Fault-Tolerant Computing, Munich, Germany,
June 1998.
[25] H. Pfeifer. Formal verification of the TTP group membership algorithm. In Proceedings
of FORTE/PSTV 2000, Pisa, Italy, 2000.
[26] H. Pfeifer and F.W. von Henke. Formal analysis for dependability properties: the time-triggered
architecture example. In Proceedings of the 8th IEEE International Conference on Emerging
Technologies and Factory Automation (ETFA 2001), Antibes, France, October 2001, pp. 343-352.
[27] F. Brasileiro, P. Ezhilchelvan, S. Shrivastava, N. Speirs, and S. Tao. Implementing fail-silent nodes
for distributed systems. IEEE Transactions on Computers, 45: 1226-1238, 1996.
[28] M. Hiller. Software fault-tolerance techniques from a real-time systems point of view:
an overview. Technical report, Chalmers University of Technology, Göteborg, Sweden,
November 1998.
[29] R. Santos Marques, N. Navet, and F. Simonot-Lion. Frame packing under real-time constraints.
In Proceedings of the 5th IFAC International Conference on Fieldbus Systems and their Applications,
FeT'2003, Aveiro, Portugal, July 2003, pp. 185-192.
[30] R. Saket and N. Navet. Frame packing algorithms for automotive applications. Technical report
RR-4998, INRIA, 2003. Available at http://www.inria.fr/rrrt/rr-4998.html.
[31] H. Kopetz, R. Nossal, R. Hexel, A. Krüger, D. Millinger, R. Pallierer, C. Temple, and M. Krug. Mode
handling in the time-triggered architecture. Control Engineering Practice, 6: 61-66, 1998.
[32] TTTech Computertechnik GmbH. Time-Triggered Protocol TTP/C, High-Level Specication
Document, Protocol Version 1.1, November 2003. Available at http://www.tttech.com.
[33] FlexRay Consortium. FlexRay Communication System, Protocol Specication, Version 2.0, June 2004.
Available at http://www.flexray.com.
[34] International Standard Organization. ISO 11519-2, Road Vehicles, Low-Speed Serial Data
Communication, Part 2: Low-Speed Controller Area Network, ISO, 1994.
[35] H. Kopetz. Real-Time Systems: Design Principles for Distributed Embedded Applications. Kluwer
Academic Publishers, Dordrecht, 1997.
[36] G. Bauer and M. Paulitsch. An investigation of membership and clique avoidance in TTP/C.
In Proceedings of the 19th IEEE Symposium on Reliable Distributed Systems, Nürnberg, Germany,
2000.
[37] FlexRay Consortium. FlexRay Communication System, Electrical Physical Layer, Version 2.0,
June 2004. Available at http://www.flexray.com.
[38] International Standard Organization. ISO 11898, Road Vehicles, Interchange of Digital
Information, Controller Area Network for High-Speed Communication, ISO, 1994.
[39] J. Rufino, P. Veríssimo, G. Arroz, C. Almeida, and L. Rodrigues. Fault-tolerant broadcasts in CAN.
In Proceedings of the 28th International Symposium on Fault-Tolerant Computing Systems, IEEE,
Munich, Germany, June 1998, pp. 150-159.
[40] International Standard Organization. ISO 11898-4, Road Vehicles, Controller Area Network (CAN),
Part 4: Time-Triggered Communication, ISO, 2000.
[41] G. Lima and A. Burns. Timing-independent safety on top of CAN. In Proceedings of the 1st
International Workshop on Real-Time LANs in the Internet Age, Vienna, Austria, 2002.
[42] G. Lima and A. Burns. A consensus protocol for CAN-based systems. In Proceedings of the 24th
Real-Time Systems Symposium, Cancun, Mexico, 2003, pp. 420-429.
[43] G. Rodriguez-Navas, M. Barranco, and J. Proenza. Harmonizing dependability and real time in
CAN networks. In Proceedings of the 15th Euromicro Conference on Real-Time Systems, Porto,
Portugal, 2003.
[44] J. Ferreira, L. Almeida, J. Fonseca, G. Rodriguez-Navas, and J. Proenza. Enforcing consistency
of communication requirements updates in FTT-CAN. In Proceedings of the 22nd Symposium on
Reliable Distributed Systems, Florence, Italy, 2003.
[45] G. Rodriguez-Navas and J. Proenza. Clock synchronization in CAN distributed embedded systems.
In Proceedings of the 3rd International Workshop on Real-Time Networks, Catania, Italy, 2004.
[46] M. Barranco, G. Rodriguez-Navas, J. Proenza, and L. Almeida. CANcentrate: an active star topology
for CAN networks. In Proceedings of the 5th International Workshop on Factory Communication
Systems, Vienna, Austria, 2004.
[47] H. Sivencrona, T. Olsson, R. Johansson, and J. Torin. RedCAN: simulations of two fault recovery
algorithms for CAN. In Proceedings of the 10th IEEE Pacific Rim International Symposium on
Dependable Computing, Papeete, French Polynesia, 2004, pp. 302-311.
[48] L.M. Pinho and F. Vasques. Reliable real-time communication in CAN networks. IEEE Transactions
on Computers, 52: 1594-1607, 2003.
[49] Robert Bosch GmbH. Time Triggered Communication on CAN. Available at http://www.can.
bosch.com/content/TT_CAN.html, 2004.
[50] B. Gaujal and N. Navet. Fault confinement mechanisms on CAN: analysis and improvements. IEEE
Transactions on Vehicular Technology, 54(5), 2004. Accepted for publication. Preliminary version
available as INRIA Research Report at http://www.inria.fr/rrrt/rr-4603.html.
43
Volcano: Enabling
Correctness by Design
Antal Rajnák
Volcano Communications
Technologies AG
43.1 Introduction .......................................... 43-1
43.2 Volcano Concepts ..................................... 43-3
     Volcano Signals and the Publish/Subscribe Model • Frames •
     Network Interfaces • The Volcano API • Timing Model •
     Capture of Timing Constraints
43.3 Volcano Network Architect ............................ 43-10
     The Car OEM Tool Chain: One Example • VNA Tool Overview
43.4 Volcano Software in an ECU ........................... 43-15
     Volcano Configuration Workflow
Acknowledgments ........................................... 43-18
References ................................................ 43-18
More Information .......................................... 43-18
43.1 Introduction
Volcano is a holistic concept defining a protocol-independent design methodology for distributed real-time
networks in vehicles. The concept deals with both technical and nontechnical entities (i.e., partitioning
of responsibilities into well-defined roles in the development process).
The vision of Volcano is "Enabling Correctness by Design." By taking a strict systems engineering
approach and focusing resources on design, a majority of system-related issues can be identified and
solved early in a project. The quality is designed into the vehicle, not tested out. Minimized cost, increased
quality, and a high degree of configuration/reconfiguration flexibility are the trademarks of the Volcano
concept.
The Volcano approach is particularly beneficial as the complexity of vehicles is increasing very rapidly
and as projects will have to cope with new functions and requirements throughout their lifetime.
A unique feature of the Volcano concept is the solution called post-compile-time reconfiguration
flexibility, where the network configuration containing the signal-to-frame mapping, ID assignment, and
frame periods is located in a configurable flash area of the Electronic Control Unit (ECU), and can
be changed without the need to touch the application software, thus eliminating the need for
re-validation and saving cost and lead time. The origin of the concepts can be traced back to a project at
Volvo Car Corporation during 1994 to 1998, when development of Volvo's new large platform [3]
took place. It reuses solid industrial experience, and takes into account recent findings from
real-time research (Figure 43.1) [2].
FIGURE 43.1 The main networks of the Volvo S80 [4]: ECUs such as the CEM, ABS, TCM, SRS, and DIM interconnected by a high-speed CAN bus (250 kbit/sec) and a low-speed CAN bus (125 kbit/sec).
The concept is characterized by three important features:
Ability to guarantee the real-time performance of the network already at the design stage, thus
signicantly reducing the need for testing.
Built-in exibility enabling the vehicle manufacturer to upgrade the network in the preproduction
phase of a project as well as in the aftermarket.
Efcient use of available resources.
The actual implementation of the concept consists of two major parts:
- The offline tool-set for requirement capturing and automated network design (covering multiple
protocols and gateway configuration). It provides strong administrative functions for variant and
version handling, which are needed during the complete life cycle of a car project.
- The target part, represented by a highly efficient and portable embedded software package. It offers a
signal-based API, handles multiple protocols, integrated gateway functionality, and post-compile-time
reconfiguration capability, together with a PC-based generation tool.
Even though the implementation originally supported the Controller Area Network (CAN) and Volcano
lite¹ protocols, it has successfully been extended to fit other emerging network protocols as well. LIN was
added first, followed by the FlexRay and MOST protocols. The philosophy behind this is that communication
has to be managed in one single development environment, covering all protocols used, in order
to ensure end-to-end timing predictability, while still providing the necessary architectural freedom to
choose the most economic solution for the task.
Over the last 40 years the computing industry has discovered that certain techniques are needed in order
to manage complex software systems. Two of these techniques are abstraction (where unnecessary information is
hidden) and composability (if software components proven to be correct are combined, then the resulting
system will be correct as well). Volcano makes heavy use of both these techniques.
The automotive industry is implementing an increasing number of software functions. Introduction of
protocols, such as MOST for multimedia and FlexRay for active chassis systems, results in highly complex
electrical architectures. Finally, all these complex subnetworks are linked through gateways. The behavior
of the entire car network has a crucial influence upon the car's performance and reliability. Managing
software development that involves many suppliers, hundreds of thousands of lines of code, and thousands of
signals requires a structured systems engineering approach. Inherent in the concept of systems engineering
is a clear partitioning of the architecture, requirements, and responsibilities.
¹A low-speed, SCI-based proprietary master-slave protocol used by Volvo.
A modern vehicle includes a number of microprocessor-based components called Electronic Control
Units (ECUs), provided by a variety of suppliers.
The Controller Area Network (CAN) provides an industry-standard solution for connecting ECUs together using a
single broadcast bus. A shared broadcast bus makes it much easier to add desired functionality: ECUs
can be added easily, and they can communicate data easily and cheaply (adding a function may be just
software). But increased functionality leads to more software and greater complexity. Testing a module
for conformance to timing requirements is the most difficult of the problems. With a shared broadcast
bus, the timing performance of the bus might not be known until all the modules are delivered and the bus
usage of each is known. Only then can testing for timing conformance begin (which is often too far into
the development of a vehicle to find and correct major timing errors). The supplier of a module can only
do limited testing for timing conformance: they do not have a complete picture of the final load placed on
the bus. This is particularly important when dealing with the CAN bus: arrivals of frames from the bus
may cause interrupts on a module wishing to receive the frames, and so the load on the microprocessor
in the ECU is partially dependent on the bus load.
It is often thought that CAN is somehow unpredictable and that the latencies for lower-priority frames in the
network are unbounded. This is untrue; in fact, CAN is a highly predictable communications protocol.
Furthermore, CAN is well suited to handle large amounts of traffic with differing time constraints.
However, with CAN there are a few particular problems:
- The distribution of identifiers. CAN uses identifiers for two purposes: distinguishing different
messages on the bus, and assigning relative priorities to those messages; the latter is often
neglected.
- Limited bandwidth. This is due to the low maximum signaling speed of 1 Mbit/sec, further reduced by
significant protocol overhead.
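To see the scale of that overhead, the sketch below applies the classic worst-case frame-length calculation from the CAN scheduling literature for standard (11-bit identifier) data frames, accounting for the maximum number of stuff bits. This is an illustration, not Volcano code.

```python
def can_frame_bits(payload_bytes: int) -> int:
    """Worst-case length in bits of a standard (11-bit ID) CAN data frame,
    including the maximum number of stuff bits."""
    assert 0 <= payload_bytes <= 8
    fixed = 47              # SOF, ID, control, CRC, ACK, EOF, interframe space
    data = 8 * payload_bytes
    # 34 of the fixed bits plus all data bits are subject to bit stuffing;
    # in the worst case one stuff bit is inserted after every 4 bits.
    stuff = (34 + data - 1) // 4
    return fixed + data + stuff

def worst_case_tx_time_us(payload_bytes: int, bitrate_bps: int) -> float:
    """Worst-case time to clock one frame onto the bus, in microseconds."""
    return can_frame_bits(payload_bytes) * 1e6 / bitrate_bps
```

An 8-byte frame occupies up to 135 bit times, of which only 64 carry application data; at 125 kbit/s (the low-speed bus of Figure 43.1) such a frame blocks the bus for about 1.08 msec.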
Volcano was designed to provide abstraction, composability, and identifier distribution reflecting true
urgencies, while at the same time providing the most efficient utilization of the protocol.
43.2 Volcano Concepts
The Volcano concept is founded on the ability to guarantee the worst-case latencies of all frames sent in a
multiprotocol network system. This is a key step because it gives the following:
- A way of guaranteeing that there are no communications-related timing problems.
- A way of maximizing the amount of information carried on the bus. The latter is important for
reduced production costs.
- The possibility to develop highly automated tools for the design of optimal network configurations.
The timing guarantee for CAN is provided by mathematical analysis developed from academic research [1].
Other protocols, such as FlexRay, are predictable by design. For this reason, some of the subjects discussed
below are CAN specific; others are independent of the protocol used.
The analysis is able to calculate the worst-case latency for each frame sent on the bus. This latency
is the longest time from placing a frame in a CAN controller at the sending side to the time the frame is
correctly received at all receivers. The analysis needs to make several assumptions about how the bus is
used. One of these assumptions is that there is a limited set of frames that can access the bus, and that
time-related attributes of these frames are known (e.g., frame size, frame periodicity, queuing jitter, and
so on).
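As an illustration of the kind of analysis involved, the following sketch implements the classic fixed-point response-time calculation for CAN from the scheduling research that [1] belongs to. It is a deliberate simplification (single bus, no error/retransmission model), not the Volcano tool itself.

```python
from math import ceil

def can_response_times(frames, tau_bit):
    """Worst-case latency (queuing to correct reception) of each frame on one
    CAN bus. `frames` is sorted highest priority first; each entry is
    (name, C, T, J): worst-case transmission time, period, queuing jitter.
    `tau_bit` is one bit time, in the same time unit as C, T, and J."""
    results = {}
    for i, (name, C_i, T_i, J_i) in enumerate(frames):
        # Blocking: a lower-priority frame already on the bus cannot be preempted.
        B = max((C_j for (_, C_j, _, _) in frames[i + 1:]), default=0.0)
        w = B
        while True:
            # Interference from every higher-priority frame released while
            # this frame is still waiting to win arbitration.
            w_next = B + sum(
                ceil((w + J_j + tau_bit) / T_j) * C_j
                for (_, C_j, T_j, J_j) in frames[:i]
            )
            if w_next == w:
                break
            if J_i + w_next + C_i > T_i:
                raise ValueError(f"{name}: deadline (period) exceeded")
            w = w_next
        results[name] = J_i + w + C_i
    return results
```

For example, three 1-msec frames of period 10 msec on one bus yield worst-case latencies of 2, 3, and 3 msec respectively: each frame can be blocked by one lower-priority frame, and lower priorities additionally suffer interference from higher ones.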
Another important assumption is that the CAN hardware can be driven correctly:
- The internal message queue within any CAN controller in the system is organized (or can be used)
such that the highest-priority message will be sent out first if more than one message is ready
to be sent. (Hardware arbitration based on transmit-slot position is acceptable as long as the number of sent
frames is less than the number of transmit slots available in the CAN controller.)
- The CAN controller should be able to send out a stream of scheduled messages without releasing
the bus in the interframe space between two messages. Such devices will arbitrate for the bus right
after sending the previous message and will only release the bus in case of lost arbitration.
A third important assumption is the error model: the analysis can account for retransmissions due to
errors on the bus, but requires a model for the number of errors in a given time interval.
The Volcano software running in each ECU controls the CAN hardware and accesses the bus so that
all these assumptions are met, allowing application software to rely on all communications taking place on
time. This means that integration testing at the automotive manufacturer can concentrate on functional
testing of the application software.
Another important benefit is that a large amount of communications protocol overhead can be avoided.
Examples of how protocol overheads are reduced by obtaining timing guarantees are:
- There is no need to provide frame acknowledgment within the communications layer, dramatically
reducing bus traffic. The only case where an ECU can fail to receive a frame via CAN is if the ECU
is off the bus, a serious fault that is detected and handled by network management and on-board
diagnostics.
- Retransmissions are unnecessary. The system-level timing analysis guarantees that a frame will
arrive on time. Timeouts only happen after a fault, which can be detected and handled by network
management and/or the on-board diagnostics.
A Volcano system never suffers from intermittent overruns during correct operation because of the
timing guarantees, and therefore achieves these efficiency gains.
43.2.1 Volcano Signals and the Publish/Subscribe Model
The Volcano system provides signals as the basic communication object. Signals are small data items
that are sent between ECUs.
The publish/subscribe model is used for defining signaling needs. For a given ECU there is a set of
signals that are published (i.e., made available to the system integrator), and a number of subscribed
signals (i.e., signals that are required as inputs to the ECU).
The signal model is provided directly to the programmer of ECU application software, and the Volcano
software running in each ECU is responsible for translation between signals and CAN frames.
An important design requirement for the Volcano software was that the application-programmer
is unaware of the bus behavior: all the details of the network are hidden and the programmer only deals
with signals through a simple API. This is crucial because a major problem with alternative techniques
is that the application software makes assumptions about the CAN behavior and, therefore, changing the
bus behavior becomes difficult.
In Volcano there are three types of signals:
- Integer signals. These represent unsigned numbers and are of a static size between 1 and 16 bits. So,
for example, a 16-bit signal can store integers in the range 0 to 65,535.
- Boolean signals. These represent truth conditions (true/false). Note that this is not the same as
a 1-bit integer signal (which stores the integer values 0 or 1).
- Byte signals. These represent data with no Volcano-defined structure. A byte signal consists of
a fixed number of bytes, between 1 and 8.
The advantage of Boolean and integer signals is that the values of a signal are independent of processor
architecture (i.e., the values of the signals are consistent regardless of the endian-ness of the
microprocessors in each ECU).
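Endian-independence falls out of defining signal values in terms of bit positions within the frame rather than host words. The sketch below illustrates the idea; the bit-numbering convention (bit 0 = least-significant bit of byte 0) is our assumption for illustration, not Volcano's documented packing rule.

```python
def extract_int_signal(frame: bytes, start_bit: int, length: int) -> int:
    """Extract an unsigned integer signal (1..16 bits) from a frame payload.
    Because the value is assembled bit by bit from the byte stream, the
    result is identical on big- and little-endian ECUs."""
    assert 1 <= length <= 16
    value = 0
    for i in range(length):
        bit = start_bit + i
        byte_index, bit_index = bit // 8, bit % 8
        if frame[byte_index] & (1 << bit_index):
            value |= 1 << i
    return value
```

A 16-bit signal starting at bit 0 of an all-ones frame yields 65,535, matching the stated range of integer signals.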
For published signals, Volcano internally stores the value of these signals and, in the case of periodic signals,
will send them to the network according to a pattern defined offline by the system integrator. The system
integrator also defines the initial value of a signal. The value of a signal persists until updated by the
application program via a write call or until Volcano is reinitialized.
For subscribed signals, Volcano internally stores the current value of each signal. The system integrator
also defines the initial value of a signal. The value of a subscribed signal persists until:
- It is updated by receiving a new value from the network
- Volcano is reinitialized
- A signal refresh timeout occurs and the value is replaced by a substitute value defined by the
application-programmer
In the case where new signal values are received from the network, these values will not be reflected in the
values of subscribed signals until a Volcano input call is made.
A published signal value is updated via a write call. The latest value of a subscribed signal is obtained
via a read call. A write call for a subscribed signal is not permitted.
The last-written value of a published signal may be obtained via a read call.
43.2.1.1 Update Bits
The Volcano concept permits placement of several signals with different update rates into the same frame.
It provides a special mechanism named the update bit to indicate which signals within the frame have
actually been updated: that is, the ECU generating the signal wrote a fresh value of the signal since the last
time the frame was transmitted. The Volcano software on an ECU transmitting a signal automatically clears
the update bit when it has been sent. This ensures that a Volcano-based ECU on the receiving side will
know each time the signal has been updated (the application can see this update bit by using flags tied
to an update bit; see below). Using update bits to their full extent requires that the underlying protocol
is secure (frames cannot be lost without being detected). The CAN protocol is regarded as such, but
not the LIN protocol. Therefore, the update bit mechanism is limited to CAN within Volcano.
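The write-sets/transmit-clears rule can be captured in a small behavioral model. This is our illustration of the mechanism described above, not Volcano source code; class and method names are invented.

```python
class PublishedSignal:
    """Model of a published signal with an update bit: a write sets the bit,
    and the transmitting side clears it once the containing frame is sent."""
    def __init__(self, initial=0):
        self.value = initial
        self.update_bit = False

    def write(self, value):
        """Application write call: store the value and mark it fresh."""
        self.value = value
        self.update_bit = True

class Frame:
    """Container for several signals transmitted together."""
    def __init__(self, signals):
        self.signals = signals

    def transmit(self):
        """Snapshot (value, update_bit) pairs for the bus, then clear the
        update bits, as the transmitting Volcano software does."""
        image = [(s.value, s.update_bit) for s in self.signals]
        for s in self.signals:
            s.update_bit = False
        return image
```

Writing one of two signals in a frame and transmitting twice shows the receiver exactly one fresh value, then none, which is the information the update bit exists to convey.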
43.2.1.2 Flags
A flag is a Volcano object purely local to an ECU. It is bound to one of two things:
- The update bit of a received Volcano signal; the flag is set when the update bit is set.
- The containing frame of a signal; the flag is set when the frame containing the signal is received
(regardless of whether an update bit for the signal is set).
Many flags can be bound to each update bit, or to the reception of a containing frame. Volcano sets all the
flags bound to an object when the occurrence is seen. The flags are cleared explicitly by the application
software.
43.2.1.3 Timeouts
A timeout is, like the flags, a Volcano object purely local to an ECU. The timeout is declared by the
application-programmer and is bound to a subscribed signal. A timeout condition occurs when the
particular signal was not received within the given time limit. In this case, the signal (and/or a number of
other signals) is/are set to a value specified as part of the declaration of the timeout. As with the flags, the
timeout reset mechanism can be bound to either:
- The update bit of a received Volcano signal.
- The frame carrying a specific signal.
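A behavioral sketch of the refresh-timeout rule, counting processing periods since the last reception; the names and the period-counting granularity are our illustrative assumptions, not the Volcano API.

```python
class SubscribedSignal:
    """Subscribed signal with a refresh timeout: if no new value arrives
    within `timeout` processing periods, the value reverts to a substitute
    specified when the timeout is declared."""
    def __init__(self, initial, timeout, substitute):
        self.value = initial
        self.timeout = timeout
        self.substitute = substitute
        self._since_rx = 0

    def on_receive(self, value):
        """New value arrived from the network; reception resets the timeout."""
        self.value = value
        self._since_rx = 0

    def tick(self):
        """Called once per processing period by the input processing."""
        self._since_rx += 1
        if self._since_rx >= self.timeout:
            self.value = self.substitute
```

With a timeout of three periods, the received value survives two silent periods and is replaced by the substitute on the third.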
43.2.2 Frames
A frame is a container capable of carrying a certain amount of data (0 to 8 bytes for CAN and LIN).
Several signals can be packed into the available data space and transmitted together in one frame on the
network. The total size of a frame is determined by the protocol. A frame can be transmitted periodically
or sporadically. Each frame is assigned a unique identifier. The identifier serves two purposes in the
CAN case:
- Identifying and filtering a frame on reception at an ECU.
- Assigning a priority to a frame.
43.2.2.1 Immediate Frames
Volcano normally hides the existence of network frames from the application designer. However, in
certain cases there is a need to send and receive frames with very short processing latencies. In these cases
direct application support is required. Such frames are designated immediate frames.
There are two Volcano calls to handle immediate frames:
- A transmit call, which immediately sends the designated frame to the network.
- A receive call, which immediately processes the designated incoming frame if that frame is
pending.
There is also a read update bit call to test the update bit of a subscribed signal within an immediate
frame.
The signals packed into an immediate frame can be accessed with normal read and write function
calls in the same way as all other normal signals. The application-programmer is responsible for ensuring
that the transmit call is made only when the signal values of published signals are consistent.
43.2.2.2 Frame Modes
Volcano allows different frame modes to be specified for an ECU. A frame mode is a description of
an ECU working mode, in which a set of frames (signals) can be active (input and output). A frame can
be active in one or many frame modes. The timing properties of frames do not have to be the same for
different frame modes supporting the same frame.
43.2.3 Network Interfaces
A network interface is the device used to send and receive frames to and from networks. A network
interface connects a given ECU to the network. In the CAN case, more than one network interface (CAN
controller) on the same ECU may be connected to the same network. Likewise, an ECU may be connected
to more than one network.
The network interfaces in Volcano are protocol specific. The protocols currently supported are CAN and
LIN; FlexRay and MOST are under implementation.
The network interface is managed by a standard set of Volcano calls. These allow the interface to
be initialized or reinitialized, connected to the network (i.e., begin operating the defined protocol), or
disconnected from the network (i.e., take no further part in the defined protocol). There is also a Volcano
call to return the status of the interface.
43.2.4 The Volcano API
The Volcano API provides a set of simple calls to manipulate signals and to control the CAN/LIN controllers.
There are also calls to control Volcano sending to, and receiving from, networks. To manipulate signals
there are read and write calls. A read call returns to the caller the latest value of a signal; a write
call sets the value of a signal. The read and write calls are the same regardless of the underlying
network type.
43.2.4.1 Volcano Thread-of-Control
There are two Volcano calls that must be called at the same fixed rate: v_input() and v_output(). If the
v_gateway() function is used, the same calling rate shall be used as for the v_input() and v_output()
functions. The v_output() call places the frames into the appropriate controllers. The v_input() call takes
received frames and makes the signal values available to read calls. The v_gateway() call copies values
of signals in frames received from the network to values of signals in frames sent to the network. The
v_sb_tick() call handles transmitting and receiving frames for sub-buses.
Volcano also provides a very low latency communication mechanism in the form of the immediate
frame API. This is a view of frames on the network which allows transmission and reception from/to the
Volcano domain without the normal Volcano input/output latencies, or mutual exclusion requirements
with the v_input() and v_output() calls. There are two communication calls in the immediate signal API:
v_imf_rx() and v_imf_tx().
The v_imf_tx() call copies values of immediate signals into a frame and places the frame in the
appropriate CAN controller for transmission. The v_imf_rx() call takes a received frame containing immediate
signals and makes the signal values available to read calls.
A third call, v_imf_queued(), allows the user to see if an immediate frame has really been sent on the
network. The controller calls allow the application to initialize, connect, and disconnect from networks,
and to place the controllers into sleep mode, among others.
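A minimal model of one processing period may help fix the calling pattern. The call names come from the text; their bodies here are placeholder stubs, and the exact ordering (input, application work, gateway, output) is one plausible arrangement consistent with the description, not mandated by it.

```python
trace = []  # records the order of the Volcano processing calls

def v_input():   trace.append("in")    # unpack received frames into signal values
def v_gateway(): trace.append("gw")    # copy gatewayed signals between networks
def v_output():  trace.append("out")   # pack published signals, queue frames

def processing_cycle():
    """One Volcano processing period. On a real ECU this is triggered by a
    periodic timer at the Volcano processing period; all three calls run at
    the same fixed rate, in a single thread of control."""
    v_input()
    # ... application reads subscribed signals and writes published ones ...
    v_gateway()
    v_output()
```

Each cycle produces exactly one input, gateway, and output pass, which is what makes the jitter analysis of Section 43.2.5.1 tractable.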
43.2.4.2 Volcano Resource Information
The ambition of the Volcano concept is to provide a fully predictable communications solution. In order
to achieve this, the resource usage of the Volcano embedded part has to be determined. Resources of
special interest are memory and execution time.
43.2.4.2.1 Execution Time of Volcano Processing Calls
In order to bound processing time, a budget for the v_input() call (that is, the maximum number of
frames that will be processed by a single call to v_input()) has to be established. A corresponding process
for transmitted frames applies as well.
43.2.5 Timing Model
The Volcano timing model covers end-to-end timing (i.e., from button press to activation). A timing model
is used to set in context the signal timing information needed in order to analyze a network
configuration of signals and frames. This section defines the required information that must be provided
by an application-programmer in order to be able to guarantee the end-to-end timing requirements.
A Volcano signal is transported over a network within a frame. Figure 43.2 identifies six time points
between the generation and consumption of a signal value.
FIGURE 43.2 The Volcano timing model. (Timeline of the six time points from notional generation to notional consumption, showing the intervals T_PL, T_BT, T_T, T_AT, and T_SL, and the overall max_age requirement.)
The six time points are:
1. Notional generation: the signal is generated either by hardware (e.g., switch pressed) or software
(e.g., timeout signaled). The user can define this point to best reflect their system.
2. First v_output() (or v_imf_tx() for an immediate frame) at which a new value is available. This is
the first such call after the signal value is written by a write call.
3. The frame containing the signal is first entered for transmission (arbitration on a CAN bus).
4. Transmission of the frame completes successfully (i.e., the subscriber's communication controller
receives the frame from the network).
5. v_input() (or v_imf_rx() for an immediate frame) makes the signal available to the application.
6. Notional consumption: the user application consumes the data. The user can define this point
to best reflect their system.
The max_age of the signal is the maximum age, measured from notional generation, at which it is
acceptable for notional consumption. The max_age is the overall timing requirement on a signal.
T_PL (publish latency) is the time from notional generation to the first v_output() call at which the signal
value is available to Volcano (a write call has been made). It will depend on the properties of the
publishing application. Typical values might be the frame_processing_period (if the signal is written
fresh at every period but this is not synchronized with v_output()), the offset between the write call and
v_output() (if the two are synchronized), or the sum of the frame_processing_period and the period of
some lower-rate activity that generates the value. This value must be given by the application-programmer.
T_SL (subscribe latency) is the time from the first v_input() that makes the new value available to the
application to the time when the value is consumed. The consumption of a signal is a user-defined event
that will depend on the properties of the subscribing function. As an example, it can be a lamp being lit,
or an actuator starting to move. This value must be given by the application-programmer.
The intervals T_BT, T_T, and T_AT are controlled by the Volcano 5 configuration and are dependent upon
the nature of the frame in which the signal is transported.
The value T_BT is the time before transmission (the time from the v_output() call until the frame enters
arbitration on the bus). T_BT is a per-frame value that depends on the type of frame carrying the signal
(see later sections). This time is shared by all signals in the frame, and is common to all subscribers to
those signals.
The value T_AT is the time after transmission (the time from when the frame has been successfully
transmitted on the network until the next v_input() call). T_AT is a per-frame value that may be different for
each subscribing ECU.
The value T_T is the time required to transmit the frame (including the arbitration time) on the
network.
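Assembling the intervals defined above (an inequality implied by those definitions rather than quoted from the chapter), a signal meets its overall timing requirement when the worst-case sum of the five intervals does not exceed its max_age:

```latex
T_{PL} + T_{BT} + T_{T} + T_{AT} + T_{SL} \;\le\; \mathrm{max\_age}
```

Here T_PL and T_SL are supplied by the application-programmer, while T_BT, T_T, and T_AT follow from the network configuration.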
43.2.5.1 Jitter
The application-programmer at the supplier must also provide information about the jitter to the systems
integrator. This information is as follows:
The input_jitter and output_jitter refer to the variability in the time taken to complete the v_input()
and v_output() calls, measured relative to the occurrence of the periodic event causing Volcano processing
to be done (i.e., calls to v_input(), v_gateway(), and v_output() to be made). Figure 43.3 shows how the
output_jitter is measured. In the figure, E marks the earliest completion time of the v_output() call, and
L marks the latest completion time, relative to the start of the cycle. The output_jitter is therefore L - E.
The input_jitter is measured according to the same principles.
If a single-threaded system is used, without interrupts, the calculation of the input_jitter and output_jitter
is straightforward: the earliest time is the best-case execution time of all the calls in the cycle (including
the v_output() call), and the latest time is the worst-case execution time of all the calls. The situation is
more complex if interrupts can occur or the system consists of multiple tasks, since the latest time must
take into account preemption from interrupts and other tasks.
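For the single-threaded, interrupt-free case, the calculation reduces to two sums; the numbers below are purely illustrative.

```python
def output_jitter(best_case_exec, worst_case_exec):
    """Output jitter for a single-threaded cycle without interrupts.
    Inputs are per-call best-case and worst-case execution times for all
    calls in the cycle, with v_output() last (same order in both lists)."""
    E = sum(best_case_exec)   # earliest completion of v_output(), from cycle start
    L = sum(worst_case_exec)  # latest completion
    return L - E              # output_jitter = L - E (Figure 43.3)

# Illustrative numbers (microseconds): three calls ending with v_output(),
# best cases 40 + 10 + 25 and worst cases 60 + 15 + 35 give a jitter of 35.
```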
FIGURE 43.3 Measurement of output jitter. (Within each frame processing period, E and L mark the earliest and latest completion of the v_output() call, relative to the periodic event that initiates the Volcano processing calls.)
43.2.6 Capture of Timing Constraints
The declaration of a signal in a Volcano fixed configuration file provides syntax to capture the following
timing-related information:
- Whether a signal is state or state change (info_type)
- Whether a signal is sporadic or periodic (generation_type)
- The latency
- The min_interval
- The max_interval
- The max_age
The first two (together with whether the signal is published or subscribed to) provide signal properties
that determine the kind of the signal.
A state signal carries a value that completely describes the signaled property (e.g., the current position
of a switch). A subscriber to such a signal need only observe the signal value when the information is
required for the subscriber's purposes (e.g., signal values can be missed without affecting the usefulness
of later values). A state change signal carries a value that must always be observed in order to be meaningful
(e.g., distance traveled since last signal value). A subscriber must observe every signal value.
A sporadic signal is one that is written by the application in response to some event (e.g., a button
press). A periodic signal is one that is written by the application at regular intervals.
The latency of a signal is the time from notional generation to being available to Volcano (for a
published signal), or from being made available to the application by Volcano to notional consumption
(for a subscribed signal). Note that immediate signals (those in immediate frames) include time taken to
move frames to/from the network in these latencies.
The min_interval has different interpretations for published and for subscribed signals. For a published
signal, it is the minimum time between any pair of write calls to the signal (this allows, e.g., the calculation
of the maximum rate at which the signal could cause a sporadic frame carrying it to be transmitted).
For a subscribed signal, it is the minimum acceptable time between arrivals of the signal. This is
optional: it is intended to be used if the processing associated with the signal is triggered by the arrival of
a new value, rather than being periodic. In such a case, it provides a constraint that the signal should not be
connected to a published signal with a faster rate.
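That constraint is easy to check mechanically. The function below is our illustration of the rule, not the actual checker in the Volcano tools; times are in whatever unit the intervals are declared in.

```python
def subscription_compatible(published_min_interval, subscribed_min_interval):
    """True if a subscribed signal may be connected to a published signal.
    The subscriber's min_interval (None if not declared; it is optional)
    is the minimum acceptable time between arrivals. The publisher's
    min_interval bounds how often the signal can be written, and hence
    how often a frame carrying it can be transmitted."""
    if subscribed_min_interval is None:
        return True  # attribute not declared: no constraint to enforce
    # The publisher must not be able to produce values faster than the
    # subscriber accepts them.
    return published_min_interval >= subscribed_min_interval
```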
The max_interval has different interpretations for published and subscribed signals. For a published
signal, the interesting timing information is already captured by min_interval and the publish
latency.
For a subscribed signal, it is the maximum interval between notional consumptions of the signal (i.e., it
can be used to determine that signal values are sampled quickly enough so that none will be missed).
The max_age of a signal is the maximum acceptable age of a signal at notional consumption, measured
from notional generation. This value is meaningful for subscribed signals.
In addition to the signal timing properties described above, the Volcano fixed configuration file provides
syntax to capture the following additional timing-related information:
The Volcano processing period. The Volcano processing period defines the nominal interval between
successive v_input() calls on the ECU, and also between successive v_output() calls (i.e., the rates of the
calls are the same, but v_input() and v_output() are not assumed to become due at the same instant).
For example, if the Volcano processing period is 5 msec then each v_output() call becomes due 5 msec
after the previous one became due.
The Volcano jitter time. The Volcano jitter defines the time by which the actual call may lag behind the
time at which it became due. Note that "becomes due" refers to the start of the call, and jitter refers to
completion of the call.
43.3 Volcano Network Architect
To manage increasing complexity in electrical architectures, a structured development approach is believed
essential to assure correctness by design. Volcano Automotive Group has developed a network design tool,
Volcano Network Architect (VNA), to support a development process based on strict systems engineering
principles. Gatewaying of signals between different networks is automatically handled by the VNA tool and
the accompanying embedded software. The tool supports partitioning of responsibilities into different
roles, such as system integrator and function owner. Third-party tools may be used for functional modeling.
These models can be imported into VNA.
Volcano Network Architect is the top-level tool in the Volcano Automotive Group's tool chain for
designing vehicle network systems. The tool chain supports important aspects of systems engineering
such as:
- Use of functional modeling tools.
- Partitioning of responsibilities.
- Abstracting away from hardware and protocol-specific details by providing a signal-based API for the
application developer.
- Abstracting away from the network topology through automatic gatewaying between different
networks.
- Automatic frame compilation to ensure that all declared requirements are fulfilled (if possible),
that is, delivering correctness by design.
- Reconfiguration flexibility by supporting post-compile-time reconfiguration capability.
The VNA tool supports network design and makes management and maintenance of distributed network
solutions more efficient. The tool supports the capturing of requirements and then takes a user through all
stages of network definition.
43.3.1 The Car OEM Tool Chain: One Example
Increasing competition and complex electrical architectures demand enhanced processes. Function modeling
has proved to be a suitable tool to capture the functional needs in a vehicle. Tools such as Rational
Rose provide a good foundation to capture all the different functions, and other tools, such as Statemate and
Simulink, model them in order to allocate objects and functionality in the vehicle. Networking is essential
since the functionality is distributed among a number of ECUs in the vehicle. Substantial parts of the
outcome from the function modeling are highly suitable to use as input to a network design tool, such
as VNA.
FIGURE 43.4 VNA screen.
The amount of information required to properly define the networks is vast. To support input of data,
VNA provides automated import from third-party tools through an XML-based format.
It is the job of the signal database administrator/system integrator to ensure that all data entered into the
system are valid and internally consistent. VNA supports this task through a built-in multilevel consistency
checker that verifies all data (Figure 43.4).
In this particular approach the network is designed by the system integrator in close contact with
the different function owners in order to capture all necessary signaling requirements, functional and
nonfunctional (including timing). When the requirements are agreed and documented in VNA, the
system integrator uses VNA to pack all signals into frames; this can be done manually or automatically.
The algorithm used by VNA handles gatewaying by partitioning end-to-end timing requirements into
requirements per network segment.
All requirements are captured in the form of a Microsoft Word document called the Software Requirement
Specification (SWRS), which is generated by VNA and sent to the different node owners as a draft copy to
be signed off. When all SWRSs have been signed off, VNA automatically creates all necessary configuration
files used in the vehicle, along with a variety of files for third-party analysis and measurement tools.
The network-level (global) configuration files are used as input to the Volcano Configuration Tool and
Volcano Back-End Tool in order to generate a set of downloadable binary configuration files for each
node. The use of reconfigurable nodes makes the system very flexible, since the Volcano concept separates
application-dependent information and network-dependent information. A change in the network by the
system integrator can easily be applied to a vehicle without having to recompile the application software in
the nodes. The connection between function modeling and VNA provides good support for iterative
design. It verifies network consistency and timing up front, to ensure a predictable and deterministic
network.
43.3.2 VNA Tool Overview
43.3.2.1 Global Objects
The workflow in VNA ensures that all relevant information about the network is captured. Global objects are created first, and then (re)used in several projects. The VNA user works with objects of types such
[Figure 43.5 shows the VNA tool structure: a GUI, frame compiler, consistency check, and configuration generators operating around a central database with a backup database and a local RAM mirror, together with converters and export/import interfaces producing Volcano configuration files, specs/reports/documents (SWRS, ASAP, HTML), LIN description files (.ldf), and generic XML/FIBEX exchange files for customer and third-party formats.]

FIGURE 43.5 The database is a central part of the VNA system. In order to ensure the highest possible performance, each instance of VNA accesses a local mirror of the database that is continuously synchronized with its parent.
as signals, nodes, and interfaces. These objects are used to build up the networks used in a car. Signals are defined by name and type, and can have logical or physical encoding information attached. Interfaces detailing hardware requirements are defined, and from these the actual nodes on a network are described. For each node, receive and transmit signals are defined, and timing requirements are provided for the signals. This information is intended for global use, that is, across car variants, platforms, etc.
43.3.2.2 Project- or Configuration-Related Data (Projects, Configurations, Releases)
When all global data have been collected, the network is designed by connecting the interfaces in a desired configuration. VNA has strong project and variant handling. Different configurations can selectively use or adapt the global objects, for example, by removing a high-end feature from a low-end car model. This means that VNA can manage multiple configurations, designs, and releases, with version and variant handling.
The release handling ensures that all components in a configuration are locked. It is, however, still possible to reuse the components in unchanged form. This makes it possible to go back to any released configuration at any point in time (Figure 43.5).
43.3.2.3 Database
All data objects, both global and configuration-specific, are stored in a common database. The VNA tool was designed to have one common multiuser database per car OEM. To secure the highest possible performance, all complex and time-consuming VNA operations are performed on a local RAM mirror of the database. A specially designed database interface ensures consistency in the local mirror. Operations that are not time critical, such as database management, operate directly on the database.
The built-in multiuser functionality allows multiple users to access all data stored in the database simultaneously. To ensure that a data object is not modified by more than one user, the object must be locked before any modification; read access is of course still allowed for all users while an object is locked for modification.
43.3.2.4 Version and Variant Handling
The VNA database implements functionality for variant and version handling. Most of the global data objects, for example, signals, functions, and nodes, may exist in different versions, but only one version of an object can be used in a specific project/configuration.
The node objects can be seen as the main global objects, since hierarchically they include all other types of global objects. The node objects can exist in different variants, but only one object can be used from a variant folder in a particular project/configuration.
43.3.2.5 Consistency Checking
Extensive functionality for consistency checking is built into the VNA tool. The consistency check can be activated manually when needed, but it also runs continuously to check user input and give immediate feedback on any suspected inconsistency. The consistency check ensures that the network design follows predefined rules and generates errors when appropriate.
43.3.2.6 Timing Analysis/Frame Compilation
The Volcano concept is based on a foundation of guaranteed message latency and a signal-based publish and subscribe model. This provides abstraction by hiding the network and protocol details, allowing the developer to work in the application domain with signals, functions, and related timing information.
Much effort has been spent on developing and refining the timing analysis in VNA. The timing analysis is built upon a scheduling model called DMA (Deadline Monotonic Analysis), and calculates the worst-case latency for each frame among a defined set of frames sent on the bus. Parts of this functionality have been built into the consistency check routine as described above, but the real power of the VNA tool is found in the frame packer/frame compiler functionality.
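The worst-case latency calculation at the heart of such an analysis can be sketched as follows. This is a deliberately simplified illustration in the spirit of the published CAN schedulability analysis (see the references at the end of this chapter), not VNA's actual implementation; the frame parameters are invented. A frame's worst-case queuing delay is the blocking by the longest lower-priority frame already on the bus, plus interference from all higher-priority frames, computed by fixed-point iteration:

```c
/* Illustrative worst-case CAN frame latency sketch (NOT VNA's actual
 * implementation). Frames are indexed by priority (0 = highest); all
 * times are in microseconds. */
typedef struct {
    long C;  /* worst-case transmission time of the frame (us) */
    long T;  /* period / minimum inter-arrival time (us) */
} frame_t;

/* Worst-case latency of frame i = blocking by the longest lower-priority
 * frame + higher-priority interference (fixed point) + own transmission. */
long worst_case_latency(const frame_t *f, int n, int i)
{
    long B = 0;                        /* longest lower-priority frame */
    for (int j = i + 1; j < n; j++)
        if (f[j].C > B)
            B = f[j].C;

    long w = B, prev = -1;
    while (w != prev) {                /* fixed-point iteration */
        prev = w;
        w = B;
        for (int j = 0; j < i; j++)
            /* ceil((prev + 1) / T_j) releases of frame j fit in the
             * window; the +1 counts a frame queued at the same instant. */
            w += ((prev + 1 + f[j].T - 1) / f[j].T) * f[j].C;
    }
    return w + f[i].C;
}
```

A tool would compare each frame's computed latency against its deadline; any frame whose latency exceeds its deadline indicates an infeasible identifier assignment or packing.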
The frame packer/compiler attempts to create an optimal packing of signals into frames, then calculates the proper ID for every frame, ensuring that all the timing requirements captured earlier in the process are fulfilled (if possible). This automatic packing of multiple signals into each frame makes more efficient use of the data bus by amortizing some of the protocol overheads involved, thus lowering bus load. The combined effect of multiple signals per frame and perfect filtering results in a lower interrupt and CPU load, which means that the same performance can be obtained at lower cost. The frame packer can create the most optimal solution if all nodes are reconfigurable. To handle carry-over nodes that are not reconfigurable (ROM-based), these nodes and their associated frames can be classed as fixed. Frame packing can also be performed manually if desired. Should changes to the design be required at a later time, the process allows rapid turnaround of design changes, rerunning the Frame Compiler and regenerating the configuration files.
The VNA tool can be used to design network solutions that are later realized by embedded software from any provider. However, the VNA tool is designed with the Volcano embedded software (VTP) in mind, which implements the expected behavior in the different nodes. To get the full benefits of the tool chain, VNA and VTP should be used together.
43.3.2.7 Volcano Filtering Algorithm
A crucial aspect of network configuration is how to choose identifiers so as to minimize the load on a CPU caused by handling interrupts generated by frames of no interest to the particular node: most CAN controllers have only limited filtering capabilities. The Volcano filtering algorithm is designed to achieve this.
An identifier is split into two parts: the priority bits and the filter bits. All frames on a network must have unique priority bits; for real-time performance, the priority setting of a frame should reflect the relative urgency of the frame. The filter bits are used to determine if a CAN controller should accept or reject a frame. Each ECU that needs to receive frames by interrupts is assigned a single filter bit; the hardware filtering in the CAN controller is set to "must match 1" for that filter bit, and "don't care" for all other bits.
[Figure 43.6 shows a 29-bit extended CAN identifier laid out with 7 priority bits (most significant), 13 filter bits, and the remaining bits unused (0).]

FIGURE 43.6 A CAN identifier on an extended CAN network. The network clause has defined the CAN identifiers to have 7 priority bits and 13 filter bits. The least significant bit of the value corresponds with the bit of the identifier transmitted last. Only legal CAN identifiers can be specified: identifiers with the 7 most significant bits equal to 1 are illegal according to the CAN standard.
The filter bits of a frame are set for each ECU by which the frame needs to be seen. So a frame that is broadcast to all ECUs on the network is assigned filter bits all set to 1. For a frame sent to a single ECU on the network, just one filter bit is set. Figure 43.6 illustrates this; the frame shown is sent to four ECUs.
If an ECU takes an interrupt for just the frames that it needs, then the filtering is said to be perfect. In some systems there may be more ECUs needing to receive frames by interrupt than there are filter bits in the network; in this case, some ECUs will need to share a bit. If this happens, then Volcano will filter the frames in software, using the priority bits to uniquely identify the frame and discarding unwanted frames.
The priority bits are the most significant bits. They indicate priority and uniquely identify a frame. The number of priority bits must be large enough to uniquely identify a frame in a given network configuration. The priority bits for a given frame are set by the relative urgency (or deadline) of the frame. This is derived from how urgently each subscriber of a signal in the frame needs the signal (as described earlier). In most systems 5 to 10 priority bits are sufficient.
The filter bits are the remaining least significant bits and are used to indicate the destination ECUs for a given frame. This is done by treating them as a target mask: each ECU (or group of ECUs) is assigned a single filter bit. The filtering for a CAN controller in the ECU is set up to accept only frames where the corresponding filter bit in the identifier is set. This can give perfect filtering: an interrupt is raised if and only if the frame is needed by the ECU. Perfect filtering can dramatically reduce the CPU load compared with filtering in software. Indeed, perfect filtering is essential if the system integrator needs to connect ECUs with slow 8-bit CPUs to high-speed CAN networks (if filtering were implemented in software, the CPU would spend most of its available processing time handling interrupts and discarding unwanted frames). The filtering scheme also allows broadcast of a frame to an arbitrary set of ECUs. This can reduce the traffic on the bus, since frames do not need to be transmitted several times to different destinations. Because the system integrator is able to define the configuration data, and because that data defines the complete network behavior of an ECU, the in-vehicle networks are under the control of the system integrator.
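The priority/filter split can be sketched in a few lines of C. This matches the 7 priority bits and 13 filter bits of Figure 43.6 (the remaining low-order bits of the 29-bit extended identifier are unused); it illustrates the idea only and is not the actual Volcano implementation.

```c
#include <stdint.h>

/* Illustrative split of a 29-bit extended CAN identifier into priority
 * bits (most significant) and filter bits, per Figure 43.6. This is a
 * sketch of the idea, not the actual Volcano implementation. */
#define ID_BITS       29
#define PRIORITY_BITS  7
#define FILTER_BITS   13
#define UNUSED_BITS   (ID_BITS - PRIORITY_BITS - FILTER_BITS)

/* Build an identifier from a unique priority and a destination mask
 * holding one filter bit per receiving ECU (or ECU group). */
uint32_t make_id(uint32_t priority, uint32_t dest_mask)
{
    return (priority << (FILTER_BITS + UNUSED_BITS))
         | (dest_mask << UNUSED_BITS);
}

/* Hardware-style acceptance test for one ECU: accept the frame only if
 * the ECU's own filter bit is set ("must match 1"; every other identifier
 * bit is "don't care"). */
int ecu_accepts(uint32_t id, unsigned ecu_filter_bit)
{
    return (id & (1u << (ecu_filter_bit + UNUSED_BITS))) != 0;
}
```

For example, a frame built with `make_id(5, (1u << 0) | (1u << 2) | (1u << 7))` is accepted by ECUs 0, 2, and 7 and rejected in hardware by all others, so only those three ECUs ever take an interrupt for it.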
43.3.2.8 Multiprotocol Support
The existing version of VNA supports the complementary, contemporary network protocols CAN and LIN. The next version will also have support for the FlexRay protocol. A prototype version of VNA with
partial MOST support is currently under construction. As network technology continues to advance into
other protocols, VNA will also move to support these advances.
43.3.2.9 Gatewaying
A network normally consists of multiple network segments using different protocols. Signals may be transferred from one segment to another through a gateway node. As implemented throughout the whole tool chain of Volcano Automotive Group, gatewaying of data, even across multiple protocols, is automatically configured in VNA. In this way VNA allows any node to subscribe to any signal generated on any network without needing to know how this signal is gatewayed from the publishing node. Handling of timing requirements over one or more gateways is also handled by VNA. The Volcano solution requires no special gatewaying hardware and therefore provides the most cost-efficient solution to signal gatewaying.
43.3.2.10 Data Export and Import
The VNA tool enables the OEMs to achieve a close integration between VNA and functional modeling tools, and to share data among different OEMs and subcontractors, for example, node developers.
Support for emerging standards, such as FIBEX and XML, will further simplify information sharing and become a basis for configuration of third-party communication layers.
43.4 Volcano Software in an ECU
The Volcano tool chain includes networking software running in each ECU in the system. This software uses the configuration data to control the transmission and reception of frames on one or more buses and to present signals to the application programmer. One view of the Volcano network software is as a communications engine under the control of the system integrator. The view of the application programmer is different: the software is a black box into which published signals are placed, and out of which subscribed signals can be summoned.
The main implementation goals for the Volcano target software are as follows:
• Predictable real-time behavior: no data loss under any circumstances.
• Efficiency: low RAM usage, fast execution time, small code size.
• Portability: low cost of moving to a new platform.
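The black-box view can be modeled with a toy signal table: the application writes published signals and reads subscribed signals, while frame packing and transmission stay hidden behind the configuration data. All names here are invented for illustration and do not reflect the actual Volcano (VTP) API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the "black box" seen by the application programmer.
 * Names are invented; this is not the actual Volcano (VTP) API. */
#define MAX_SIGNALS 16

static struct {
    uint32_t value;
    bool     updated;          /* set on write, cleared on read */
} sig_table[MAX_SIGNALS];

/* Publish: place a signal value into the communications engine. */
void sig_write(unsigned id, uint32_t value)
{
    sig_table[id].value = value;
    sig_table[id].updated = true;
}

/* Subscribe: summon the latest value of a signal; *updated tells the
 * caller whether a new value has arrived since the last read. */
uint32_t sig_read(unsigned id, bool *updated)
{
    if (updated) {
        *updated = sig_table[id].updated;
        sig_table[id].updated = false;
    }
    return sig_table[id].value;
}
```

In such a model, the engine (not the application) decides when and in which frame a written value actually leaves the node, which is what keeps the network under the system integrator's control.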
43.4.1 Volcano Configuration
Building a configuration is a key part of the Volcano concept. A configuration is, as already mentioned, based on details such as how signals are mapped into frames, allocation of identifiers, and processing intervals.
For each ECU, there are two authorities acting in the configuration process: the system integrator and the ECU supplier. The system integrator provides the Volcano configuration for the ECU regarding the network behavior at the system level, and the supplier provides the Volcano configuration data for the ECU in terms of its internal behavior.
43.4.1.1 The Configuration Files
The Volcano configuration data is captured in four different types of files. These are:
• Fixed information, which is agreed between the supplier and the system integrator.
• Private information, which is provided by the ECU supplier. The ECU supplier does not necessarily have to provide this information to the system integrator.
• Network configuration information, which is supplied by the system integrator.
• Target information, which is the supplier's description of the ECU published to the system integrator.
43.4.1.1.1 Fixed Information
The fixed information is the most important in achieving a working system. It consists of a complete description of the dependencies between the ECU and the network. This includes a description of the signals the ECU needs from the network, how often Volcano calls will be executed, and so on. The information also includes a description of the CAN controller(s), and possible limitations regarding reception and transmission boundaries and supported frame modes. The fixed information forms a contract between the supplier and the system integrator: the information should not be changed without both parties being aware of the changes. The fixed information file is referred to as the FIX file.
43.4.1.1.2 Private Information
The private file contains additional information for Volcano that does not affect the network: timeout values associated with signals and which flags are used by the application. The private information file is referred to as the PRI file.
43.4.1.1.3 Network Information
The network information specifies the network configuration of the ECU. The system integrator must define the number of frames sent from and received by the ECU, the frame identifiers and lengths, and details of how the signals in the agreed information are mapped into these frames. Here, the vehicle manufacturer also defines the different frame modes used in the network. The network information file is referred to as the NET file.
43.4.1.1.4 Target Information
The target information contains information about the resources that the supplier has allocated to Volcano in the ECU. It describes the ECU's hardware (e.g., which CAN controllers are used and where they are mapped in memory). The target information file is referred to as the TGT file.
43.4.2 Workflow
The Volcano system identifies two major roles in the development of a network of ECUs: the application designer (which may include the designer of the ECU system or the application programmer) and the system integrator. The application designer is typically located at the organization developing the ECU hardware and application software. The system integrator is typically located at the vehicle manufacturer. The interface between the application designer and the system integrator is carefully controlled, and the information owned by each side is strictly defined. The Volcano tool chain implementation clearly reflects this partitioning of roles.
The Volcano system includes a number of tools to help the system integrator in defining a network configuration. The Network Architect is a high-level design tool, with a database containing all the publish/subscribe information available for each ECU, as described in the previous sections. After mapping the signaling needs onto a particular network architecture, thus defining the connections between the published and subscribed signals, an automatic Frame Compiler is run. The Frame Compiler tool uses the requirements captured earlier to build a configuration that meets those requirements. There are many possibilities to optimize the bus behavior. The frame compiler includes the CAN bus timing analysis and LIN schedule table generation, and will not generate a configuration that violates the timing requirements placed on the system. The frame compiler also uses the analysis to answer "what if?" type questions and guide the user in building a valid and optimized network configuration.
The output of the frame compiler is used to build configuration data specific to each ECU. This is used by the Volcano target software in the ECU to properly configure and use the hardware resources.
The Volcano configuration data generator tool set (V5CFG/V5BND) is used to translate this ASCII text information to executable binary code in the following way:
• When the supplier executes the tool, it reads the FIX, PRI, and TGT files to generate compile-time data files. These data files are compiled and linked with the application program, together with the Volcano library supplied for the specific ECU system.
• When the vehicle manufacturer executes the tool, it reads the FIX, NET, and TGT files to generate the binary data that is to be located in the ECU's Volcano configuration memory (known as the Volcano NVRAM). An ECU is then configured (or reconfigured) by downloading the binary data to the ECU's memory.
Note: It is vital to realize that changes to either the FIX or the TGT file cannot be made without coordination between the system integrator and the ECU supplier. The vehicle manufacturer can, however, change the NET file without informing the ECU supplier. In the same way, the ECU supplier can change the PRI file without informing the system integrator.
Figure 43.7 shows how the Volcano Target Code for an ECU is configured by the supplier and the system integrator.
[Figure 43.7 shows the configuration flow. On the ECU supplier side, the V5CFG configuration tool reads the Fixed, Private, and Target files to produce compile-time data, which is compiled and linked with the application program and the Volcano 5 target library into the program code (ROM/FLASH/EEPROM). On the vehicle manufacturer side, the V5CFG configuration tool reads the Fixed, Network, and Target files, and V5BND target code generation produces the binary data for ECU configuration that is downloaded into the ECU's Volcano 5 NVRAM pool.]

FIGURE 43.7 Volcano Target Code configuration process.
The Volcano concept and related products have been successfully used in production since 1996. Present car OEMs using the entire tool chain are Aston Martin, Jaguar, Land Rover, MG Rover, Volvo Cars, and Volvo Bus Corporation.
Acknowledgments
I wish to acknowledge my colleagues at Volcano Automotive Group, in particular István Horváth, Niklas Amberntsson, and Mats Ramnefors, for their contributions to this chapter.
References
[1] K. Tindell and A. Burns, Guaranteeing Message Latencies on Controller Area Network (CAN), in Proceedings of the 1st International CAN Conference, 1994.
[2] K. Tindell, A. Rajnak, and L. Casparsson, CAN Communications Concept with Guaranteed Message Latencies, SAE paper, 1998.
[3] L. Casparsson, K. Tindell, A. Rajnak, and P. Malmberg, Volcano: A Revolution in On-board Communication, Volvo Technology Report, 1998.
[4] W. Specks and A. Rajnak, The Scaleable Network Architecture of the Volvo S80, in Proceedings of the 8th International Conference on Electronic Systems for Vehicles, Baden-Baden, October 1998, pp. 597–641.
More Information
http://www.VolcanoAutomotive.com
Industrial Automation
44 Embedded Web Servers in Distributed Control Systems
Jacek Szymanski
45 HTTP Digest Authentication for Embedded Web Servers
Mario Crevatin and Thomas P. von Hoff
44
Embedded Web Servers in Distributed Control Systems

Jacek Szymanski
Alstom Transport
44.1 Objective and Contents
44.2 Application Context
44.3 FDWS Technology
    Embedded Server Functions • Embedded Site Structure • Embedded Server Operation • Site Implementation
44.4 Guided Tour to Embedded Server Implementation
    Steps of Embedded Site Implementation Process • Implementation of VFS • Implementation of Look-and-Feel Objects • Implementation of Page Composition Routines • Implementation of Script Activation Routines • Implementation of Application Wrappers • Putting Pieces Together
44.5 Example of Site Implementation in a HART Protocol Gateway
    Structure of the Site Embedded in the Protocol Gateway • Detailed Implementation of Principal Functions • Access to Site Home Page • Access to Parameters of the Gateway • Access to Active Channel List • Access to Channel Parameters • Monitoring of Principal Channel Measure • Access Control and Authentication • Application Wrapper
44.6 Architecture Summary and Test Case Description for the Embedded Server
    Embedded Site Architecture • Test Description • Test Scenarios
44.7 Summing Up
References
44.A1 Appendix: Configuration of VFS
    Programming of VFS Component • BNF of Specification Language • Specification Example
44.1 Objective and Contents
In today's landscape of information technology the World Wide Web (WWW) is omnipresent. Its usage is unavoidable in everyday life and in all domains. The WWW technology has been around for approximately
15 years. From its early days the WWW has passed through many phases, from initial research status through the euphoria of "e-anything" in the late 1990s, to today's mature status in the domains of information broadcast, advertising, and e-commerce.
The main objective of this account is to present the application of web-related technologies in the domain of industrial control, and more specifically in the area of distributed control equipment operating on the shop floor around the fieldbus interconnections. The rationale for using the technology is to provide access to (control) system elements by communication means based on the industrial standards defined around the WWW.
Embedded web servers are now omnipresent within packages proposed by different software editors. The proposed products differ in size, performance, price, architecture, and application area. The objective of this account is not to provide a review of the existing products or a comparison of them. Rather than reviewing the features of ready-made solutions, the account proposes to go through the technological bases on which the construction of these applications relies, using the example of an existing software application called in the sequel the Field Device Web Server (FDWS).
The FDWS was implemented with the objective of enhancing the operational functionality of a large class of distributed control system architectures, and especially the fieldbus-based parts of these architectures, by providing them with the power and the flexibility of internet technology.
The account outlines the design of embedded web servers in two steps. First, it presents the context in which embedded web servers are usually implemented. Second, it sketches the structure of an FDWS application, with a presentation of its component packages and the mutual relationship between the content of the packages and the architecture of a typical embedded site. The main motivation of the account is, however, to show the user an exemplary approach to embedded site design. For this reason, an illustrated real-life example presents the details of the design, implementation, and test trials of an embedded website implemented in an existing field device.
44.2 Application Context
To sketch the impact of the technology on control applications, it is important to identify the location of a field device in a typical architecture of a control system, as shown in Figure 44.1. The field device is part of an automation cell: a collection of cooperating instruments that realize a well-defined automation function. These devices are of different levels of complexity, from simple sensors with extremely limited functions to process computers equipped with powerful processors and large memory banks containing several embedded software programs.
Automation cells of a control system cooperate in order to implement a coherent control application. The cooperation is possible thanks to information exchange via a higher-level network that forms the system backbone. The backbone links the automation cells with the control room supervisory computers that provide the interface between the system and human operators. So the global control applications are structured into two collections of functions:
• Automation functions, implemented in automation cells.
• Supervisory functions, implemented in control room computers.
Information exchange between the two parts is based almost exclusively on the client-server paradigm.
The idea at the base of the development of the FDWS has its origins in the analysis of the structure of the WWW, and in the observation that it makes a perfect example of a successful application of the interoperability principle applied to diverse software products. The interoperability of internet products (clients and servers) is based on universally accepted standards, namely:
• the TCP/IP protocol for reliable data transport;
• the HTTP protocol for application information exchange;
• the XML/HTML format for information presentation and structuring.
[Figure 44.1 shows field devices attached to a fieldbus within an automation cell, linked via an interconnecting network to a control room console and a process computer.]

FIGURE 44.1 Place of field devices in an automation system. (From J. Szymanski, Proceedings of WFCS 2000, September 6–8, ISEP, Porto, Portugal, 2000, pp. 301–308. With permission. Copyright 2000 IEEE.)
[Figure 44.2 shows the client tier, server tier, and application tier of an internet client-server application.]

FIGURE 44.2 Three-tier architecture of an internet client-server application. (From J. Szymanski, Proceedings of WFCS 2000, September 6–8, ISEP, Porto, Portugal, 2000, pp. 301–308. With permission. Copyright 2000 IEEE.)
Another successful application of universally accepted standards concerns the architectural pattern. The internet-based distributed applications follow the principle of multitier architecture (Figure 44.2), which makes use of universal client and server frameworks independent of the nature of the data processed, on the condition that the data exchanged over the network are transported via the HTTP protocol and are structured according to the XML/HTML format.
The multitier architecture standardizes the basic services of the client and server parts of the architecture. In the majority of configurations, the client part is totally independent of the application (the so-called thin client). It is important to state that the properties and advantages of the multitier architecture are independent of the technology used for implementation of its components, and are valid for both embedded and nonembedded implementations.
The universal nature of the client places the burden of application personalization on the server side, which is interfaced directly with the embedded application software. This configuration is at the origin of the internal architecture of the server described in the further sections of the account.
This architecture simplifies the development process of newly created applications. As one can see, the client tier of the architecture is totally standard and as such does not have to be developed. A big part of the server tier is also based on generic modules, such as the TCP/IP and socket modules, which are included in the standard deliveries for the majority of implementation platforms. The FDWS is designed to interface with these modules and provides a large degree of independence from the implementation details.
Not all application-independent modules exist on all implementation platforms, and FDWS provides a collection of software modules that can be included in an application. It is important to state here that the FDWS is only an exemplary presentation of embedded web technology. For other implementations, consider Reference 1.
Figure 44.2 shows the three-tier version of the architecture, where the generic parts described above are completed by the application-dependent part, which is considered monolithic. In the quest for further factorization of the design, this monolithic software block can be split into thinner layers.
The application-dependent part of the server tier is to be developed for each application. Despite the evident advantages of a standard architectural pattern, the design and implementation process for this part of the embedded site is not an easy task. The reason is that it requires technical fluency in four disciplines:
• Comprehension of the basic application implemented by the equipment hosting the site. This is because the operation of the hosting equipment should be enhanced rather than modified: the embedded site is then an extension of the existent application.
• Skill in the creation of HTTP-based websites, implying knowledge of technologies such as CGI scripting, HTML, Java, JavaScript, etc. This is because these techniques are the basis of operation of components executed in the generic client.
• Good knowledge of the constraints of the platform on which the site is to be installed. This is imposed by the principle of minimal interference with the basic application, and should influence the complexity of the site structure as well as the size of the site components.
• Good comprehension of the FDWS technology, at least of the part in charge of data transfer through the server tier toward the application tier.
The guided tour through the development of an application embedded in a field device is described in the sections which follow.
44.3 FDWS Technology
44.3.1 Embedded Server Functions
The place of the server tier in the structure presented in Figure 44.2 shows its role in the architecture of the application. This role has nothing to do with the organic mission of the hosting equipment; that is, the server software will not directly intervene in the execution of control algorithms when installed on a Programmable Logic Controller (PLC), nor will it do any protocol conversion when installed in a protocol gateway such as the HART/FIP converter described below. Rather, the role of the embedded server covers the following:
1. The embedded server can store and serve the complete interface to the application within the equipment.
2. The server can activate routines that are able to extract and interpret orders sent from the client part and modify the application status via an accessible interface.
2006 by Taylor & Francis Group, LLC
Embedded Web Servers in Distributed Control Systems 44-5
3. Server-resident routines can extract the information coming from the application, format it, and wrap it into HTML pages in order to provide them to the client side.
4. Dynamically generated user-interface components can easily manage the evolution of the visual aspect of the user interface as a function of the application status; this can ease operations such as anomaly signaling.
5. Internal server mechanisms provide the possibility of easy implementation of password-based security locks.
In this context, the server role consists in providing a highly flexible and relatively easily implemented interface. This interface provides remote clients with controlled and configurable access mechanisms to the data, structure, status, and processing modes of the (organic) applications embedded in control system devices.
The FDWS software is designed to implement all the required functions of an embedded server. These functions express the requirements from the point of view of the final user and have to be reformulated in terms of communication architecture. From this viewpoint the embedded site takes the shape of an HTTP protocol server operating above the TCP/IP transport. The basic functions of such an entity are:
1. Management of connections coming from distant clients.
2. Analysis of clients' requests, in terms of syntax and semantics.
3. Maintenance of local server objects in view of their access by distant clients.
4. Decision to grant or refuse access to server objects; composition and transmission of responses corresponding to clients' requests.
5. Execution of the processing expressed in clients' requests.
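Function 2 in the list above, request analysis, can be illustrated with a small sketch. The routine below is not part of FDWS; its name and behavior are hypothetical, showing only how the first line of an HTTP request might be split into its syntactic parts in C:

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical sketch: split the first line of an HTTP request,
 * e.g. "GET /index.html HTTP/1.0", into method, path and version.
 * Returns 0 on success, -1 when the line is not well formed. */
int parse_request_line(const char *line,
                       char *method, size_t mlen,
                       char *path, size_t plen,
                       char *version, size_t vlen)
{
    char m[16], p[128], v[16];

    /* Three whitespace-separated tokens are required. */
    if (sscanf(line, "%15s %127s %15s", m, p, v) != 3)
        return -1;
    /* The third token must name an HTTP version. */
    if (strncmp(v, "HTTP/", 5) != 0)
        return -1;
    snprintf(method, mlen, "%s", m);
    snprintf(path, plen, "%s", p);
    snprintf(version, vlen, "%s", v);
    return 0;
}
```

A real parser would additionally recover the header fields and CGI variables mentioned later in the chapter.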
The FDWS software is structured in five interrelated software packages. Each of the basic server functions is supported by one or more software modules. The roles of the packages are explained in Table 44.1.
The software is organized in five packages for better design, easier deployment, and maintainability. Figure 44.3 shows the mutual interdependence among the packages. In a typical implementation, the modules from all five packages have to be used in embedded site construction.
44.3.2 Embedded Site Structure
The architecture of an embedded server does not differ in principle from the architecture of a regular
(nonembedded) web-enabled application.
TABLE 44.1 Package Functions of FDWS Software
Main server engine: in charge of the connection management process; this package groups the modules that realize the functions of server engine operation, network adaptation, and support of persistence of request data
HTTP request parser: in charge of request analysis; this package implements the parsing of the PDU and the building of the CGI environment
Embedded file system: in charge of controlling access to server objects; this package provides the elements that support the implementation of an embedded equivalent of a disk file system
Dynamic page generator: in charge of on-the-fly generation of server responses (servlets); this package provides the elements that support the implementation of dynamically generated HTML pages
Embedded response composer: in charge of response composition
44-6 Embedded Systems Handbook
[Figure 44.3 diagram: boxes for the HTTP request parser, main server engine, embedded response composer, dynamic page generator, and embedded file system packages, showing their mutual interdependence.]
FIGURE 44.3 General architecture of FieldWebServer software. (From J. Szymanski, Proceedings of WFCS 2000, September 6-8, ISEP, Porto, Portugal, 2000, pp. 301-308. With permission. Copyright 2000 IEEE.)
In the most general terms, every server tier is composed of three basic elements:
1. Generic server body: the principal active component that loops listening for incoming service requests and processes them; request processing consists in:
(a) Parsing the Protocol Data Unit (PDU) syntax.
(b) Recovering environment variables in order to support server operations.
(c) Identifying requested resources together with the operations to be applied to them.
The generic server body is in principle independent of the applications in which it is incorporated. Its basic elements are the server engine, request parser, response composer, and persistence module.
2. Virtual File System (VFS): an embedded object repository organized like the file system of a typical computer. It is an active component implementing the logistics of server page management. This component helps manage the collection of objects in direct contact with the application.
3. Collection of application-specific components: elements that implement both the look-and-feel part of the application (HTML pages, compressed images, Java applets, ActiveX controls) and its dynamics (embedded scripts and servlets). These components, which are managed by the VFS, are designed to convey data between the client part and the essential application. Naturally, these elements are totally dependent on the application.
Analysis of the structure of the embedded server reveals yet another building block of the architecture: the application wrapper. This block is very often introduced into the device structure for convenience. Its role consists in adapting the functional interface of the basic application to the needs of the page composition module. The structure of this block is totally dependent on the basic application. Its construction is not supported by the modules of the FieldWebServer, and for this reason it is left outside of the server-tier structure.
Taking these considerations into account, the whole Internet-based server architecture in the context of a control device can be represented by the schematic in Figure 44.4. The left part of the schematic shows the software architecture at run time, which reveals the mutual relationships among all building block instances. The right part of the schematic shows the organization of the FieldWebServer module library.
44.3.3 Embedded Server Operation
Application of FDWS technology in a device is possible if and only if three conditions are fulfilled:
The execution model of the software embedded in the device is based on the multiprocess paradigm; it is necessary that all server operations be encapsulated in a separate thread of execution.
[Figure 44.4 diagram: two views of the server stack, with blocks for the basic application, application wrapper module, VFS, page composition module, HTML pages, images and applets, CGI scripts, server engine with request analyser, response composer and persistence module, socket presentation layer, TCP/IP, and fieldbus stack (layers 1 and 2).]
FIGURE 44.4 Architecture of an Internet-based application in a fieldbus equipment. (From J. Szymanski, Proceedings of WFCS 2000, September 6-8, ISEP, Porto, Portugal, 2000, pp. 301-308. With permission. Copyright 2000 IEEE.)
The device processing power is sufficient to support the additional computational burden caused by server operation.
The basic (organic) application of the device exposes a well-defined programming interface that provides server routines with the means of accessing the application data.
Putting the server into execution requires two operations, which should be executed in order:
The server data that implement internal server objects have to be configured.
The server engine routine has to be activated by the device monitoring program in an independent execution thread.
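The two startup operations can be sketched as follows. Everything here is illustrative: configure_server_data, server_engine_thread, and start_embedded_server are hypothetical names; a real FDWS site would perform its configuration and then hand control to the engine routine (server_boot) inside the new thread:

```c
#include <pthread.h>
#include <stddef.h>

/* Operation 1: configure the data that implement the internal server
 * objects. A real site would set up the VFS root, the CGI root, and
 * the two callback pointers; the sketch only records the fact. */
static int server_configured = 0;

static void configure_server_data(void)
{
    server_configured = 1;
}

/* Operation 2: the engine routine runs in its own thread. A real
 * implementation would call the FDWS engine (server_boot) and never
 * return; the sketch just verifies that configuration came first. */
static void *server_engine_thread(void *arg)
{
    (void)arg;
    return server_configured ? (void *)1 : NULL;
}

/* Execute the two operations in the required order. */
int start_embedded_server(unsigned short port)
{
    pthread_t tid;
    void *result;
    (void)port;                       /* would be passed to server_boot */

    configure_server_data();          /* operation 1 */
    if (pthread_create(&tid, NULL, server_engine_thread, NULL) != 0)
        return -1;                    /* operation 2 failed */
    pthread_join(tid, &result);       /* a real server would not join */
    return result != NULL ? 0 : -1;
}
```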
44.3.3.1 Minimal Server Interface
Efficient development of embedded servers using the FDWS technology is possible only when the designer understands the basic elements of the server interface and server operation. The structure of the FDWS software allows the user to access the interface of all the modules from which its packages are constructed. This means that an advanced user can call more than a hundred functions and has direct access to many tens of global variables. Doing so is a very complex task and in the majority of cases unnecessary. All that an average FDWS user needs to be aware of is limited to some five modules from three packages. When the user has adequate software configuration tools at hand, his or her knowledge of the large FDWS interface can be limited to three data types, three global variables, and fewer than 10 routine calls.
Normal server operation is composed of two phases:
Initialization and configuration of internal data structures; this phase is executed only at server thread startup.
Activation of the main server loop (the server engine routine), lasting until the server task is destroyed.
Both phases are described below.
44.3.3.2 Init Phase
The server engine routine operates on four global variables, all exposed by the modules of the request parser package and the module of the server engine package. The variables have the following meaning:
Pointer to VFS root. Pointer to the data structure which is the root of the VFS. The VFS represents the store of server-owned objects and is structured like a disk file system. It should contain all the passive objects that the server is supposed to provide: HTML pages, images of all formats, and Java applets.
Pointer to CGI root. Pointer to the data structure which represents the structured store of active server objects, commonly referred to as CGI scripts.
Script exec routine. Pointer to a routine which is in charge of the activation of CGI script routines.
Page compose routine. Pointer to a routine which is in charge of the composition of passive objects stored in the VFS.
Comprehension of the role of these variables is the key to understanding server operation, since the software user must provide adequate values for them. All four variables should be duly initialized prior to activation of the server engine routine, or the server will not be able to provide any object or execute any CGI routine. The initialization code should be written by the application programmer according to the principles described in Section 44.3.3.3.
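A minimal sketch of this interface, with hypothetical type and variable names standing in for the FDWS ones, might look as follows; the guard function illustrates why all four pointers must be set before the engine starts:

```c
#include <stddef.h>

/* Hypothetical types standing in for the FDWS ones. */
struct vfs_node { int unused; };      /* passive-object tree node */
struct cgi_node { int unused; };      /* active-object tree node  */
typedef int (*script_exec_fn)(const char *name);
typedef int (*page_compose_fn)(const char *name);

/* The four globals the engine operates on (names are illustrative). */
struct vfs_node *vfs_root             = NULL;
struct cgi_node *cgi_root             = NULL;
script_exec_fn   script_exec_routine  = NULL;
page_compose_fn  page_compose_routine = NULL;

/* Engine-side guard: a server started before all four variables are
 * initialized could serve no object and execute no CGI routine. */
int server_ready(void)
{
    return vfs_root != NULL && cgi_root != NULL &&
           script_exec_routine != NULL && page_compose_routine != NULL;
}

/* Dummy values used only to illustrate the initialization step. */
static struct vfs_node the_vfs_root;
static struct cgi_node the_cgi_root;
static int demo_exec(const char *name)    { (void)name; return 0; }
static int demo_compose(const char *name) { (void)name; return 0; }

void init_server_globals(void)
{
    vfs_root             = &the_vfs_root;
    cgi_root             = &the_cgi_root;
    script_exec_routine  = demo_exec;
    page_compose_routine = demo_compose;
}
```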
44.3.3.3 Operation Phase
The initialization phase configures vital server resources. All passive and active components of the embedded site are placed within reach of the server engine via the VFS root and CGI root pointers. The engine is also provided with its methods of reaction to client requests via the two user routines pointed to by script exec routine and page compose routine. The operation phase can now be effectively activated. The FDWS software contains a ready-made server engine that implements the operation phase according to a predefined scheme that suits the majority of cases. This standard server engine is implemented by a routine from the server engine package.
The signature of the routine is as follows:
int server_boot (unsigned short service)
where the formal parameter service is the TCP port number on which the server listens for connections from distant clients.
The standard server engine provided within the FDWS software implements the policy of an iterative server; that is, a server which processes the requests received from distant clients sequentially. This means that two transactions are never executed by the server at the same time. If a series of requests arrives at the site, their data are queued in the buffers of the communication software modules.
Server engine operation entails execution of an initialization step followed by a loop over a sequence of five steps. The most important operation of the initialization step is an attempt to create an access point to the network via a passive socket of SOCK_STREAM type, bound to the IP address of the hosting station and to the TCP port number passed as the procedure parameter.
If this operation succeeds, control passes to the loop; otherwise, the routine exits with an error code and an error message that is passed to the system of the embedding device.
Success in the creation of this socket, often referred to as the main socket, enables the program to enter the processing loop. Server operation then proceeds in the following steps:
Step 1. The server waits passively for incoming requests from distant clients. On arrival of a client's request, the server attempts to accept the connection request and open a secondary (stream-type) socket. If this operation fails, the secondary socket is not created and the execution thread returns to listening on the main socket. In the case of success, step 2 is executed.
Step 2. The server reads and parses the request data unit received via the secondary socket. In this second step of the processing loop, the HTTP request coming from a distant client is received and analyzed. This analysis is done by a set of routines grouped in the HTTP parser package.
If the request structure is recognized as conformant with the version of the protocol implemented by the server, the parsing routine extracts all the important data, which allow the server to elaborate the response. These data concern the following parameters contained in the request data unit:
Protocol version
Requested HTTP service (GET or POST)
Full identification of the serviced object (path within the server internal structure, object name, extension)
Object class (HTML page, CGI routine, applet, image, . . .)
Browser options (type, accepted formats, accepted languages, OS type, etc.)
Request parameters (optionally, if included in the request)
CGI variables (optionally, if included in the request)
If the request data are nonconformant, the analysis is abandoned and the parsing routine returns an error status. If an error occurs, loop control is transferred to step 4; otherwise, processing continues with step 3.
Step 3. The server searches for the object identified by the request analysis, then prepares and sends the response data unit. Successful termination of the request analysis provides the server with all the data necessary to elaborate a response matching the received request. The step of response preparation is decomposed into three sequentially executed actions:
Identification of the object class (passive object or CGI script).
Object search within one of the server object repositories. To execute this action, object management routines exploit the data provided by the user in the initialization phase (the server page root and CGI script root pointers to the object repositories). If the requested object is found, the next action is executed; otherwise, a standard "not found" page is sent back to the client.
Object composition. To execute this action, generic server routines call the user-provided routines plugged into the loop thread in the initialization phase via the user-configured pointers script exec routine and page compose routine.
The generic part of these actions is implemented by the routines from the server engine and VFS packages. Execution of this step always transfers execution control to step 5.
Step 4. Error report. This step is the alternative to step 3 and is executed only when the analysis of the received request declares its structure nonconformant with the conventions recognized by the server. In such a case, a standard error-notifying page is sent back to the client. This situation should be considered an implementation of a graceful failure mode.
Step 5. Connection closure. According to the HTTP protocol requirements, the complete transaction between the client and the server should be terminated by a request for connection closure initiated by the server. The routine's operation is limited to closing the secondary socket.
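The control flow of the five steps can be condensed into a small sketch (the step numbering follows the text; the function and enum names are invented for illustration):

```c
/* The five loop steps, numbered as in the text. */
enum step {
    STEP_ACCEPT  = 1,  /* wait for and accept a connection     */
    STEP_PARSE   = 2,  /* read and parse the request data unit */
    STEP_RESPOND = 3,  /* search object, send the response     */
    STEP_ERROR   = 4,  /* report a nonconformant request       */
    STEP_CLOSE   = 5   /* close the secondary socket           */
};

/* Given the current step and whether it succeeded, return the next
 * step, following the transitions described in the text. */
int next_step(int step, int ok)
{
    switch (step) {
    case STEP_ACCEPT:  return ok ? STEP_PARSE : STEP_ACCEPT;
    case STEP_PARSE:   return ok ? STEP_RESPOND : STEP_ERROR;
    case STEP_RESPOND: return STEP_CLOSE;   /* step 3 always goes to 5 */
    case STEP_ERROR:   return STEP_CLOSE;   /* step 4 also ends in closure */
    case STEP_CLOSE:   return STEP_ACCEPT;  /* back to passive waiting */
    default:           return STEP_ACCEPT;
    }
}
```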
Application of the ready-made solution freezes the main features of the embedded site. For example, the request processing policy is imposed to be iterative. The advantage of such a solution for an embedded application is evident: there are no problems with concurrent accesses to server-owned objects. The disadvantage is a loss of efficiency, since some client requests can be rejected by the underlying network software if the processing loop fails to fetch arriving requests at a sufficiently rapid pace.
/* crude server loop routine */
void server_loop(
unsigned short service_port_nr,
tcallback parse_request_routine,
tcallback generate_response_routine,
tcallback error_report_routine,
tcallback closing_routine);
FIGURE 44.5 Code of the server routine body. (From J. Szymanski, Proceedings of WFCS 2000, September 6-8, ISEP, Porto, Portugal, 2000, pp. 301-308. With permission. Copyright 2000 IEEE.)
The advanced user is not obliged to follow the standard engine. The server engine package contains skeletal support for user-configurable engines, implemented by the routine server_loop, which has five parameters as shown in Figure 44.5.
The routine's first parameter is the TCP port number; the four others are pointers to user-provided routines that should implement steps 2, 3, 4, and 5 of the loop described above. If any of these parameters is set to NULL, the corresponding step of the loop is implemented by the default routine provided by the FDWS packages. The application programmer can thus take over control of any of the execution phases by providing a pointer to his or her own routine.
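The override mechanism can be illustrated with a mock. The routine below is not the FDWS server_loop; it merely mimics its NULL-means-default convention to show how many steps remain on the default implementations when the user supplies a single custom routine:

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical callback type in the spirit of Figure 44.5. */
typedef int (*tcallback)(void *request);

/* A user-provided replacement for step 4 (error report). */
int my_error_report(void *request)
{
    (void)request;
    fputs("custom error page served\n", stderr);
    return 0;
}

/* Mock of the NULL-means-default convention: count how many of the
 * four customizable steps fall back to the FDWS default routines. */
int count_default_steps(tcallback parse, tcallback respond,
                        tcallback error, tcallback closing)
{
    int defaults = 0;
    if (parse == NULL)   defaults++;  /* step 2 */
    if (respond == NULL) defaults++;  /* step 3 */
    if (error == NULL)   defaults++;  /* step 4 */
    if (closing == NULL) defaults++;  /* step 5 */
    return defaults;
}
```

With the real routine the corresponding call would be server_loop(80, NULL, NULL, my_error_report, NULL), overriding only the error-report step.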
44.3.4 Site Implementation
The programmer who wishes to implement the embedded site should undertake the following steps:
Create the routines that fill in the appropriate memory regions with embedded HTML objects (pages, page templates, images, applets).
Program the routines that implement the CGI scripts referenced within the embedded objects.
Provide the routines that generate the data structures enabling the management of all repositories included in the VFS. These structures should contain references to the memory regions in which the server-embedded objects are stored, and should also hold the addresses of all routines implementing CGI scripts. These routines should assign the reference of the data structure to the server page root and the script exec routine pointers.
Provide the routines that determine the actions to be undertaken on invocation of each server object and on activation of each CGI script; references to these routines should be assigned to the send_page_routine and send_script_routine pointers (the implementations of script exec routine and page compose routine, respectively).
Write an initialization routine that activates all the necessary configuration actions described above.
Create the server process that calls the initialization routine and activates the server engine routine.
All the steps need to be realized according to certain rules described in the next section.
44.4 Guided Tour to Embedded Server Implementation
44.4.1 Steps of Embedded Site Implementation Process
The application programmer should understand the main features of FDWS software operation. If he or she accepts the predefined operation mode, nothing needs to be modified or extended in the routine implementing the main server loop, described in Section 44.2.4. In case he or she decides to customize one or more phases of the main server loop, the development effort increases. In any case, the entire application-dependent part of the embedded site has to be developed.
The embedded site is placed on the target platform as an executable object. It is of no importance whether it is statically linked and loaded with the main application or dynamically linked when other software processes are already running. This detail depends on the platform.
The scenario described in this section is based on the following assumptions:
The server is activated in a separate process.
The server process is implemented as relocatable object code statically linked with the main application.
The modules implementing reusable server mechanisms are placed in a static library and are linked with the code of the server process.
The presentation below concentrates on the development of the application-dependent software. This software implements the following elements of the embedded site:
The VFS tree, which determines the skeleton of the site structure. The VFS tree holds the references of all objects forming part of the site and provides the mechanism of search for server objects: pages, applets, images, and CGI scripts. The application developer constructs this part of the embedded site using the routines from the VFS package. This work can be tedious and complicated when done manually, but can easily be mechanized by using the configuration tool presented below.
Embedded look-and-feel objects, which are data structures representing embedded passive objects (page frames, applets, images). These data are usually implemented as octet arrays residing in memory regions referenced by the VFS tree nodes. In the usual development process these objects are designed and implemented with tools adapted to the object nature (HTML editors, image editors, Java development environments). The necessary step of transforming their standard representation formats (ASCII files, gif/jpeg files, byte code) into byte arrays loadable into device memories is to be supported by appropriate tools.
Page composition routines, which merge application-dependent data with static page frames in order to form complete HTML pages that incorporate the application status; these routines are to be programmed manually or generated from a user-friendly notation.
Routines representing active server objects (CGI scripts, dynamic pages) executed on requests received from the client. These routines serve to integrate application data into server pages. The routines usually reuse generic functions provided by the FDWS packages; their design is highly dependent on the application and only a manual development process is possible.
Script launching routine.
Application wrapper, which extracts useful information from the basic application of the hosting device.
Initialization routine, which assigns the appropriate values to the four pointer variables of the server interface.
Server process code, which calls the initialization routine and bootstraps the main server loop.
The mutual relationships among these elements as well as the relationship with the library of server
modules are presented in Figure 44.6. Their construction is described in detail, point by point, in the
following sections.
44.4.2 Implementation of VFS
The basis of the VFS is the data structure manipulated by the routines that look for the HTML objects referenced in client requests. The standard implementation of the site architecture assumes that this data structure is composed of two separate lookup trees, called repositories:
Passive object repository, holding the references of all HTML pages, images, and applets; the root of this tree is referenced by the repository root pointer.
Active object repository, holding the references to the routines that implement the CGI scripts; the root of this tree is referenced by the script routine pointer.
[Figure 44.6 diagram: the custom components of the site (script launcher, VFS, page composer, INIT, application wrapper, pages/images/applets, CGI scripts, servlets and dynamic pages) arranged around the application and the FDWS modules.]
FIGURE 44.6 Relationships of elements of embedded site. (From J. Szymanski, Proceedings of WFCS 2000, September 6-8, ISEP, Porto, Portugal, 2000, pp. 301-308. With permission. Copyright 2000 IEEE.)
The tree is built of three types of nodes:
Repository root. This is the unique entry point to each data structure. This object holds the list of references to the other elements of the tree: embedded directories and embedded files. One of the embedded files, called by default server page, plays a special role in the process of response composition. The tree root can also hold a reference to the list of authentication records, whose role and structure are described below.
Embedded directory node type. This type of node plays the role of the root of a subtree within the server structure. It holds the list of references to other embedded directories and/or embedded files. It can also hold a reference to the list of authentication records.
Embedded file node type. This node is a tree leaf that directly holds the reference to the data necessary to compose and send the requested object.
An example of the structure of the repository is presented in Figure 44.7.
The data types of the objects that form the structure of the VFS repositories are defined in the VFS package. Repository creation and tree growth are obtained by successive calls to the package's routines. One of the possible sequences of calls that implements the creation of a page repository like that in Figure 44.7 is as follows:
1. Creation of the tree root
2. Creation of the by-default page tree node
3. Append the default page to the tree root
[Figure 44.7 diagram: a repository root holding the by-default page and the directories public, images, and javadir.]
FIGURE 44.7 Example of VFS repository.
4. Creation of an embedded directory node named public
5. Creation of a series of embedded file nodes and insertion of the nodes into the directory
6. Appending of the directory node to the repository
7. Repetition of steps 4, 5, and 6 for the directories images and javadir
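The seven steps above can be sketched with a simplified node type (the struct and routine names are hypothetical, and the real VFS package routines differ; the by-default page is given an illustrative name):

```c
#include <stdlib.h>

/* Hypothetical node type modelling the VFS tree; the real package
 * distinguishes root, directory, and file nodes with richer data. */
typedef struct enode {
    const char   *name;
    int           is_dir;            /* directory (or root) vs file */
    struct enode *child;             /* first entry of a directory   */
    struct enode *sibling;           /* next entry at the same level */
} enode;

static enode *node_new(const char *name, int is_dir)
{
    enode *n = calloc(1, sizeof *n);
    if (n) {
        n->name = name;
        n->is_dir = is_dir;
    }
    return n;
}

static void node_append(enode *parent, enode *child)
{
    child->sibling = parent->child;  /* prepend, for simplicity */
    parent->child = child;
}

/* Build the repository of Figure 44.7 following steps 1 to 7. */
enode *build_example_repository(void)
{
    enode *root = node_new("/", 1);                   /* step 1 */
    node_append(root, node_new("index.html", 0));     /* steps 2 and 3 */
    const char *dirs[] = { "public", "images", "javadir" };
    for (int i = 0; i < 3; i++) {                     /* steps 4 to 7 */
        enode *dir = node_new(dirs[i], 1);
        node_append(dir, node_new("placeholder", 0)); /* step 5 */
        node_append(root, dir);                       /* step 6 */
    }
    return root;
}
```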
It is important to note that the procedure detailed above constructs only the containers, which must then be filled with references to the data that actually implement the embedded objects. These references should point to directly exploitable data (the byte arrays mentioned in the preceding sections), which are generated as the result of a separate operation, described in the section which follows.
The method of creating the active object repository is nearly identical to the one described above.
44.4.3 Implementation of Look-and-Feel Objects
The data structures corresponding to the repositories of the VFS enable the usual operations of a file management system: file creation, deletion, search, access to the data stored in a file-type node, and activation of the routine referenced by a script-type node. They do not directly contain any data or code; these must be created by separate operations. These data are implemented by the passive server objects contributing to the look-and-feel aspects of the embedded site. The objects are placed inside memory regions accessible to the server routines. They take two different forms:
Static HTML pages and HTML page templates are represented as character strings
Embedded images and embedded Java applets are stored as byte arrays
The difference in storage form is explained by the method of object processing in the response composition phase of server operation. HTML pages are composed of printable characters only and never contain a null character, which is used uniquely as the marker of the page (string) terminator. The same assumption holds neither for images in .gif and .jpeg formats nor for applets. These objects can (and do) contain nonprintable bytes, including the null character, and cannot be stored as character strings. Their storage format should follow the pattern of a byte array of known size.
From the external point of view, embedded pages and embedded images do not differ from regular (nonembedded) ones, and there is no reason for them not to be created with the tools which usually serve to edit them (Microsoft FrontPage, Netscape Composer, Microsoft Image Composer, etc.). Standard tools create standard storage formats, compatible with the file system of the hosting platform. This is the main problem in the creation of embedded sites, since the standard storage formats are not directly useful in the construction of the server custom component. The output files produced by the tools have to be transformed into modules that can be linked (statically or dynamically) with the code of the other server modules. An example of such a module, embedding an image in .gif format, is shown in Figure 44.8.
extern const unsigned char aautobull2_img[];
extern int aautobull2_img_length;
const unsigned char aautobull2_img[] = {
0x47, 0x49, 0x46, 0x38, 0x39, 0x61, 0x0c, 0x00, 0x0c, 0x00, 0xb3, 0xff, 0x00, 0xff, 0xff, 0x66
, 0xff, 0xff, 0x33, 0xff, 0xff, 0x00, 0xcc, 0xff, 0x00, 0xc0, 0xc0, 0xc0, 0x99, 0xff, 0x00, 0x99
, 0xcc, 0x00, 0x99, 0x99, 0x00, 0x99, 0x66, 0x00, 0x66, 0x99, 0x00, 0x66, 0x66, 0x00, 0x33, 0x66
, 0x00, 0x33, 0x33, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x21, 0xf9, 0x04
, 0x01, 0x00, 0x00, 0x04, 0x00, 0x2c, 0x00, 0x00, 0x00, 0x00, 0x0c, 0x00, 0x0c, 0x00, 0x00, 0x04
, 0x42, 0x90, 0xc8, 0x49, 0x6b, 0xbb, 0xb7, 0x92, 0x86, 0x42, 0x10, 0x47, 0x43, 0x35, 0x02, 0xe0
, 0x09, 0x83, 0x21, 0x6e, 0x88, 0x79, 0x0e, 0x83, 0x22, 0x92, 0x81, 0x39, 0x08, 0xc6, 0x91, 0xcc
, 0xde, 0x57, 0x14, 0xba, 0xdd, 0x46, 0x40, 0x84, 0x19, 0x14, 0x0b, 0xd9, 0xe6, 0x00, 0x1b, 0x1c
, 0x16, 0x50, 0xc6, 0xaa, 0x31, 0x28, 0x18, 0x12, 0x48, 0xe9, 0x48, 0xa1, 0x53, 0x68, 0x2d, 0x98
, 0x95, 0x24, 0x02, 0x00, 0x3b};
int aautobull2_img_length = 149;
/* */
FIGURE 44.8 Code snippet implementing an embedded .gif image. (From J. Szymanski, Proceedings of WFCS 2000, September 6-8, ISEP, Porto, Portugal, 2000, pp. 301-308. With permission. Copyright 2000 IEEE.)
extern char* transpassword_str;
extern int transpassword_str_length;
static const unsigned char transpassword_str_array[] = {
0x3c, 0x68, 0x74, 0x6d, 0x6c, 0x3e, 0x0a, 0x0a, 0x3c, 0x68, 0x65, 0x61, 0x64, 0x3e, 0x0a, 0x3c
, 0x74, 0x69, 0x74, 0x6c, 0x65, 0x3e, 0x50, 0x61, 0x73, 0x73, 0x77, 0x6f, 0x72, 0x64, 0x20, 0x45
, 0x2d, 0x2d, 0x3e, 0x3c, 0x2f, 0x66, 0x6f, 0x6e, 0x74, 0x3e, 0x3c, 0x2f, 0x62, 0x6f, 0x64, 0x79
, 0x3e, 0x0a, 0x3c, 0x2f, 0x68, 0x74, 0x6d, 0x6c, 0x3e, 0x0a, 0x00};
char* transpassword_str = (char*) (&transpassword_str_array);
int transpassword_str_length = 1546;
FIGURE 44.9 Code snippet representing an embedded HTML page.
The module contains two variables:
A byte array that holds the data normally placed within a disk file; the reference of this variable should be passed as a parameter in the call to the file creation routine.
An integer variable storing the length of the array; this variable is used by the routines that serve the object data over the network.
Nearly the same format can be used to store HTML pages, with the difference shown in Figure 44.9.
In this second module, the byte array is encapsulated within the module and only its reference, cast to a type compatible with the character string type, is exported. It can also be seen that a final null character is placed at the end of the byte array. This enables the array to be processed in exactly the same way as a character string.
Transformation of the standard storage formats into the modules shown above is done by simple programs
which read disk files and generate the appropriate modules automatically. The principle of their operation is
shown in Figure 44.10.
The memory regions corresponding to the files are reserved and filled in at build time of the server code
by the operation of the compiler and linker producing the object code of the appropriate modules. The same
process leads to the resolution of references to the regions held within the data structures of the repositories of the VFS.
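The converter programs themselves are not listed in this chapter; a minimal sketch of such a file-to-module generator, using illustrative names rather than the actual embedpage/embedbin sources, could look as follows:

```c
/* Sketch of an embedbin-style converter: reads a disk file and emits a
 * compilable C module containing the byte array and its length.
 * All names here are illustrative, not those of the FDWS tools. */
#include <stdio.h>

/* Write a C module for variable `varname` holding the bytes of stream
 * `in`; returns the number of data bytes written. */
long emit_module(FILE *in, FILE *out, const char *varname)
{
    int c;
    long len = 0;

    fprintf(out, "static const unsigned char %s_array[] = {\n", varname);
    while ((c = fgetc(in)) != EOF) {
        fprintf(out, "%s0x%02x", len ? ", " : "", (unsigned char)c);
        if (++len % 16 == 0)
            fprintf(out, "\n");
    }
    /* the terminating null lets HTML modules be handled as C strings */
    fprintf(out, "%s0x00};\n", len ? ", " : "");
    fprintf(out, "char* %s = (char*) %s_array;\n", varname, varname);
    fprintf(out, "int %s_length = %ld;\n", varname, len + 1);
    return len;
}
```

Run at build time over each site file, such a generator yields exactly the kind of module shown in Figure 44.9, ready to be compiled and linked into the server image.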
44.4.4 Implementation of Page Composition Routines
One of the specific features of web servers operating on diskless platforms concerns the method of serving
HTML objects. This problem is not so crucial in the case of servers placed on platforms equipped with
disks, where the basic service consists of copying page contents from a disk file to the communication link
(socket). A similar procedure can be used for the embedded platform, where an octet string or memory
Embedded Web Servers in Distributed Control Systems 44-15
FIGURE 44.10 Principle of transformation of passive objects into embedding modules. (The embedpage processor converts .html files, and the embedbin processor converts .gif, .jpeg, and .class files, into compilable .c modules.)
region takes the role of a mass storage file. However, this procedure is of lesser interest for the majority of
embedded applications. Static HTML pages have no appeal when used as front ends for control applications.
Genuinely useful pages should incorporate information produced by the back-end application.
Two methods are possible to implement such a requirement:
The HTML page is split into two objects: a static frame (page template) and application-dependent data;
in the process of page serving, the frame is merged with the data recovered dynamically from the
application.
The HTML page is generated on the fly by routines that compute page components one by one;
the attributes of the page components are determined by the parameters of the routines, which are
application-dependent data.
The first method is simpler to implement but, in the case of complicated interfaces, requires voluminous
page templates stored in large byte arrays. The second method reduces storage consumption but needs a
supplementary software basis composed of routines implementing page computations. In the case
of FDWS, these routines are provided by a package for online generation of HTML pages. Both methods are
briefly described in the next sections.
44.4.4.1 Template-Based Dynamic Pages
Template-based generation of dynamic HTML involves separating every page into two components: a
static page template and dynamic data. Page generation consists in merging both components before the
page is transmitted to the requesting client.
Page templates closely resemble regular pages and can be constructed using regular tools for HTML
page creation. In the embedded site structure, templates are stored in the same way as the embedded static
pages, that is, as byte arrays or character strings.
A page template contains all constant page elements, such as nonvarying text, images, applets, constant
hyperlinks, constant attributes of HTML tags, etc. Anything varying within the page is to be replaced by
a placeholder, a representative of the dynamic data.
Placeholders can replace virtually any element, be it text, numeric data, or a tag attribute. Their
implementation strongly depends on the method of serving the page. In the case of this software, placeholder
implementation is based on the C language conversion specifications, as used in the format strings of the
C functions of the printf family (printf, sprintf, fprintf, etc.). This means that any numeric integer
data are replaced by the specifications %d, %i, %o, %u, %x, or %X. Floating point numeric data are replaced
by the %f, %e, and %E specifications. Strings are replaced by the %s specification.
Examples
Consider an HTML page of the form shown below.
<HTML>
<HEAD>
<TITLE> Count of visitors </TITLE>
</HEAD>
<BODY>
Page of ALSTOM TECHNOLOGY was seen by 123456 visitors.
</BODY>
</HTML>
If the two highlighted elements are variable data, the page template should have the form shown below.
<HTML>
<HEAD>
<TITLE> Count of visitors </TITLE>
</HEAD>
<BODY>
Page of %s was seen by %d visitors.
</BODY>
</HTML>
The origin of such a representation lies in the implementation of the routine that merges the
template with the variable data. In the FDWS this function is implemented by a routine from the server
engine package. The routine signature is of the same type as that of fprintf or sprintf, namely:
int sockprintf(unsigned short socket_id, char* page_template, ...);
The piece of code that composes the page as above is shown in Figure 44.11. The page template is referenced
by the vis_page_str pointer and ssock is the unique identifier of the server socket. The method employed
char comp_name_str[32];
int vis_nr;
/* ... comp_name_str and vis_nr obtained from the application ... */
sockprintf(ssock, vis_page_str, comp_name_str, vis_nr);
FIGURE 44.11 Code fragment which implements the example of embedded page generation.
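The FDWS implementation of sockprintf is not reproduced in this chapter; a plausible sketch, assuming a fixed-size page buffer and splitting out the formatting step so it can be exercised without a live socket, is the following:

```c
/* Sketch of the template-merging step behind a sockprintf-style
 * routine. This is illustrative code, not the FDWS source: it merges
 * a page template with dynamic data via vsnprintf; sockprintf itself
 * would then write the resulting buffer to the socket. */
#include <stdarg.h>
#include <stdio.h>

/* Format the template and its variable data into buf; returns the
 * number of characters produced, or -1 if the page does not fit. */
int format_page(char *buf, size_t size, const char *page_template, ...)
{
    va_list ap;
    int n;

    va_start(ap, page_template);
    n = vsnprintf(buf, size, page_template, ap);
    va_end(ap);
    return (n < 0 || (size_t)n >= size) ? -1 : n;
}
```

sockprintf would call such a formatting step and then write the buffer to the socket identified by socket_id, for example with a POSIX send() on platforms that provide one.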
requires that any dynamic data to be merged with page templates be transformed into a signed decimal,
a signed floating point number, or a string.
The proposed server interface provides a unique entry point to the page composition code of all server
pages via the pointer send_page_routine. It is important to observe that the page composition code sequences
for every page within the server should be accessible via this entry point. This constraint can be fulfilled
only if the customized component of the embedded server contains a routine which intercepts all requests for
HTML object services and dispatches them to specialized pieces of code. The recommended solution is
presented in the next section.
44.4.4.2 Dynamic Pages Generated on the Fly
In dynamic page generation the static page frame is not employed and the page is produced as the result
of a series of routine calls that generate code strings representing successive components (HTML tags) of
the page. Routines can be called conditionally and can have parameters dependent on application data.
The data strings produced by the routines are either directly written to the socket or stored in a buffer that
is finally sent to the socket.
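The tag-generating routines of the FDWS package are not reproduced here; a minimal sketch of the approach, with illustrative names, could look as follows:

```c
/* Sketch of on-the-fly page generation: each routine appends one HTML
 * component to an output buffer, which is finally flushed to the
 * socket. The names are illustrative, not those of the FDWS package. */
#include <stdio.h>
#include <string.h>

typedef struct {
    char data[4096];
    size_t used;
} page_buf;

/* Append a string to the page buffer, silently dropping it on overflow. */
void page_append(page_buf *p, const char *s)
{
    size_t n = strlen(s);
    if (p->used + n < sizeof p->data) {
        memcpy(p->data + p->used, s, n);
        p->used += n;
        p->data[p->used] = '\0';
    }
}

/* Emit one table row for a channel; the attributes depend on
 * application data passed as parameters, as described in the text. */
void emit_channel_row(page_buf *p, int channel, const char *status)
{
    char row[128];
    snprintf(row, sizeof row,
             "<TR><TD>%d</TD><TD>%s</TD></TR>\n", channel, status);
    page_append(p, row);
}
```

Such routines can be called conditionally, for example emitting a row only for active channels, before the accumulated buffer is sent to the socket.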
44.4.5 Implementation of Script Activation Routines
The structure of the VFS imposed by the standard architecture of the embedded server separates the objects
placed at the user's disposal into two distinct collections: passive objects (pages, applets, images)
and active objects (scripts). The standard application interface separates the configuration of passive object
composition and transmission from active object servicing. Active object servicing is
done by a procedure which should be user-provided and inserted into the server's structure via its reference.
As for the page composition routine, this entry point should be unique for the activation of every script solicited
by clients' requests.
44.4.6 Implementation of Application Wrappers
There is no general method proposed for this part of the customized component. Only vague recommendations,
deduced from the mission of the basic application, can be provided. The wrapper modules serve
as an interface adapter for the data transmitted between the server objects and application objects.
Requirements imposed on these modules from the server side follow the method of data insertion
into dynamic pages. Any useful information extracted from the application should be transformed into
scalar data having one of the basic types usable with the page templates, that is, integer numbers, floating
point numbers, and character strings. Complete requirements imposed on the interface software from the
application side are impossible to determine due to the diversity of application types.
Some basic principles can, however, be identified. Data sent by the client are transported by the elements
of the CGI interface included in the body of the POST service. These elements are normally constructed as a
series of (name, value) pairs. These data are automatically recovered from the client request PDU and
stored within a special memory region accessible via the modules from the server engine package. The
modules provide a set of functions that allow the programmer to recover and handle the requested data.
The modules' interface is built on one data type, represented by the code in Figure 44.12, and two
functions which provide the application program with read and write access to the memory region.
As can be seen, data in the region are identified by their alphanumeric identifiers recovered
from the POST service request PDU.
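The access functions of the server engine package are not listed in the chapter; a sketch of how the recovered (name, value) pairs might be stored and queried, reusing the tdb_result type of Figure 44.12 but with otherwise hypothetical names, is shown below:

```c
/* Sketch of CGI data recovery: POST bodies carry a series of
 * name=value pairs; the pairs are stored in a memory region and
 * looked up by their alphanumeric identifiers. The region layout and
 * function names are illustrative; only tdb_result is from the text. */
#include <string.h>
#include <stdlib.h>

typedef struct tdbtag_res {
    int result;                 /* nonzero if the name was found */
    union {
        char *string;
        int integer;
        float real;
    } value;
} tdb_result;

#define MAX_PAIRS 32

static struct { char name[32]; char val[64]; } region[MAX_PAIRS];
static int region_count;

/* Store one pair recovered from the request PDU. */
void region_put(const char *name, const char *val)
{
    if (region_count < MAX_PAIRS) {
        strncpy(region[region_count].name, name, 31);
        strncpy(region[region_count].val, val, 63);
        region_count++;
    }
}

/* Recover an integer value by its alphanumeric identifier. */
tdb_result region_get_int(const char *name)
{
    tdb_result r = {0, {0}};
    int i;
    for (i = 0; i < region_count; i++)
        if (strcmp(region[i].name, name) == 0) {
            r.result = 1;
            r.value.integer = atoi(region[i].val);
        }
    return r;
}
```

A script routine would call such a lookup with the field name used in the form, test the result flag, and act on the recovered value.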
The implementation of the five components of the embedded site described above provides all operations
necessary to put the front-end tier of the application into action. Now the operations should be activated in
the proper sequence: initializations first, followed by activation of the main server loop.
The initialization operation should set four variables exported by the server to values that
reference the passive object repository root, the active object repository root, the page composition routine, and
the script launching routine. An example of such an initialization routine is presented in Figure 44.13.
typedef struct tdbtag_res {
    int result;
    union {
        char* string;
        int integer;
        float real;
    } value;
} tdb_result;
FIGURE 44.12 Data type supporting the interface with client-provided PDUs.
void init_VFS(void) {
    server_root = db_page_root_gen();
    send_page_routine = send_page;
    cgi_bin = db_cgi_bin_gen();
    send_script_routine = send_script;
}
FIGURE 44.13 The interface of the initialization routine.
int embedded_server_launcher(unsigned short service)
{
init_VFS();
init_application_wrapper();
return server_boot(service);
}
FIGURE 44.14 Server launching routine.
This piece of code, a series of four simple assignments, assumes that the application programmer has
provided four routines: db_page_root_gen, which generates the passive object repository; db_cgi_bin_gen,
which generates the active object repository; send_page, which implements the page composition method;
and send_script, which implements the method of script launching. The two functions db_page_root_gen()
and db_cgi_bin_gen(), which generate the VFS repositories, are called and executed by the initialization
routine. The result of their execution is the immediate creation of the VFS tree structures and the assignment
of their references to the pointers server_root and cgi_bin. This is not the case for the page composition
(send_page) and script activation (send_script) routines, for which only references are assigned to the
server API pointers. In the presented solution, it is assumed that these four functions are exported from
the modules that implement them.
Now that the initialization routine is implemented, the code of the server process can be proposed. The
routine in Figure 44.14 is an example of such code. It is a simple piece of code which performs two
initializations, that of the server structures via the call to the init_VFS() routine and that of the application
interface via the init_application_wrapper() routine, before launching the server loop via the call to the
server_boot(service) routine. The only parameter of this routine specifies the number of the TCP/IP port
on which the server expects to receive clients' requests.
44.4.7 Putting Pieces Together
The preceding sections provide all the necessary details concerning the creation of the custom elements of the
content of an embedded server. It is important to show how the steps of this development process are
sequenced. The sequence of steps is presented by the graph shown in Figure 44.15. The process presented
FIGURE 44.15 Development process for an embedded server application. (Look-and-feel elements, the .html and .htm pages produced with a text editor, are processed by embedpage; the .gif, .jpeg, and .class objects by embedbin; and the VFS configuration and initialization description, together with the scripts and application wrapper linking the site to the application, by compilVFS. The resulting .c modules are translated by the C/C++ compiler into .o files and combined by the linker with the HTTPreuse server driver into a loadable object file joined to the application.)
below shows how to obtain the final result, a loadable object code file, from the primitive elements,
which are grouped in three categories:
1. Collection of passive elements (pages, images, applets).
2. VFS creation and management: page composition and script activation.
3. Interface with the organic application, initialization code, and CGI script routines.
These categories of application elements are developed with the means appropriate to the nature
of each element. This means that:
HTML pages are created with an HTML editor.
GIF and JPEG objects are developed with image editing tools and devices.
Java applets and beans are developed with the standard tools included in the JDK.
C/C++ code implementing the application wrapper software and the routines playing the role of
CGI scripts should be developed with the suite of tools (editor/compiler/debugger/loader) used
for the hosting platform.
In order to create the integrated component loadable to the hosting platform, the site components
should first be transformed from their initial storage format into a common format, a
collection of compilable C-coded modules. The preceding section proposed a method of transforming
passive components into C-coded modules by means of specialized processors. There is no problem for
the modules originally coded in C, that is, for the initialization code, script routines, and application wrapper.
They should be designed and coded in accordance with the usual principles of efficient implementation.
The method of development of the VFS component poses the biggest challenge in the development of the
embedded site. VFS design is straightforward and relatively easy. Its manual implementation also seems
to be a simple chain of repetitive actions. It can be directly deduced from the graphical site representation
(like the one in Figure 44.7). The implementation process is so regular that it can be easily automated;
FIGURE 44.16 Automation cell with a HART/FIP gateway and its configuration console.
that is, the repository tree can be transformed into a sequence of procedure calls by a processor, a VFS
compiler. This compiler transforms the textual description of the VFS repository structures into the appropriate
C-coded modules that implement all four functions necessary to initialize the server API. A more
detailed description of the compiler's operation is given in the Appendix.
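The code such a compiler emits is not shown until the Appendix; as a rough sketch of the idea, the repository tree can be built by a generated sequence of procedure calls. The node layout and the helpers vfs_new, vfs_attach, and vfs_find below are hypothetical, not the FDWS API:

```c
/* Sketch of what a VFS compiler might emit: the repository tree is
 * built by a sequence of procedure calls. All names and the node
 * layout are illustrative, not taken from the FDWS sources. */
#include <stdlib.h>
#include <string.h>

typedef struct vfs_node {
    const char *name;
    const char *data;          /* NULL for directories */
    struct vfs_node *child;    /* first entry of a directory */
    struct vfs_node *sibling;  /* next entry in the same directory */
} vfs_node;

static vfs_node *vfs_new(const char *name, const char *data)
{
    vfs_node *n = calloc(1, sizeof *n);
    n->name = name;
    n->data = data;
    return n;
}

static void vfs_attach(vfs_node *dir, vfs_node *entry)
{
    entry->sibling = dir->child;
    dir->child = entry;
}

/* The kind of call sequence a VFS compiler could generate from a
 * textual repository description. */
vfs_node *db_page_root_gen(void)
{
    vfs_node *root = vfs_new("/", NULL);
    vfs_node *images = vfs_new("images", NULL);
    vfs_attach(root, images);
    vfs_attach(root, vfs_new("index.htm", "<html>...</html>"));
    vfs_attach(images, vfs_new("logo.gif", "GIF89a"));
    return root;
}

/* Look an entry up by name within one directory. */
vfs_node *vfs_find(vfs_node *dir, const char *name)
{
    vfs_node *n;
    for (n = dir->child; n; n = n->sibling)
        if (strcmp(n->name, name) == 0)
            return n;
    return NULL;
}
```

The regularity of such call sequences is precisely what makes their automatic generation from a textual description straightforward.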
44.5 Example of Site Implementation in a
HART Protocol Gateway
In order to illustrate a real-life application, a site embedded in an industrial device is presented. The chosen
device is an instrumentation gateway for process control cells. Its role consists in linking the network of
sensors and actuators with the process computers and PLCs. To do so, the gateway collects the information
from the instruments connected to it via the HART instrumentation protocol and transfers it to automation
cells via the WorldFIP protocol. Each gateway can provide connections for up to eight HART channels. In this
example one of two channel compositions is possible: eight input channels, or six input channels and
two output channels.
Operation of the gateway is controlled by a collection of parameters that tune its performance to the
needs of a given installation. Through the set of parameters one can set up the characteristics of the HART and
WorldFIP protocols, modify certain translation parameters, gateway operation modes, etc. Each HART
channel can also be tuned to the type of HART transmitter connected to it.
All these tuning operations are usually done by a special-purpose device, the configuration console (see the
schematic in Figure 44.16). The console communicates with the gateway via a proprietary protocol based on
UDP/IP transport.
The idea of the application described below is to replace the special-purpose tuning console with a
standard web browser and implement all tuning functions on the basis of the three-tier architecture described
in the introductory section of this account. All the functions related to the man-machine interface and to
the tuning operations will be implemented by the front-end tier based on the embedded HTTP server. The
architecture of such an application would then be transformed into the one shown in Figure 44.17.
The server embedded in the gateway, together with a standard web browser, should replace the operation
of the tuning console. For this reason, it should give the user access to all necessary functions implemented
by the special-purpose configuration console. The set of functions is described below.
The console gives access to the parameters of the gateway via an appropriate screen (Table 44.2). The
front-end server placed within the gateway should provide access to the same set of parameters, respecting
their read only and read/write modes.
The console also provides the means of monitoring the status of all HART channels in real time
by displaying the nature of the connected transmitter (HART type/non-HART type), its signaling
FIGURE 44.17 Gateway parameter tuning functions realized by the three-tier architecture.
TABLE 44.2 Access to Gateway Parameters
Type of parameters Parameter name Access
Identification Tag Read/write
Product name Read only
Manufacturer name Read only
Software version Read only
Hardware properties Power supply type Read only
I/O mode Read only
WorldFIP medium type Read only
WorldFIP medium mode Read only
WorldFIP bit rate Read only
WorldFIP protocol Refreshment time Read/write
Promptness Read/write
HART protocol Timeout Read/write
No. of retries Read/write
Processing parameters Measure format Read/write
Operation mode Read/write
Antialiasing filter Read/write
Configuration version Read/write
Configuration revision Read/write
(type, manufacturer name), and its status (active/inactive). The same function is to be placed in the
front-end server of the protocol gateway. The gateway provides the possibility of parameter tuning of
every active HART channel. The appropriate screen gives access to the set of channel parameters shown in
Table 44.3. Access to all these devices should be provided by the front-end server of the gateway.
The operation of the ALSPA P80H console is in principle oriented toward parameter-tuning functions. In
some cases, however, it offers the possibility of direct monitoring of process variables by giving access to
the transmitter primary variable. The same function is required from the server embedded in the protocol
gateway.
44.5.1 Structure of the Site Embedded in the Protocol Gateway
The architecture of the server embedded in the gateway is strongly influenced by the functional requirements
presented above. It is composed of a collection of HTML pages, corresponding to the console
screens, which are organized in five directories. Three of the five directories group the pages according to
a functional criterion: there is one directory (di80_parameters) holding the pages provided for access
TABLE 44.3 Access to Channel Parameters
Type of parameters Parameter name Access
Identification Manufacturer name Read only
Transmitter model Read only
Transmitter tag Read/write
Descriptor Read/write
HART unique identifier Read only
Cell limits Upper cell limit Read only
Lower cell limit Read only
Minimum span Read only
Transmitter configuration Damping factor Read/write
Transfer function Read/write
Primary variable units Read/write
Lower measurement range Read/write
Upper measurement range Read/write
FIGURE 44.18 Passive object repository of the embedded server. (The repository root holds the Home Page and five directories: images with 16 virtual files (.gif and .jpeg), _fpclass with 3 virtual files (.class), transmitters with 36 virtual files (.htm), measures with 8 virtual files (.htm), and di80_parameters with 6 virtual files (.htm).)
to the gateway parameters, one (transmitters) grouping the pages accessing the active channel list together
with the channel-parameter-tuning pages, and one (measures) that groups the pages which monitor channel
measures. The two remaining directories group the pages according to a structural criterion: one (images)
is provided to store all embedded images included within the pages, the other (_fpclass) contains all
embedded Java applets.
This architecture is presented by the graph of the passive object directory shown in Figure 44.18.
The VFS of the server also contains another repository, which contains three scripting routines. These
script files are placed directly under the repository root (see Figure 44.19).
In total, the server contains 70 embedded files collected within 5 directories.
FIGURE 44.19 CGI (script) repository of the embedded server. (Three virtual files (.cgi) placed directly under the repository root.)
FIGURE 44.20 Home Page of the embedded server.
44.5.2 Detailed Implementation of Principal Functions
The sections below present all the important pages that give the user access to the functions implemented by the
front-end server. All the pages were developed using the Microsoft FrontPage HTML editor and incorporate
graphical elements provided by this tool (page background, fonts, banners, buttons, etc.).
44.5.3 Access to Site Home Page
Access to the site is obtained via the default page presented in Figure 44.20. This page has a rather
informative character; it displays a photo of the gateway and the list of the principal functions implemented
by the embedded server. Direct access to the functions can be obtained via the three buttons
placed above the photo.
FIGURE 44.21 Page giving access to the parameters of the gateway in read only mode.
44.5.4 Access to Parameters of the Gateway
The first and the third buttons of the welcome page of the server give the user the possibility of reading
parameters (Parameters button) or modifying parameters (Set Parameters button). Both buttons link the
default page with two pages residing in the directory transmitters.
The page presented in Figure 44.21 gives access to the parameters of the gateway. It is implemented as
a frame of three panes, that is, its implementation requires four embedded files. The upper pane identifies
the screen via the large title banner realized as an animated gif image. The left lower pane contains a menu
composed of five hyperlinks that provide convenient access to the five groups of gateway parameters. The
parameters, distributed among five tables, are displayed in the right lower pane. This pane is too small to
display all five tables at the same time, which explains the need for the menu pane: it avoids using scrollbars
for access to the parameter tables.
Parameter modification is implemented by the page presented in Figure 44.22. This page is separated
from the previously described page for security reasons. It is to be used when one wants to change some
parameter values. The user then enters into a transaction with the server resident components, which should
finish with the modification of the chosen parameter or parameter set.
The set of parameters displayed in this page is narrower than the one displayed in the page previously
described. This is normal, since only modifiable parameters are displayed on the screen.
The page editing environment provides the facility to add to the page some form-element controls
coded automatically in JavaScript. These cover the obligation to fill in certain fields (password, see
Section 44.5.5), keeping an entered value within a given interval (e.g., the Retries field value should be
kept between 1 and 6), and respecting the format of the information input to certain fields (e.g., only
figures are accepted in the timeout, retries, refreshment, etc. fields).
FIGURE 44.22 Gateway parameter setting form.
The contents of the form elements filled in by a user wishing to modify the corresponding gateway
parameters are sent to the server using the POST service of the HTTP protocol when the SUBMIT button of the
page is pressed. The modification of the parameters and the update of the browser's screen with the new
values are implemented by a specialized routine invoked via one of the CGI scripts.
44.5.5 Access to Active Channel List
The middle button in the Home Page of the embedded site gives access to the page that displays the list
of all active channels connected to the gateway at a given time. The status of the channels is described in
tabular form, as shown in Figure 44.23. Each channel corresponds to a row in the table. Each row is
composed of four elements that indicate the high-level description of the connected channel.
The position of the channel descriptor in the table corresponds to the channel number. The third column
of the table indicates the transmitter status. The remaining columns of each line are filled in only if the third
one is set to ACTIVE. In such a case, the first column contains the name of the transmitter manufacturer
(if recognized), the second the device type, and the fourth the unique HART identifier (normalized
by the HART protocol description). This identifier serves as the link to the page describing the HART channel
parameters.
In the case when the status of the channel is recognized as non-HART (analog 4/20 mA current loop)
or NO CURRENT (current loop not connected), the three significant columns of such a row are empty.
44.5.6 Access to Channel Parameters
This page provides the possibility of displaying the parameters of an active HART channel. The page that
interfaces the user with this facility is presented in Figure 44.24. The page is organized in the form of a
frame composed of three panes: a heading pane with the title banner, a menu pane giving access to the groups of
parameters, and a parameter pane accessed either via the menus or via the scroll bar of the browser's window.
FIGURE 44.23 Table displaying channel status.
FIGURE 44.24 Page giving access to HART channel parameters.
The page externally resembles the one that gives access to the gateway parameters in read only mode
(three panes, one of them accessed via menus placed in another). Functionally, there are two fundamental
differences between them.
First, the channel parameter page gives access to all channel parameters, respecting their mode. Read
only parameters are displayed as plain text sections, while read/write parameters correspond to active
form entries.
Second, the page design ensures that only potentially modifiable parameter values can be sent back to
the server when the SUBMIT button is pushed. The process of parameter update is under the control of a
CGI (script) routine.
44.5.7 Monitoring of Principal Channel Measure
All the above-described functions are fully oriented toward parameter tuning. The services offered concern
either the global properties of the HART/FIP converter or act on the characteristics of an individual channel.
The quality of service is comparable to that offered by the original configuration console.
The function of recovery of the primary measure of a HART channel, described in this section, differs from
the others in both the nature and the quality of service offered. Functionally, it is no longer a parameter-tuning
operation. Data handled in this operation do not concern the status of the channel itself but reflect the
evolution of the phenomenon measured by the channel transmitter. For this reason, performing this operation
from time to time, at irregular points in time, and displaying numerical values on the screen does not provide
much valuable information. Unfortunately, this is the only mode in which this function can be exploited
via an ordinary tuning console.
The three-tier architecture enables a totally different implementation. The monitoring function is
implemented by an HTML page accessible from the channel-tuning page via the link primary measure
(see Figure 44.24). This page contains a Java applet that does more than display a numerical value. Its
operation involves periodically fetching channel measures and displaying them in the form of a trend curve
(Figure 44.25). Each curve point corresponds to a complete transaction between the applet and the server.
The transaction is initiated by an HTTP request that activates an embedded script routine (servlet), which
elaborates the primary measure of the channel by activating an appropriate HART command. Measure
values obtained by the execution of this command are transported via an HTTP response (tunneled in the HTTP
PDU). This solution ensures that the communication remains operational even when the
browser machine executing the applet is connected outside of the system's security barrier.
44.5.8 Access Control and Authentication
HTTP-based systems are in principle open and accessible to any client that knows the URL of the server.
This fact makes the system prone to unauthorized accesses and requires the implementation of access control
functions.
In the case of this application, the protection is implemented in two ways:
A natural authentication mechanism exploiting a standard HTTP feature; on the client side this
mechanism is built into any standard web browser.
Supplementary protection by password entry, which is built into forms and handled by a
specialized script.
The first mechanism is based on the access control procedure standard for the HTTP protocol. This
procedure is based on a so-called authentication challenge transaction. According to the standards of the HTTP
1.0 protocol, any Universal Resource Locator (URL) can point to a server resource that is accessible to a
restricted set of users, each identified by a name and a password. When the browser accesses such
a resource for the first time since its activation, the server produces the response initiating the challenge
(declaration of unauthorized access, see [2]). The browser reacts to this response by displaying a dialog
box as in Figure 44.26.
FIGURE 44.25 Applet monitoring primary variable in a HART channel.
FIGURE 44.26 Authentication box for the French version of Internet Explorer.
The user is expected to fill in both text zones of the box and press the OK button. This operation
repeats the previously executed request with a PDU option that presents the user's credentials to the
server engine. The credentials contain the pair of user name and user password, encoded according to the
algorithm corresponding to the authentication mode. In the most popular authentication mode, called
Basic Authentication, the credentials are coded according to so-called base64 encoding. If, on the server
side, the pair user name:password corresponds to the contents of one of the authentication records
attached to the resource, the response PDU will contain the resource contents. In the opposite case,
the authentication process fails, access to the resource is denied, and the authentication transaction is
reiterated.
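Base64 is a standard encoding, not specific to this application; a compact sketch of the credential coding a browser performs (generic code, not taken from the FDWS sources) is shown below:

```c
/* Sketch of Basic Authentication credential coding: the pair
 * "user:password" is base64-encoded and carried in the Authorization
 * header of the repeated request. Standard base64, generic code. */
#include <string.h>

static const char b64[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

/* Encode len bytes of src into dst; dst must hold 4*ceil(len/3)+1. */
void base64_encode(const unsigned char *src, size_t len, char *dst)
{
    size_t i;
    for (i = 0; i + 2 < len; i += 3) {        /* whole 3-byte groups */
        *dst++ = b64[src[i] >> 2];
        *dst++ = b64[((src[i] & 0x03) << 4) | (src[i+1] >> 4)];
        *dst++ = b64[((src[i+1] & 0x0f) << 2) | (src[i+2] >> 6)];
        *dst++ = b64[src[i+2] & 0x3f];
    }
    if (i < len) {                            /* 1 or 2 trailing bytes */
        *dst++ = b64[src[i] >> 2];
        if (i + 1 < len) {
            *dst++ = b64[((src[i] & 0x03) << 4) | (src[i+1] >> 4)];
            *dst++ = b64[(src[i+1] & 0x0f) << 2];
        } else {
            *dst++ = b64[(src[i] & 0x03) << 4];
            *dst++ = '=';
        }
        *dst++ = '=';
    }
    *dst = '\0';
}
```

The simplicity of this reversible coding is precisely why Basic Authentication offers no confidentiality: anyone observing the request can decode the credentials.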
FIGURE 44.27 Message from the script that forces the user to fill in the password box.
FIGURE 44.28 Page signaling bad authentication password.
It is worth noting that the authentication transaction for a given subtree is done only once per client
session. This means that for a protected subtree of a VFS repository, the challenge will take place only during
the first request. Any request that follows will automatically contain the authentication information.
Basic Authentication mode is not a reliable protection against unauthorized accesses, since the coding
scheme of the credentials is simplistic and can be easily overcome. There are more powerful access control
schemes, such as the one called Digest Authentication, in which the decoding of credentials is more
complex and which provides a higher level of security against undesirable intrusions.
In any case, for protected access, an authentication procedure based on HTTP standard features is
not restrictive enough, since once identified, the client station can operate without further authentication
while accessing a given URL. If the station in question passes under the control of an unauthorized user, the
server will still answer the client's requests positively, since the credentials are memorized and kept ready for
each subsequent URL access until the end of the client's operation. To avoid any problem with this mode of
authentication, another model, operating on an authentication-per-request basis, is to be used.
This is implemented by insertion into some pages (forms) a supplementary text zone of password
type, which requires to be lled each time the form is submitted. The password is veried at each activation
of associated script. Form page should be edited in a manner that constrains the submission of the form
on sending the password. This is frequently done by a page embedded script that blocks the submission
process when the password is not provided and prompts the user by a warning message (see Figure 44.27).
Submission of the page with the wrong password provokes the contents of the form to be rejected by the
server which sends back the refusal page, as in Figure 44.28.
44.5.9 Application Wrapper
All the components described above contribute to the implementation of the user interface to the organic application of the protocol gateway. They rely on the data provided through the basic application interface but
FIGURE 44.29 Part of the application wrapper providing access to the gateway parameters. [Figure: a record of fields (tag, descriptor, HART timeout, HART retries, filter constant, operation mode, version, revision) linking the basic application, which accesses it globally via di80_get_parameters and di80_set_parameters, with the server pages and scripts, which access individual items via routines such as get_manufacturer, set_manufacturer, get_io_mode, and set_io_mode.]
they impose some requirements on the formats of these data:
Access to the gateway parameters (to some of them in read-only mode) on an individual basis.
Access to the list of active HART channels, which should be updated before being served via an HTML page.
Access to channel parameters, in reading and in writing, on an individual basis.
It is important that the values read from the application interface for all three types of data listed above have either the form of scalar integer values, scalar floating point values, or zero-terminated character strings.
The original interface to the gateway API does not fulfill these requirements. In its original version it provides the following functions:
Global access to all parameters of the gateway, in reading as well as in writing (access routines return packets of coded values in read mode and accept only records of coded values in write mode).
Global access to channel status data; it returns a packet of coded data in read mode.
Low-level mode of access to HART channels via blocks of octets specifically coded according to HART protocol standards.
There is definitely a need for a supplementary adaptor module that transforms the low-level data produced by the application tier into structured formats adapted to the mode of operation of the server. This module, the application wrapper, is split into two parts:
Part providing access to the parameters, both in reading and in writing.
Part managing access to channels; this part groups the function of displaying the channel status table and the function of accessing channel-specific data via the HART protocol.
The part providing access to the gateway parameters is built around the data structure representing the gateway parameters as a persistent record.
The record is updated either by the server scripts modifying individual fields or by the call to the API routines. The server-side routines manipulate data as individual scalar values and for this reason need access (both in reading and in writing) to the individual items of the record. The application side operates on the basis of global access to the data, that is, the gateway parameters are all set and read at once, by the activation of one of two interface routines (see Figure 44.29).
The second part of the application wrapper provides access to the eight HART channels connected to the protocol gateway. This part is structured around a table of eight records that represent the eight potentially active channels. Each record groups the parameters of an individual HART channel (see Figure 44.30).
FIGURE 44.30 Access to HART channel parameters. [Figure: a table of eight channel records (Channel 1 to Channel 8, each with a HART channel buffer) linking the basic application (send_sensor_command, di80_get_instrumentlist) with the server pages and scripts (get_transmitters, get_sensor_tag, set_sensor_tag_and_desc).]
FIGURE 44.31 Summary of the configuration of the site embedded in the protocol gateway. [Figure: the Home Page linked to the directories transmitters, di80_params, measures, images, and _fpclass.]
In general, the update of all parameters in this table is obtained via service requests sent to HART transmitters via the eight channels of the gateway. From the user's point of view, not all the parameters are accessible by the same means. Parameters that define channel status and transmitter identity are obtained by one global command that updates one part of each record of the whole table simultaneously. Other parameters are grouped into collections that correspond to one aspect of transmitter operation (cell configuration, primary measure characteristics, measure units, etc.). Access to each parameter collection is ruled by one command in writing and one command in reading. Collections are not disjoint, and for this reason a parameter can be accessed by different commands.
44.6 Architecture Summary and Test Case Description for the
Embedded Server
44.6.1 Embedded Site Architecture
The overview of the configuration of the server embedded in the protocol gateway is presented in Figure 44.31.
The diagram presented in this figure shows the functional relationships among the different site components.
The components are grouped in five directories, as shown above.
The entry point of the embedded web of server pages is the HTML page named in the diagram above Home Page. This page contains the hypertext links to the objects placed in the directories transmitters and di80_params, which represent the web domains responsible for browsing, respectively, the HART channels and the gateway parameters. Server objects placed in the directory measures are referenced by the links embedded in the objects of the directory transmitters and are in charge of the graphical representation of channel measures.
Objects placed in the directory images (embedded images) and in the directory _fpclass (embedded applets) have different relationships with respect to the other site objects. They are incorporated into the server pages rather than linked to them and, from the functional point of view, play an auxiliary role in the site operation.
The contents of all five directories are presented below.
44.6.1.1 Directory Transmitters
This directory contains the set of embedded HTML pages that enable browsing of the parameters of active HART transmitters. The natural entry point to this realm of the embedded site is the page transmitter_list.htm, which represents the status of the eight HART channels that can potentially be connected to eight HART transmitters. The interface to each potentially active channel is implemented by a collection of four pages. Potentially, there are eight groups of pages, one per channel, but only pages corresponding to active channels can be displayed.
The group of pages accessing the parameters of a channel is organized according to the following pattern (see Figure 44.32):
Top-level channel front page of frame type incorporating three component HTML pages (HART_sensor0.htm to HART_sensor7.htm, one per channel):
(a) Banner page set.htm (shared by all channels),
(b) Menu page (menu0.htm to menu7.htm, one per channel)
(c) Parameter browsing form (sensor0.htm to sensor7.htm, one per channel)
Password error signaling page (password0.htm to password7.htm, one per channel)
Pages contain links to other directories (see Figure 44.33) and incorporate elements from other directories. There are no direct links from one channel browsing page set to other channels. All links should pass via the transmitter_list.htm page.
44.6.1.2 Directory di80_params
This directory contains the set of embedded HTML pages that enable browsing of the parameters of the gateway. The pages are organized according to the diagram shown in Figure 44.33. The pages provide access to two functions:
Displaying all gateway parameters in read-only mode
Modifying some of the parameters
The first function is accessible via the collection of four pages:
Top-level, frame type page get_di80_params.htm that wraps three other pages located in the frame panes.
Page upper_page.htm containing banners and links to other domains of the site.
Menu page left_page.htm that supports direct selection of parameter groups; this page contains direct links to each of the five tables that group the gateway parameters.
Page tables.htm that displays the actual values of the parameters.
FIGURE 44.32 Overview of pages placed in the directory transmitters. [Figure: transmitter_list.htm, reached from the Home Page, links to one group of pages per channel; the Channel 0 group comprises HART_sensor0.htm framing set.htm, menu0.htm, and sensor0.htm, together with password_0.htm and measure0.htm, and likewise up to the Channel 7 group (HART_sensor7.htm, menu7.htm, sensor7.htm, password_7.htm, measure7.htm); a further link leads to di80_params.]
FIGURE 44.33 Overview of pages placed in the directory di80_params. [Figure: get_di80_params.htm frames upper_page.htm, left_page.htm, and tables.htm, with links to the Home Page, set_di80.htm, transmitter_list.htm, and password.htm.]
FIGURE 44.34 Pages organized in the directory measures. [Figure: measure0.htm through measure7.htm, each linked to its corresponding HART_sensor0.htm through HART_sensor7.htm page and each wrapping the HART Trend.class applet.]
The second function is realized by two pages:
Page set_di80.htm displays all modifiable parameters of the gateway. The page is organized as a parameter browsing form which, when submitted, activates a script routine in charge of updating the parameters.
Page password.htm signals a password error in the submission of page set_di80.htm.
Each of the two functions is reached by a separate entry point. It is possible to leave the functions smoothly via the links to the Home Page of the server and to the transmitter list page.
44.6.1.3 Directory Measures
The directory measures groups the eight pages that correspond to the function of monitoring the values of the primary measure for every HART channel. The pages have no other functions than wrapping the appropriate applets and providing exit links back to the transmitter parameters page. All eight pages are independent; there is no link among them, each is reached via a separate entry point, and each has its own exit link (see Figure 44.34).
44.6.1.4 Directory _fpclass
The directory _fpclass is the server domain whose name and existence are inherited from the server design pattern suggested by the site development tool. It contains three applets: two of them, proposed by the tool, implement active buttons that link pages.
The third applet implements the display of signal trends and is used to monitor the primary measures recovered from active HART channels. The design of this applet is nontrivial since it actively samples channel measures by activating a script routine on the server side. The routine is in charge of getting the channel measure via an appropriate HART command and of transferring it to the applet wrapped in an HTTP response PDU. Thus the data exchange between the applet and the server script passes via an HTTP tunnel and can easily pass the security barriers of the site (firewalls).
FIGURE 44.35 Schematic of the test platform. [Figure: an HTTP client (browser) on the CCD/Clamart Intranet reaches, via an Ethernet/WorldFIP router, the HART/FIP gateway on the WorldFIP fieldbus; the gateway's channels connect a Rosemount 3051C pressure transmitter (Channel 0), a Rosemount 3144 temperature transmitter (Channel 5), a Fisher DVC 5000 valve (Channel 6), and a resistance (Channel 7).]
44.6.1.5 Directory Images
The directory images is a flat collection of 16 images used by the other pages. There are no links between objects in this directory. Some of the images are used by many different pages placed in different server domains. The idea of such an organization of images is inherited from the site development tool.
44.6.2 Test Description
44.6.2.1 Test Platform
The summary of the architecture presented above serves as a reference for the description of the test scenarios described in this section. The system in which the tests were done is represented by the schematic of Figure 44.35. The embedded server is placed within the HART/FIP gateway connected to a segment of 1 Mbit/sec twisted pair, dual-medium WorldFIP fieldbus. Data transfers over the segment are organized by the bus arbitrator operating with a basic cycle of 20 msec.
The HART interface of the gateway is configured in the mode 6 inputs, 2 outputs. The HART channels are connected as follows:
Channel 0: active, connected to a Rosemount 3051C pressure transmitter
Channels 1 to 4: empty
Channel 5: active, connected to a Rosemount 3144 temperature transmitter
Channel 6: active, connected to a Fisher DVC 5000 valve
Channel 7: simulated active by a resistance enabling a closed-current loop
The WorldFIP segment is connected to the CCD Ethernet-based Intranet via the router node implemented on a Compaq Deskpro computer under Windows NT4. Routing of TCP/IP traffic is done by the native TCP/IP protocol stack of Windows NT4, which works with the standard Ethernet PC board on the Intranet side and with the WorldFIP CC121 board controlled by the WorldFIP NDIS driver.
Test scenarios are executed using a standard Internet browser connected to the Intranet. The two most popular Internet browsers were used for the tests: MS Internet Explorer V5 and Netscape Navigator V4.5.
44.6.3 Test Scenarios
This section describes nine test scenarios that comprise a necessary and sufficient set of operations to prove the correctness of the server's operation. The tests activate all phases of the server's life cycle and put under trial all the designed functions embedded within the protocol gateway.
Some functions of the server are tested by almost all test scenarios, except the first one. This concerns the generic functions of the server loop, which are: request parsing, requested object search and retrieval, and response generation.
Another generic test concerns the process of merging static page templates with dynamically retrieved process data. Almost all HTML pages in the server structure are obtained by this operation, except the Home Page. These generic test objectives are not repeated in the descriptions of the test scenarios below.
44.6.3.1 Execution of Server Initialization Phase
The objective of this scenario is to test the smooth initialization of the server's data structures and the creation of the access point to the network. To execute the scenario, launch the server process and observe the server's console. The absence of error messages means that the software executed up to the beginning of the server loop: that is, the server's execution thread was created, the VFS tree is instantiated, the application wrapper is ready to communicate with the application, the server socket is created, and the server thread is waiting for clients' connections.
Possible erroneous reactions are: a system exception on no available heap space, or a server thread error message on the impossibility of creating the server's passive socket. Only these two fatal errors can be put into evidence by this scenario. The absence of any error messages does not prove correct operation. To be sure that all the above operations are correctly executed, it is necessary to pass all eight remaining scenarios.
44.6.3.2 Server Access via Home Page
The objective of this test is to prove definitively the correct execution of some initial operations and to show that the page composer works on plain HTML pages with no data taken from the process interface.
In this scenario the server machine should be called by the test browser and the server should answer by sending the Home Page. Received page elements should be examined visually in order to detect any visible defect (incoherent text, jammed background, distorted images, applets that fail to operate).
Move the mouse over the buttons of the main menu. The image should change, taking the form of a selected button. Click on a button; this should activate a hyperlink to one of the three pages of the site. It is necessary to test all three buttons on the page.
44.6.3.3 Authentication
The objective of this test is to prove the correct activation of the challenge transaction on protected realms.
In this server, access to the realms di80_params and transmitters requires the client to present the authentication credentials (registered user name and correct password). Access to the Home Page of the server is not protected, but any of the three possible links leading from this page to three destinations should trigger the authentication request. To start the test sequence, first activate any of the three links, for example, the one to the transmitter list. This should force the server to return the response that activates the dialog box on the client's screen and forces the user to provide its credentials. Submission of the credentials username = hartp and password = alstom should open access to the page with the list. From now on, any path within the server's structure should be open and no authentication should be requested any longer until the execution of the client browser is stopped.
If the protection on all three paths from the Home Page is to be tested, the browser should be restarted before the test of each path, in order to make it lose the credentials entered during the first authentication. Otherwise, the first authentication will open the access for all subsequent links from the Home Page to the realms of restricted access.
44.6.3.4 Page of di80_params in Read Only Mode
The objective of this test is to verify the part of the application wrapper module that retrieves the parameters.
Activation of a link to this part of the server should cause the reception of a three-pane frame type page with the five parameter pages in the lower right pane and five links in the lower left pane. Examine all page panes visually. The displayed image should be regular, with no apparent defects; all tables in the parameter pane should be filled in with coherent values. All links in the menu pane should move the appropriate table to the top of the parameter pane.
The unique upper pane should contain the page banner and three links: to the Home Page, to the transmitter list page, and to the page that enables parameter modification. All links in this pane should be active and should lead to the expected functions.
This page should be refreshed every 60 sec.
44.6.3.5 Page of Gateway Parameters in Modification Mode
The objective of this test is threefold: primarily it tests the script that analyzes the set of data provided by the HTTP POST service, secondarily it tests the application wrapper function that modifies the parameters of the gateway, and finally it tests the technique of dynamic generation of some sophisticated parts of HTML pages such as pop-up menus, checkboxes, and groups of radio buttons.
The page that corresponds to the function contains a form that is composed of five groups of items, each form item corresponding to a modifiable gateway parameter. In the part of the scenario concerning the script activated by the POST request, the following test cases are incorporated:
Test of the functions that control the coherence of the formats of the parameters entered into the form items.
Test of the efficiency of parameter modifications.
Test of the control of access protection by password.
The external links to other server functions should also be tested.
44.6.3.6 Retrieval of List of Active Channels
The objective of this test is to verify the part of the application wrapper module that is responsible for providing the global status of the eight HART channels connected to the gateway, and to test the technique of dynamic generation of large context-dependent sections of an HTML page.
As a result of the request of the page with the channel list, all three HART transmitters should be identified and described. The channel with the resistance simulating a closed-current loop should be declared as a Non-HART device. All four empty channels should be labeled as inaccessible. Links giving access to individual transmitter pages should be displayed as active. Disconnection of a channel should be seen in the table after the page update.
The test scenario of this page includes the test of the effectiveness of all the links in this page: those to channel descriptions and those to other server functions.
44.6.3.7 Access to Channel Parameters
The primary objective of this test is to verify the part of the application wrapper module that is in charge of controlling the HART channel parameters. It also tests the correct operation of the script that coordinates the process of channel parameter handling, as well as the dynamic generation of pop-up menus, checkboxes, and radio buttons. The nature of the tests done in this scenario resembles those described for the gateway parameter modification, since one of the parts of this page has a form incorporated. Test cases for this page also concern the coherence of data entered into the form and the function of protection by the password.
The test for the page takes into account the verification of the external links to the Home Page, the transmitter list, and the parameter setting function.
44.6.3.8 Trend Applet
The objective of this test is conceptually more sophisticated than the other scenarios. It puts under verification the data exchange between a Java applet and a server script based on the principle of data tunneling via HTTP protocol PDUs.
The test scenario includes activation of this page and verification of the following cases:
Applet initialization and start-up
Trend refreshment
Applet stop phase
Applet restart
Efficiency of the link back to the channel description
This test should be done for every active channel.
44.6.3.9 Call of a Nonexistent Server Object
The objective of this test is to verify the correct reaction of the server to a request concerning a nonexistent object.
To initiate this test the browser should request a nonexistent server object. This can be done by manually entering the object's URL in the browser's address box. The correct reaction of the server should be the transmission of the page signaling the absence of the requested object.
44.7 Summing Up
This account was conceived as a complement to the reference manual describing the FDWS software modules. It assists developers using the FDWS software library in designing clean and efficient implementations of embedded servers. The structure of the document is organized in a manner in which it can be used as a self-standing guide to comprehension of the technique deployed by the FDWS software. It contains the presentation of the principle of operation of embedded servers, sketches the basis of the technology, and leads the designer through a real-life example toward the solution of a concrete design case.
The major mission of the document is to facilitate the use of the FDWS module library, which contains many routines and is not so easily accessible without a guide. The presentation of the technology is voluntarily conceived as being platform independent (no links to a development tool or to a target platform). The reason for this is that the basic idea of the software is its platform independence. With the same facility, embedded servers based on FDWS technology can be incorporated into a PLC, an I/O nest, or an industrial PC.
The developed software library constitutes the first step on the way to a complete and universally applicable technology. A supplementary effort is necessary in order to increase the software's utility. The biggest progress is needed in the domain of a configuration tool (or tool suite) that would significantly increase the user's comfort in the process of implementation of embedded HTTP servers.
References
[1] Jeremy Bentham, TCP/IP Lean: Web Servers for Embedded Systems, CMP Books, May 2002.
[2] T. Berners-Lee, R. Fielding, and H. Frystyk, Hypertext Transfer Protocol -- HTTP/1.0, Network Working Group, RFC 1945.
44.A1 Appendix: Configuration of VFS
44.A1.1 Programming of the VFS Component
One of the most important components of an embedded server structure is the VFS. This component hosts the central data structure which manages the embedded objects that are the targets of remote client requests.
Building such a component consists in the dynamic generation of adequate data structures that are organized in the form of repository trees.
It is to be recalled that the VFS is composed of three basic elements, realized in general by disjoint modules:
1. A tree-like data structure whose role is equivalent to that of the file system management tables. Through this data structure, called the repository skeleton, the user can find, read, and modify the files embedded in the host environment.
2. A collection of routines which process this data structure.
3. A collection of memory regions storing the embedded files.
The generation process of repository skeletons is programmed as a sequence of calls of specialized routines which create and link together the nodes of the repository (repository root, directory nodes, file nodes, and script nodes) and attach the memory regions storing embedded data to the file nodes. It is important to state that the routines that process embedded file nodes in order to satisfy remote client requests should be generated coherently with the structure of the repository. The method of programming the VFS generation is straightforward, but for big and somewhat complicated repositories, manual maintenance of the generation code (especially maintenance of the coherence of its three above-mentioned elements) can become awkward and time consuming.
The regularity of the operations required to obtain the complete VFS component suggests the possibility of automating its production from a higher level specification. The idea consists in specifying the repository structure and the operations necessary to generate the requested server entities within the same description, expressed in a specialized high-level language. The transformation of such a specification can be realized by a tool that compiles the specification file into executable modules of C-language code (Figure 44.A1).
The specification file is expressed in the language whose syntax is described below.
Specification
Page
generation
module
VFS
generation
module
VFS
compiler
FIGURE 44.A1 Compilation of VFS description into executable modules.
FIGURE 44.A2 VFS specification structure. [Figure: a specification comprises a global section of C code and a repository spec containing a file repository and/or a script repository.]
44.A1.1.1 Specification Structure
A VFS specification is composed of two sections (Figure 44.A2):
An optional global declaration section
An obligatory repository specification section
The global declaration section, which can be omitted in some simple specifications, contains data structure definitions and routines programmed directly in C language. These routines can be invoked in some parts of the second section of the specification.
The second section contains the description of one or two repositories (file repository and/or script repository). The specification of at least one repository is obligatory (Figure 44.A2).
A repository specification is composed of three obligatory elements and one optional one:
Repository type: this element allows the compiler to distinguish between a file repository and a script repository.
Repository name: a character string necessary to identify the repository tree.
By-default node description: a file node specification which is necessary for the description of a file repository.
Repository body: the list of nodes composing the repository; this list is composed of a sequence of file, script, and directory node specifications; the list can be empty.
Each element of the list is described according to a specific syntax. All the elements have a name and a qualifier that identifies the node type (file, script, or directory). Script specifications contain the pointer to the script routine. File descriptions may contain pointers to the memory region containing the file data and a file qualifier that permits determining the nature of the file contents (HTML page, embedded image, embedded applet). Embedded directory descriptors contain a directory body of exactly the same nature as the repository body and, optionally, the list of records holding access credential descriptors (username/password pairs).
The repository root, together with the list of nodes from the repository body, spans the first level of the repository tree. File (and script) nodes are the leaves of the tree, while the directories span the sub-trees of the repository.
To each leaf node qualified as an HTML page one can attach the following two optional sections:
A list of variables that convey the data from outside of the HTML page structure and that serve to inject the data into the page structure.
Sections of C code that process data before merging them with the page template.
This brief textual description of the VFS specification language is followed by a more rigorous syntax specification.
44.A1.2 BNF of the Specification Language
The syntax of the VFS specification presented below uses the widely accepted version of the specification meta-language EBNF (Extended Backus-Naur Form). This version uses the following conventions:
Keywords are in bold uppercase and are enclosed in double quotes.
Nonterminal symbols are in lower case.
Single character terminal symbols are enclosed in single quotes.
The symbols ::= (derivation) and [ ] (optional section) are part of the meta-language.
IDENT, STRING, NUMBER, and SPECIAL_STRING are meta-keywords (keywords transporting a value).
specification ::= [ GLOBAL '{' target_code '}' ] file_repository_spec
[ script_repository_spec ]
file_repository_spec ::= <REPOSITORY> MAIN rep_name def_file_spec
cplx_node_body </REPOSITORY>
script_repository_spec ::= <REPOSITORY> CGI rep_name cplx_node_body
</REPOSITORY>
def_file_spec ::= file_spec
rep_name ::= IDENT
cplx_node_body ::= [ access_list ] node_list
node_list ::= node_spec | node_list node_spec
node_spec ::= file_spec | script_spec | directory_spec
directory_spec ::= <DIRECTORY> dir_name cplx_node_body
</DIRECTORY>
dir_name ::= IDENT
file_spec ::= <FILE> file_name [ CONTENTS = region_name file_proc ]
script_spec ::= <SCRIPT> script_name ROUTINE = routine_name
file_proc ::= qualifier [ '(' param_section ')' ] [ '{' target_code '}' ]
qualifier ::= TEXT | SIZE = NUMBER nature
nature ::= GIF | JPEG | JAVA | TEXT | PLUGIN
param_section ::= param_spec | param_section param_spec
param_spec ::= par_qualif par_name ':' type [ '=' init_value ]
par_qualif ::= DATA | FREE
par_name ::= IDENT
init_value ::= NUMBER | STRING
target_code ::= SPECIAL_STRING
type ::= INTEGER | FLOAT | STRING
44.A1.3 Specification Example
The specification text below gives the complete description of the VFS component presented in Figure 44.A3. The figure shows a repository that contains the following nodes:
One by-default page named ROOT.
A directory public that contains five HTML pages: gauge1, gauge2, di80_param_form, dvc5000_1, dvc5000_2.
A directory images that contains six images in gif and jpeg format: alstom, DI80Mimic, ccd, HartMimic1, sensor, and valve.
A directory javadir that contains one file: Trend.
global{
#include "env_var.h"
}
main
<repository> page_root
<default> ROOT CONTENTS = indexnew_str TEXT
<directory> public.one
<file> gauge1.htm CONTENTS = Hello_page_str TEXT(FREE Ititle:STRING ="Furnace Temperature"
FREE legend:STRING ="Furnace Temperature" FREE IButton:STRING =" gauge 1"
FREE Imval:INTEGER =550 FREE Iinit:INTEGER = 80
FREE Iinterval:INTEGER = 5 FREE Iscriptname:STRING ="gen.cgi?genPar=22"
FREE yaxlegend:STRING ="I/sec")
<file> gauge2.htm CONTENTS = Hello_page_str TEXT(FREE Ititle:STRING ="Cooling Fluid Temperature"
FREE legend:STRING ="Cooling Fluid Temperature" FREE IButton:STRING ="gauge 2"
FREE Imval:INTEGER = 150 FREE Iinit:INTEGER = 55 FREE Iinterval:INTEGER = 2
FREE Iscriptname:STRING ="gen.cgi?genPar=69" FREE yaxlegend:STRING ="m")
<file> di80_param_form.htm CONTENTS = di80_param_form_str TEXT(DATA tag_name:STRING ="cooler"
DATA time_out:INTEGER = 300 DATA retries:INTEGER = 3
DATA refreshment:INTEGER = 75 DATA promptness:INTEGER = 100
DATA version:INTEGER = 1 DATA revision:INTEGER = 1
DATA filter:FLOAT = 0.44 FREE Icheckstr:STRING ="checked"
FREE Icheckstr_bis:STRING =" " FREE Iselstr:STRING = "selected"
FREE Iselstr_bis:STRING = ""
{char* interm;
tdb_result Ires;
get_db_data("OperatingMode",&Ires);
if(Ires.result==c_string && Ires.value.string !=NULL)
if(strcmp(Ires.value.string,"OPERATIONAL")==0 && strcmp(Icheckstr,"checked")!=0 ||
   strcmp(Ires.value.string,"INITIALISATION")==0 &&
   strcmp(Icheckstr,"checked")==0){
   interm = Icheckstr; Icheckstr = Icheckstr_bis; Icheckstr_bis = interm; };
get_db_data("MesureFormat",&Ires);
if(Ires.result==c_string && Ires.value.string !=NULL)
if(strcmp(Ires.value.string,"ANALOG")==0 && strcmp(Iselstr,"selected")!=0 ||
strcmp(Ires.value.string,"DIGITAL")==0 && strcmp(Iselstr,"selected")==0){
interm = Iselstr; Iselstr = Iselstr_bis; Iselstr_bis= interm;
}; })
<file> dvc5000_1.htm CONTENTS = dvc5000_str TEXT (FREE Iact:STRING = "dvc5000_1"
FREE Ititle:STRING = "Input Valve" FREE Imamps:FLOAT = 0.0
FREE Itrav:FLOAT = 0.0 DATA DriveSign:FLOAT = 0.0
{Imamps = 4.0+16.0*IDriveSign.value.real/100.0;
Itrav=IDriveSign.value.real;})
<file> dvc5000_2.htm CONTENTS = dvc5000i_str TEXT (FREE Iact:STRING ="dvc5000_2"
FREE Ititle:STRING = "Output Valve - inversed drive"
FREE Imamps:FLOAT = 0.0 FREE Itrav:FLOAT = 0.0 DATA DriveSignl:FLOAT = 0.0
{Imamps = 4.0+16.0*(1-IDriveSignl.value.real/100.0);
Itrav = IDriveSignl.value.real;})
</directory>
<directory> images
<file> alstom.gif CONTENTS =alstom_img SIZE = alstom_img_length GIF
<file> DI80Mimic.gif CONTENTS = DI80Mimic_img SIZE = DI80Mimic_img_length GIF
<file> ccd.gif CONTENTS = ccd_img SIZE = ccd_img_length GIF
<file> HartMimic1.jpeg CONTENTS = HartMimic1_img SIZE = HartMimic1_img_length JPEG
<file> sensor.gif CONTENTS = sensor_img SIZE = sensor_img_length GIF
<file> valve.gif CONTENTS = valve_img SIZE = valve_img_length GIF
</directory>
<directory> javadir
<file> Trend.class CONTENTS = Trend_bcode SIZE = Trend_bcode_length JAVA
</directory>
</repository>
cgi
<repository> script_rep
<script> set.cgi ROUTINE = first_script
<script> xy_coordinates.cgi ROUTINE = coordinates
<script> gen.cgi ROUTINE = generator
</repository>
FIGURE 44.A3 Configuration script of the embedded server contents.
The configuration file also contains the CGI repository, which holds three scripts.
Compiling this specification file produces two files in ANSI C: one generating the VFS skeleton
and the other generating the page-composition routine. Below we present the contents of the file
generating the VFS skeleton (Figure 44.A4).
Embedded Web Servers in Distributed Control Systems 44-43
#include "DI80_VFS_0.h"
#include "data_base_processing.h"
extern tdata_base_struct server_root; extern tdata_base_struct cgi_bin;
extern char* indexnew_str; extern char* starter_str;
extern char* Hello_page_str; extern char* di80_param_form_str;
extern char* dvc5000_str; extern char* dvc5000i_str;
extern int alstom_img_length;
extern int DI80Mimic_img_length;
extern int ccd_img_length;
extern int HartMimic1_img_length;
extern int sensor_img_length;
extern int valve_img_length;
extern int Trend_bcode_length;
extern void first_script(int,...);
extern void coordinates(int,...);
extern void generator(int,...);
static tdata_base_struct db_page_root_gen_0(void)
{tdata_base_struct Irepository;
tdata_base_struct Iptrstack[10];
Irepository = InitRepository(NULL,"ROOT","page_root");
Iptrstack[0]=BuildFileNode("ROOT",indexnew_str,0,0,1);
AppendNode(Irepository,Iptrstack[0]);
Iptrstack[0]=BuildFileNode("starter.htm",starter_str,0,1,1);
AppendNode(Irepository,Iptrstack[0]);
Iptrstack[0]=BuildDirNode("public.one");
Iptrstack[1]=BuildFileNode("gauge1.htm",Hello_page_str,0,2,1);
InsertNode(Iptrstack[0],Iptrstack[1]);
Iptrstack[1]=BuildFileNode("gauge2.htm",Hello_page_str,0,3,1);
InsertNode(Iptrstack[0],Iptrstack[1]);
Iptrstack[1]=BuildFileNode("di80_param_form.htm",di80_param_form_str,0,4,1);
InsertNode(Iptrstack[0],Iptrstack[1]);
Iptrstack[1]=BuildFileNode("dvc5000_1.htm",dvc5000_str,0,5,1);
InsertNode(Iptrstack[0],Iptrstack[1]);
Iptrstack[1]=BuildFileNode("dvc5000_2.htm",dvc5000i_str,0,6,1);
InsertNode(Iptrstack[0],Iptrstack[1]);
AppendNode(Irepository,Iptrstack[0]);
Iptrstack[0]=BuildDirNode("images");
Iptrstack[1]=BuildFileNode("alstom.unused.gif",alstom_img,alstom_img_length,7,2);
InsertNode(Iptrstack[0],Iptrstack[1]);
Iptrstack[1]=BuildFileNode("DI80Mimic.old.gif",DI80Mimic_img,DI80Mimic_img_length,8,2);
InsertNode(Iptrstack[0],Iptrstack[1]);
Iptrstack[1]=BuildFileNode("ccd.unused.gif",ccd_img,ccd_img_length,9,2);
InsertNode(Iptrstack[0],Iptrstack[1]);
Iptrstack[1]=BuildFileNode("HartMimic1.jpeg",HartMimic1_img,HartMimic1_img_length,10,3);
InsertNode(Iptrstack[0],Iptrstack[1]);
Iptrstack[1]=BuildFileNode("sensor.gif",sensor_img,sensor_img_length,11,2);
InsertNode(Iptrstack[0],Iptrstack[1]);
Iptrstack[1]=BuildFileNode("valve.gif",valve_img,valve_img_length,12,2);
InsertNode(Iptrstack[0],Iptrstack[1]);
AppendNode(Irepository,Iptrstack[0]);
Iptrstack[0]=BuildDirNode("javadir");
Iptrstack[1]=BuildFileNode("Trend.class",Trend_bcode,Trend_bcode_length,13,4);
InsertNode(Iptrstack[0],Iptrstack[1]);
AppendNode(Irepository,Iptrstack[0]);
return Irepository;
}
static tdata_base_struct db_script_rep_gen_1(void)
{tdata_base_struct Irepository;
tdata_base_struct Iptrstack[10];
Irepository = InitRepository(NULL,NULL,"script_rep");
Iptrstack[0]=BuildScriptNode("set.cgi",16,first_script);
AppendNode(Irepository,Iptrstack[0]);
Iptrstack[0]=BuildScriptNode("xy_coordinates.cgi",17,coordinates);
AppendNode(Irepository,Iptrstack[0]);
Iptrstack[0]=BuildScriptNode("gen.cgi",18,generator);
AppendNode(Irepository,Iptrstack[0]);
return Irepository;
}
FIGURE 44.A4 Code generated by the configuration tool set from the script in Figure 44.A3.
45
HTTP Digest Authentication for Embedded Web Servers
Mario Crevatin and Thomas P. von Hoff
ABB Switzerland Ltd.
45.1 Introduction ......................................... 45-1
     Motivation • Security Objectives • Outline
45.2 Security Extensions in the TCP/IP Stack ............. 45-3
     Link Layer Security • IPSec • Secure Sockets Layer/Transport
     Layer Security • Application Layer Security
45.3 Basic Access Authentication Scheme .................. 45-4
45.4 DAA Scheme .......................................... 45-5
     Cryptographical Prerequisites • Digest Authentication •
     Digest Authentication with Integrity Protection • Digest
     Authentication with Mutual Authentication • Summary
45.5 Weaknesses and Attacks .............................. 45-9
     Basic Authentication • Replay Attacks • Man-in-the-Middle
     Attack • Dictionary Attack/Brute Force Attack • Buffer
     Overflow • URI Check
45.6 Implementations ..................................... 45-11
     Servers • Browsers • DAA Compatibility
45.7 Conclusions ......................................... 45-12
Appendix: A Brief Review of the HTTP ..................... 45-12
Acknowledgment ........................................... 45-14
References ............................................... 45-14
45.1 Introduction
45.1.1 Motivation
The application area of the Hypertext Transfer Protocol (HTTP) is growing steadily. While it was
originally intended as the protocol to transfer HTML files, it is increasingly used by other applications.
One reason is that the port assigned to HTTP is almost never blocked by a firewall. Thus, running an
application on top of HTTP allows communication through network security elements such as packet
filters. Examples of such applications are web mail and Web-based Distributed Authoring and Versioning
(WebDAV) [1,2]. Since these web services contain no security features in their specification, they depend
45-1
on security provided by HTTP or lower protocol layers. Most implementations of protocols below HTTP
do not provide user authentication, hence this service is offered by extensions to HTTP, namely basic and
Digest Access Authentication (DAA) [3].
In today's industrial communication, the trend is to replace proprietary communication protocols by the
standardized TCP/IP protocol stack [4]. This is also owing to the increased connectivity of automation
networks, which opens new opportunities to improve the efficiency of operation and maintenance
of automation systems. In the course of this development the number of embedded web servers has
increased rapidly. These web servers allow web-based configuration, control, and monitoring of devices
and industrial processes. Owing to the connectivity of the communication networks of the various
hierarchy levels (control network, Local Area Network [LAN], Wide Area Network [WAN]), establishing
access to any device from any place in the plant, or even globally, becomes technically feasible. However,
in addition to many opportunities, this technology also brings many security challenges [13].
Embedded web servers usually run on processors with limited resources, in terms of both memory
and processor power. These restrictions favor the deployment of lightweight security mechanisms. Vendors
offer tailored versions of comprehensive security protocol suites such as Secure Sockets Layer (SSL)
and IP Security Protocol (IPSec). However, even these versions may not be suitable for all types of
processors and applications, owing to their memory and computational requirements. Where
applications are restricted to HTTP, DAA is an alternative solution [5]. This protocol extension
to HTTP is economical in terms of memory and processor power. Although designed for
user authentication in particular, several further services were included in its original definition. In this
chapter, we focus on the mechanisms and services as well as on the potential applications of HTTP digest
authentication.
45.1.2 Security Objectives
We distinguish the following security objectives for communication systems:
Confidentiality: Guarantees that information is shared only among authorized persons or
organizations. Encryption of the transmitted data using cryptography prevents unauthorized
disclosure.
Integrity: A system protects the integrity of data if it makes any modification detectable. This can
be achieved by adding a cryptographic check sum.
Authenticity: Guarantees that a receiver of a message can ascertain its origin and that an intruder
cannot masquerade as an authorized person. Authenticity is a prerequisite for access control.
Access control: Guarantees that only authorized people or devices have access to specific information.
Availability: Guarantees that a resource is always available.
In business and commercial environments, auditability, nonrepudiability, and third-party protection also
belong to the set of security objectives. Note that the relevance of the individual security objectives varies
from case to case and depends much on the specific application. A business web application where monetary
transactions may be involved has different security requirements than an industrial application. While
for the former confidentiality of the data transfer is a major issue, it is less sensitive in the
latter case. In turn, other security objectives such as user authentication and integrity protection are much
more critical in industrial communication. These considerations become an issue in particular when the
embedded web server is not within a well-protected network but is installed at a remote location. Such
situations may occur in distributed applications.
45.1.3 Outline
First, an overview of the services of the security extensions in the TCP/IP protocol suite is given, with
a focus on SSL and IPSec. Starting with a brief review of the HTTP message exchange, the mechanisms
of HTTP basic and digest authentication are detailed and all their additional useful options (integrity
protection and mutual authentication) are discussed. Furthermore, the current implementation status
of some (embedded) web servers (Apache 2.0.42, Allegro RomPager 4.05, GoAhead 2.1.2) and browsers
(Mozilla 1.01, Internet Explorer 6.0.26, Opera 6.05) is investigated. The results of functionality and
interoperability tests are presented.
45.2 Security Extensions in the TCP/IP Stack
Security services are provided at different layers in the TCP/IP communication protocol suite by
appropriate protocol extensions [6]. An overview of those extensions is shown in Figure 45.1. The communication
protocol stack concept makes the security services of a given layer transparent to the upper layers. The
security extensions on the Internet layer and the transport layer, IPSec and SSL, respectively, provide
a large range of security services and have therefore been widely implemented.
45.2.1 Link Layer Security
As extensions to the Point-to-Point Protocol (PPP), the (cryptographically weak) Password Authentication
Protocol (PAP) and the stronger Challenge Handshake Authentication Protocol (CHAP) provide
authentication. To establish secure tunnels with a PPP connection into a LAN or a WAN, the Point-to-Point
Tunnel Protocol (PPTP) or the Layer 2 Tunnel Protocol (L2TP) can be used.
45.2.2 IPSec
This network layer security protocol is particularly useful if several network applications need to be
secured. As protection is applied at the IP layer, IPSec provides a single means of protection for most
data exchanges (UDP and TCP applications). It is transparent to all upper layers. The security services
provided by IPSec are:
Access control (IP filtering)
Data integrity
Encryption (optional)
Data origin authentication (optional)
These services are based on cryptographic mechanisms guaranteeing a high security level when used with
strong algorithms. However, a drawback of IPSec is that a specific configuration is required for each
host-to-host link. While IPSec provides machine-to-machine security, it cannot authenticate the
user. Therefore, IPSec is mainly deployed to establish Virtual Private Networks (VPNs).
An IPSec implementation on a Coldfire MCF5307 65 MHz processor showed a program memory
requirement of 64 KB, without support of the Internet Key Exchange (IKE) Protocol. Experiments
FIGURE 45.1 Network layers and associated security protocols (application layer: basic/digest
authentication, PGP, SSH; transport layer: SSL/TLS; Internet layer: IPSec; link layer: PPTP, L2TP,
PAP, CHAP).
consisting of ping requests between two Coldfire processors were performed. The delay between a ping
request and the reception of its reply was observed to become twice or even three times longer
compared with the unprotected case, when IPSec was activated using the Authentication Header (AH) or the
Encapsulating Security Payload (ESP) configuration, respectively.
45.2.3 Secure Sockets Layer/Transport Layer Security
The SSL is a protocol created by Netscape Communications Corporation. The standardized version is also
known as Transport Layer Security (TLS). SSL is transparent to the end user and to the upper protocol
layers. It protects all applications running on top of TCP, but does not protect UDP applications. The
https prefix of the URI (Uniform Resource Identifier) and the lock icon in the browser GUI indicate
that the SSL protocol is in use. If the server's certificate is not signed by a certificate authority trusted by
the client (browser), the user is prompted to accept or refuse the certificate.
The security services provided by SSL/TLS are:
Session key management and negotiation of cryptographic algorithms
Confidentiality using encryption
Server authentication using certificates
Data integrity protection
Secure Sockets Layer includes optional client authentication, which is rarely performed in practice. Under
the encryption protection provided by SSL, user authentication is often implemented at application
level. In summary, SSL provides a high level of security, but has high memory and computation
requirements, particularly when considering the constraints of embedded web servers.
45.2.4 Application Layer Security
The procedures described in the previous sections operate on the lower layers and focus on the
authentication of machines. On the application layer, individual applications may provide their own security
enhancements. Typical security tools are PGP/GnuPG to secure mail transfer and SSH (secure shell).
For HTTP, there exist the protocol extensions HTTP basic and digest authentication, which authenticate
users to control their access to protected documents. Authentication of the user is in contrast to the machine
authentication provided by the protocols described in the sections above. Since the remainder of this
chapter focuses on the protocol extensions of HTTP, a brief review of HTTP is given in the Appendix.
45.3 Basic Access Authentication Scheme
The HTTP basic authentication scheme [3] is the simplest authentication scheme and provides only weak
protection, since username and password can be discovered by eavesdropping on the message exchange.
The HTTP message exchange for basic authentication is depicted in Figure 45.2. On reception of a
401 unauthorized message, the client (browser) prompts the user for his or her username and password.
These are transmitted in the clear in the authorization request-header field, for each accessed document
within the same protection space (Figure 45.3).
1. The browser issues an HTTP GET command to the server, with the requested URI.
2. The server answers with a 401 unauthorized HTTP error code and requests the browser to send a
valid username and password (credentials)¹ using basic authentication. The realm² (string) is also
included in the challenge sent to the client. These parameters are part of the WWW-authenticate
request-header field.

¹ Credentials: Information that can be used to establish the identity of an entity. Credentials include things such as
private keys, tickets, or simply a username and password pair. This is also known as the shared secret.
² Realm: Name identifying a protection space (zone) on a server. Usually shown to the user at the password prompt.
FIGURE 45.2 HTTP basic authentication negotiation. Message sequence:
Client → Server: GET URI HTTP/1.1
Server → Client: HTTP/1.1 401 unauthorized; WWW-authenticate: Basic realm="Basic Test Zone"
Client → Server: GET URI HTTP/1.1; Authorization: Basic dGVzdDp0ZXN0
Server → Client: HTTP/1.1 200 OK; <data>
FIGURE 45.3 Internet Explorer's basic authentication prompt.
3. The browser prompts the user for username and password. The realm (here "Basic Test Zone") is
usually displayed to the user. The credentials are sent with a new GET request, encoded in Base64
format. Decoding is trivial because Base64 is a simply invertible encoding scheme. The credentials
are sent in the authorization response-header field.
4. After the server has checked and accepted the password, the requested document is sent to the client
with an HTTP 200 response.
5. The client displays the document, and automatically sends the same credentials for any subsequent
request made under the same protection space. Hence the password is sent unencrypted with each
request.
Note that, unless confidentiality is provided by some other security protocol on a lower layer (see
Section 45.2), username and password are transmitted in an unprotected way.
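The weakness is easy to demonstrate: the Authorization header carries only a Base64 encoding of username:password, which anyone on the path can invert. The sketch below (in Python, for illustration) reproduces the token shown in Figure 45.2 for the credentials test:test.

```python
import base64

def basic_auth_header(username, password):
    # The credentials are joined with ':' and Base64-encoded. This is an
    # encoding, not encryption: it is trivially reversible.
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return f"Authorization: Basic {token}"

header = basic_auth_header("test", "test")
print(header)   # Authorization: Basic dGVzdDp0ZXN0

# An eavesdropper recovers the password with a single decode call.
token = header.split()[-1]
print(base64.b64decode(token).decode())   # test:test
```

The decoded output shows why basic authentication provides no confidentiality whatsoever for the shared secret.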
45.4 DAA Scheme
45.4.1 Cryptographical Prerequisites
Unless public keys are used, authentication is based on a shared secret (credentials) between the
authenticating and the authenticated entity. Usually, these credentials consist of the relation between a username
FIGURE 45.4 Internet Explorer's digest authentication prompt.
and its password (Figure 45.4). One possibility to authenticate a peer over the network is the submission
of username and password, as done in the basic authentication scheme (see Section 45.3). However,
since they are transmitted in the clear over the network, an attacker with access to the network traffic can
eavesdrop on the credentials. The challenge/response concept solves this problem by avoiding sending the
password in the clear. Instead, the authenticating entity A sends a challenge x to the entity B to be
authenticated. B calculates the response z_B = f(x, y), where y is the shared secret between A and B. A also
calculates z_A = f(x, y) and checks whether z_A coincides with z_B. If so, the identity of B is proven to A.
To make the challenge and response procedure secure, there are two requirements. First, x needs to
be random, so that z_B is of no value to any attacker intercepting it. Second, f should be a one-way hash
function [7]. The properties of hash functions are:
A finite-length output message (hash) is calculated from an arbitrary-length input message.
It is easy to determine the output message.
Given an output message, it is hard to find a corresponding input message.
It is hard to find another input message with the same output message (hash).
Functions having these properties meet the requirements for cryptographic check sums. It is their quasi-
uniqueness (it is hard to find two input messages producing the same output) that allows an owner B of a
message z to show only the message's hash to another user A, while A can still trust that z is known by B.
The same property is used to protect the integrity of data storage or message transmission. A comparison
of the hash of the received message or the stored data with the original hash will detect any change in the
message or the data. The most frequently used hash functions are MD5 and SHA-1 [7].
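The challenge/response exchange above can be sketched in a few lines. This is an illustration only: the function f, the secret, and the variable names are placeholders, with MD5 standing in for the one-way function.

```python
import hashlib
import secrets

SHARED_SECRET = "s3cret"   # y: known to both A and B, never transmitted

def f(challenge, secret):
    # One-way function f(x, y): here, an MD5 hash over challenge and secret.
    return hashlib.md5(f"{challenge}:{secret}".encode()).hexdigest()

# A (the authenticator) sends a fresh random challenge x ...
x = secrets.token_hex(16)

# ... B proves knowledge of y by returning z_B = f(x, y) ...
z_b = f(x, SHARED_SECRET)

# ... and A recomputes z_A = f(x, y) and compares.
z_a = f(x, SHARED_SECRET)
assert z_a == z_b   # identity of B is proven to A

# An intercepted z_B is worthless to an attacker: the next challenge differs,
# so its response cannot be replayed.
x_next = secrets.token_hex(16)
assert f(x_next, SHARED_SECRET) != z_b
```

The randomness of x is what makes an intercepted response single-use, and the one-way property of f is what keeps y hidden even though z_B travels in the clear.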
45.4.2 Digest Authentication
Although very similar to the basic authentication scheme, DAA [3] is much more secure. Instead of
sending the username and the password in an unprotected way, a unique code is calculated from username,
password, and a unique number received from the server. Figure 45.5 shows the HTTP DAA transactions
between a web server and a browser:
1. The browser requests a document in the usual way, with an HTTP request message.
2. The server sends back a 401 unauthorized challenge response message. The server generates a
nonce (number used once) and sends it to the client. Note that the nonce must be different for
every 401 message.
FIGURE 45.5 HTTP digest authentication negotiation. Message sequence:
Client → Server: GET /protected/test.html HTTP/1.1
Server → Client: HTTP/1.1 401 unauthorized; WWW-authenticate: Digest realm="DigestZone",
nonce="3gw6...", algorithm=MD5, domain="/protected", qop="auth"
Client → Server: GET /protected/test.html HTTP/1.1; Authorization: Digest username="test",
realm="DigestZone", nonce="3gw6...", algorithm=MD5, uri="/protected/test.html",
qop=auth, response="65bia...", nc=0001, cnonce="82c..."
Server → Client: HTTP/1.1 200 OK; Authentication-Info: rspauth="d9260...",
qop=auth, nc=0001, cnonce="82c..."; <data>
3. The browser prompts the user for username and password, and computes a 128-bit response
using the MD5 algorithm [7] as a one-way hash function:
response=MD5[MD5(username:realm:password):nonce:nc:cnonce:qop:MD5(method:uri)]
This response is sent to the server along with the received nonce, the requested uri, its own
generated cnonce (client nonce), and the username. qop stands for Quality of Protection and
indicates whether additional integrity protection is provided. In this example, this is not the case,
hence qop=auth.
4. The server calculates its own response following the same scheme as given in Step 3, using the
information it sent to the client before and its own version of the username and password (or
optionally a hashed form of them). It compares the received version with the computed one and grants
access to the resource (HTTP 200 OK response) if the results match.
If the authorization fails, a new 401 error message is sent. All 401 error messages include an
HTML error page to be displayed by the browser. Browsers usually reprompt the user for a new
username and password three times before giving up and displaying the error page.
5. For any subsequent request, the client usually generates a different cnonce. A counter, nc, is
incremented. This new cnonce and counter, along with the new uri, are used to recompute a new
valid response value. Usually the browser stores the username and the password temporarily in
memory in order to allow a user to reaccess a given protection space without retyping the password.
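The response computation of Step 3 follows directly from the formula. The sketch below illustrates it for qop=auth; the nonce, nc, and cnonce values are placeholders for illustration, not the elided values from Figure 45.5.

```python
import hashlib

def md5_hex(s):
    return hashlib.md5(s.encode()).hexdigest()

def digest_response(username, realm, password, method, uri,
                    nonce, nc, cnonce, qop="auth"):
    # HA1 = MD5(username:realm:password); a server may store this hash
    # instead of the clear-text password.
    ha1 = md5_hex(f"{username}:{realm}:{password}")
    # HA2 = MD5(method:uri)
    ha2 = md5_hex(f"{method}:{uri}")
    # response = MD5(HA1:nonce:nc:cnonce:qop:HA2)
    return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")

# Example with placeholder nonce values:
resp = digest_response("test", "DigestZone", "test",
                       "GET", "/protected/test.html",
                       nonce="3gw6", nc="0001", cnonce="82c")
print(resp)   # a 32-hex-digit MD5 response
```

Both sides run this same computation; the server grants access only when the two 128-bit results match, so the password itself never crosses the network.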
45.4.3 Digest Authentication with Integrity Protection
The RFC for digest authentication [3] provides the capability to include the hash of the entity (the payload,
usually HTML code) in the computed MD5-hash response:
response=MD5[MD5(username:realm:password):nonce:nc:cnonce:qop:MD5
(method:uri:MD5{entity})]
In this way, any modification of the transmitted information will result in a different MD5-hash response
and is thus easily detectable. While the integrity of the document from the server is ensured in a
response message, the integrity of POST data is protected in a request message. To indicate that the server
supports integrity protection, the argument qop is set to auth-int.
FIGURE 45.6 Mutual authentication mechanism. The client requests a document; the server answers
with a challenge (nonce); the client returns its authorization (response) together with its own challenge
(cnonce); the server answers with its authorization (rspauth).
Note that, for GET requests with arguments, the integrity of the payload (the arguments) is already
protected without the option qop=auth-int, because the URI with its arguments is included in the
MD5-hash. For integrity protection of the response from the server, the rspauth field must be present.
See Section 45.4.4 on mutual authentication for details of rspauth.
45.4.4 Digest Authentication with Mutual Authentication
We have already seen that digest authentication identifies the client. However, DAA also foresees authentication
of the server by the client, providing mutual authentication. The server already knows the client is
trustworthy, because the browser has sent proof that the user knows their shared secret. This occurred
when sending the response to the server (see Step 3 in Figure 45.5). Exactly the same mechanism is
used to authenticate the server. After receiving correct client credentials along with the GET request, the
server sends back proof that it also knows the shared secret. This is done via the rspauth field sent
in the Authentication-Info header of the HTTP 200 OK message, along with the previously
requested document. The challenge initiating the server response is the cnonce from the client.
rspauth=MD5[MD5(username:realm:password):nonce:nc:cnonce:qop:MD5(:uri)]
The browser uses this information to authenticate the server. If integrity protection is activated, the hash
of the entity is included in rspauth:³
rspauth=MD5[MD5(username:realm:password):nonce:nc:cnonce:qop:MD5(:uri:MD5{entity})]
The nonce from the server is used to challenge the client with an unpredictable number. In the same way,
when server authentication is used for mutual authentication with DAA, the cnonce from the client is
used as a challenge that the server cannot predict. Therefore, it is not possible to precompute responses to
those challenges.
This is known as a mutual challenge/response mechanism. Its concept is depicted in Figure 45.6.
45.4.5 Summary
DAA offers a secure authentication scheme with low implementation complexity, well adapted to
embedded systems. In this environment, authentication and integrity protection are important, whereas
confidentiality is often not required. Unfortunately, integrity protection and mutual authentication are
not yet supported by today's typical DAA implementations on clients and servers.
³ The only difference from the client response is the missing field method.
TABLE 45.1 Comparison Between DAA and SSL on Functionality and Footprint Size

                                DAA            SSL
Mandatory features
  Client authentication         x
  Server authentication                        x
  Data integrity                               x
  Data confidentiality                         x
Optional features
  Client authentication                        x
  Server authentication         x
  Data integrity                x
  Data confidentiality
Memory requirements
  RAM                           1648 Byte^a    100 KB-250 KB^b
  ROM                           6312 Byte^a    200 KB^c

^a According to our own measurements on a Coldfire MCF5307.
^b RAM required depends on the number of simultaneous secure connections (1 to 24),
according to Reference 8.
^c According to information from Allegro.
Table 45.1 compares mandatory and optional features of DAA and SSL, as well as their memory
requirements. Note that at the time our tests were conducted, the optional features were supported by some
implementations only.
45.5 Weaknesses and Attacks
45.5.1 Basic Authentication
If an attacker has access to the network, he can eavesdrop on the HTTP transaction very easily and obtain
the username and the password if no encryption is provided, for example, by SSL or IPSec. In such
cases, HTTP basic authentication should be replaced by the digest scheme (Table 45.2).
45.5.2 Replay Attacks
Vulnerabilities appear when data, for example, sensitive commands, are sent to the server. With POST and
some other HTTP methods, such data is transmitted in the entity part of the message. An attacker could
replay credentials from an intercepted POST request with tampered form data (commands), thus taking
control of the remote server (automation system). In contrast, the arguments sent in a GET request are
part of the input to the digest computation. Thus, GET requests are safer and should preferably be used
to send data to the server. The fact that the credentials are stored in the browser's cache when using the
GET method has no effect on security, because of the uniqueness of the nonce.
Proper nonce generation together with a reliable check for uniqueness provides good protection against
replay of previously used valid credentials. Although the definition of a nonce requires its uniqueness,
implementers might be tempted to reuse a nonce; this must be avoided. Within a session, a replay attack
is prevented by incrementing the counter nc, making the previously calculated hash value invalid. Note
that the change of one bit in the argument of a one-way hash function changes on average half of the bits
of the hash value [7]. In addition, integrity protection would prevent tampering with information when
POST is used.
On an embedded device, where usually only one legitimate user at a time is likely to access the system,
locking the protection space (realm) to only one client at a time also contributes to protection.
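A server-side nonce scheme along these lines might look as follows. This is an illustrative sketch, not a production design; the bounded set and the eviction policy are assumptions made for a memory-constrained device.

```python
import hashlib
import secrets
import time

# Nonces already handed out; kept bounded on a memory-constrained device.
_issued = set()

def make_nonce(max_outstanding=64):
    # A nonce must be unpredictable and unique: mix a timestamp with
    # random bytes, then hash so the value stays opaque to the client.
    raw = f"{time.time()}:{secrets.token_hex(8)}"
    nonce = hashlib.md5(raw.encode()).hexdigest()
    if len(_issued) >= max_outstanding:
        _issued.pop()   # evict an arbitrary old nonce to bound memory
    _issued.add(nonce)
    return nonce

def accept_nonce(nonce):
    # Honor each nonce exactly once: a replayed request is rejected
    # because its nonce has already been consumed.
    try:
        _issued.remove(nonce)
        return True
    except KeyError:
        return False

n = make_nonce()
assert accept_nonce(n) is True    # first use accepted
assert accept_nonce(n) is False   # replay rejected
```

Real servers typically refine this with per-nonce request counters (checking that nc strictly increases) rather than strict one-shot nonces, but the uniqueness check is the core of the replay defense.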
TABLE 45.2 Comparison between DAA Server Implementations

Features                       Apache   RomPager   GoAhead
Basic access authentication      yes       yes       yes
DAA                              yes       yes       yes
Mutual authentication            yes       no        no
Integrity protection             no        no        no
Nonce check                      yes       yes       no
URI check                        yes       yes       no
45.5.3 Man-in-the-Middle Attack
An attacker might be able to insert himself into the path between a client and a server. Capturing the packet containing the server's response with the challenge for digest authentication, he can replace the WWW-Authenticate field with a field requesting basic authentication from the client. Username and password can thus be gained, as the client returns them in an unprotected form owing to the faked WWW-Authenticate field.
Disabling basic authentication in the browser and requiring mutual authentication could prevent such an attack.
45.5.4 Dictionary Attack/Brute Force Attack
This attack assumes that the user chooses a simple password. In the client's request message (Step 3), all information needed to calculate the response field, apart from the password, is available. Having such a message in his hands, an attacker might thus compute thousands of responses generated with a list of possible passwords from a dictionary and see whether one coincides with the response sent by the browser. The success probability of such a dictionary attack can be decreased by proper selection of the password, for example, avoiding common words and including lower- and upper-case letters as well as special characters.
For a brute force attack, the list of possible passwords is replaced by all combinations of characters. However, the fact that a hash must be calculated for each password guess makes a brute force attack very expensive.
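The cost structure of such an attack can be illustrated with the RFC 2617 response calculation for qop=auth. The function names and field values below are illustrative, but the hashing chain (HA1, HA2, response) follows the RFC:

```python
import hashlib

def md5_hex(s: str) -> str:
    return hashlib.md5(s.encode()).hexdigest()

def digest_response(username, realm, password, method, uri,
                    nonce, nc, cnonce, qop="auth"):
    """RFC 2617 response for qop=auth:
    response = MD5(HA1:nonce:nc:cnonce:qop:HA2)"""
    ha1 = md5_hex(f"{username}:{realm}:{password}")
    ha2 = md5_hex(f"{method}:{uri}")
    return md5_hex(f"{ha1}:{nonce}:{nc}:{cnonce}:{qop}:{ha2}")

def guess_password(observed_response, wordlist, **public_fields):
    """Dictionary attack sketch: every field except the password is
    visible in the sniffed request, so the attacker recomputes the
    response for each candidate until one matches."""
    for candidate in wordlist:
        if digest_response(password=candidate, **public_fields) == observed_response:
            return candidate
    return None
```

Each guess costs three MD5 computations, which is exactly what makes exhaustive search over a large character space expensive.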
45.5.5 Buffer Overflow
In a buffer overflow attack, the attacker sends very long commands to the server [9]. In embedded web servers, web pages often serve as an entry to CGI programs. Input data for such a program are transmitted in the HTTP message. If the size of such data is too large, the data can overwrite memory beyond the intended buffer, including the original executable code of the browser/server, and be executed by the processor, unless a careful error check is performed. Mostly, this will crash the server, resulting in a denial of service. However, it can also allow a skilled attacker to gain access to everything the server has access to, for example, confidential information or the control of the devices in the automation system.
Buffer overflows can be avoided when all applications perform careful data range checks. For integrated third-party code it is crucial to update the code to the latest available version and to apply any security patches or service packs as soon as they become available.
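Such a range check is simple to state, even if it is often forgotten. The sketch below, with an invented buffer limit and handler, only illustrates the principle of rejecting oversized entity data before it can reach a fixed-size buffer:

```python
# Illustrative server-side range check before handing HTTP entity data to
# a CGI program. The limit and the error code are example choices, not
# taken from any particular embedded web server.
MAX_ENTITY = 1024   # size of the buffer the CGI program can safely hold

def read_entity(headers: dict, body: bytes):
    """Return (status, entity). Refuse the request outright if either the
    declared Content-Length or the actual body exceeds the buffer size."""
    length = int(headers.get("Content-Length", "0"))
    if length > MAX_ENTITY or len(body) > MAX_ENTITY:
        return (413, b"")            # 413: request entity too large
    return (200, body[:length])
```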
45.5.6 URI Check
Some web servers do not check whether the uri field located in the Authorization header corresponds to the requested URI in the GET request. An attacker may replay a stored valid request from a client for
a given uri, but modify the GET line to obtain some other protected documents he wants or, possibly even more dangerous, the server may accept modified parameters in a GET request. This might allow an attacker to send arbitrary commands to an embedded automation device. See the uri check in Section 45.6.
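A minimal form of the missing check is a literal comparison of the uri field in the Authorization header with the request line. The helper below illustrates the principle and is not taken from any of the servers examined:

```python
import re

def uri_matches(request_line: str, authorization: str) -> bool:
    """Verify that the uri= parameter inside the Authorization header
    matches the resource on the request line, so a replayed header cannot
    be attached to a GET for a different document or with altered
    parameters."""
    # request line looks like: 'GET /status.html?cmd=on HTTP/1.1'
    requested = request_line.split()[1]
    m = re.search(r'uri="([^"]*)"', authorization)
    return bool(m) and m.group(1) == requested
```

Because the uri field is covered by the digest, a mismatch detected here also means the replayed response value is invalid for the URI actually requested.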
45.6 Implementations
Some available embedded web server implementations and browsers have been tested for their support of DAA. Investigations have shown that the implementations do not match the specification in every aspect. This section briefly outlines the results.
45.6.1 Servers
1. Apache 2.0.42. Among the DAA implementations tested, Apache has the best one. While mutual authentication is in place and working, integrity protection is not yet implemented. The nonce lifetime is adjustable and the uri is checked. Apache is also the most robust server in terms of resistance to exploits, since a large user community uses and tests it continuously. Note that DAA is compatible with the Opera browser only if the AuthDigestDomain directive is configured in the file .htaccess. The Internet Explorer (IE) removes the arguments when copying the GET query into the uri field. Therefore, DAA is not compatible between IE and Apache for GET requests with parameters. The nonce is composed of a timestamp (in the clear), followed by the "=" sign, and the SHA1 hash [7] of the previous timestamp, realm, and a server secret (private key):
nonce = timestamp "=" SHA1(timestamp + secret)
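A nonce of this style, a cleartext timestamp bound to a keyed hash so the server can verify age and authenticity without storing state, can be sketched as follows. The field order, secret, and validity window are illustrative assumptions, not Apache's exact implementation:

```python
import hashlib, time

SECRET = b"server-private-key"       # illustrative server secret

def make_nonce(realm, now=None):
    """Build timestamp=SHA1(timestamp + realm + secret); the timestamp
    stays readable so the server can later check the nonce's age."""
    ts = str(int(now if now is not None else time.time()))
    digest = hashlib.sha1(ts.encode() + realm.encode() + SECRET).hexdigest()
    return f"{ts}={digest}"

def nonce_is_valid(nonce, realm, max_age, now):
    """Recompute the keyed hash and check the timestamp window; a forged
    or expired nonce fails either test."""
    ts, _, digest = nonce.partition("=")
    expected = hashlib.sha1(ts.encode() + realm.encode() + SECRET).hexdigest()
    return digest == expected and 0 <= now - float(ts) <= max_age
```

The design choice here is statelessness: the server need not remember issued nonces, because only the holder of the secret can produce a hash that matches the embedded timestamp.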
2. Allegro RomPager 4.05. In the RomPager web server [10], DAA is implemented without mutual authentication and integrity protection. There exists the option StrictDigestAndIp, where the validity of the unique nonce is time limited and never more than one IP address is granted access at a given instant. This feature is appropriate in embedded systems because it prevents replay attacks. RomPager makes a full uri check. In addition, it is able to cope with the less secure uri processing of the IE. Furthermore, RomPager assumes that requests received via HTTP 1.0 originate from a browser not supporting digest. As a consequence, DAA does not work with IE and Opera when connected via a proxy. Therefore, Opera needs to be configured without proxies. On IE the "Use HTTP 1.1 through proxy connections" option can be set. The nonce is generated using the time, the server IP address, the previous nonce (if there was one), and the server name:
nonce = MD5(Time:Server-IP-Address:[previous nonce:]Server-Name)
3. GoAhead 2.1.2. GoAhead is a free and open-source server, developed for embedded devices on a variety of platforms. Neither mutual authentication nor integrity protection is supported. A given nonce never expires and is never checked. Hence, there is no protection from replay attacks. The server removes the parameters of a GET query request in the digest uri field. This prevents Mozilla from being compatible with those types of requests.
nonce = MD5(RANDOMKEY:timestamp:myrealm)
45.6.2 Browsers
All three browsers tested here have their strengths and weaknesses. IE is the only one using a different prompt for basic and digest authentication (see Figure 45.3 and Figure 45.4). Being made aware of this difference, an attentive user can recognize a man-in-the-middle attack (see Section 45.5.3). However, IE removes GET arguments in the uri. Mozilla is an open-source browser, and thus very easy to modify. On the other hand, the current implementation is not very user friendly (slow and continually asking for the username and password). Opera is the strongest one in terms of security and the only one supporting
TABLE 45.3 Compatibility of Client and Server Implementations of DAA

Servers                  Mozilla 1.01/Netscape 7   IE 6.0.26   Opera 6.05
Apache 2.0.42 (win32)                              a           b
RomPager 4.05
GoAhead 2.1.2            a

a: Not working for GET with parameters.
b: Requires valid domain.
mutual authentication. Concerning DAA, a combination of the three products mentioned above would probably meet the features expected from a perfect DAA implementation. These are:
- Option to disable basic authentication.
- The user shall be notified by some visual indication that DAA is used, when he is prompted for username/password, and also during browsing.
- Support of DAA with mutual authentication, with some displayed indication that the server has been authenticated, and the possibility to refuse pages if the server has not been authenticated.
- Support of DAA with integrity protection, with a visual indication that it is used.
- Verification that the URI requested with DAA is in the protection space.
45.6.3 DAA Compatibility
Table 45.3 summarizes the compatibility of different clients versus different servers. Note that the compatibility tests did not include server authentication and data integrity.
45.7 Conclusions
Digest Access Authentication is a lightweight, yet efficient way of providing user authentication. Applications running on top of HTTP can benefit from the services of DAA. Typically these applications are web services like WebDAV and HMI applications in automation systems, migrating from proprietary communication protocols toward TCP/IP technology. Wherever basic authentication is still in use and not protected by a security protocol of a lower layer, it should be replaced by DAA. For embedded web server applications, it is urgent that browser and web server vendors implement mutual authentication and integrity protection, as these services are required to achieve a high security level. Where confidentiality is not required, an implementation of DAA including all features defined in the RFC [3], namely mutual authentication and integrity protection, would provide sufficient, yet lightweight, security for embedded systems.
Appendix: A Brief Review of HTTP
HTTP is widely used to exchange text data across different platforms over a TCP/IP network. The definition of HTTP 1.1 is given in [11]. HTTP is based on standard request/response messages transmitted between a client (browser) and a web server. An example of a typical HTTP handshake is depicted in Figure 45.A1. The procedure is straightforward:
1. The browser sends a GET request to the server, indicating the requested resource.
2. The server responds with a 200 OK message along with the document requested.
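The two steps can be mimicked in a few lines. The sketch below merely composes a request of the form shown in Figure 45.A1 and parses the status line of a reply; the host and path are example values:

```python
def build_get(path: str, host: str) -> bytes:
    """Compose a minimal HTTP/1.1 GET request (step 1 of the handshake)."""
    return (f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n"
            "\r\n").encode("ascii")

def parse_status_line(response: bytes):
    """Extract (version, code, phrase) from the server's status start-line
    (step 2), e.g. b'HTTP/1.1 200 OK\r\n...' -> ('HTTP/1.1', 200, 'OK')."""
    version, code, phrase = response.split(b"\r\n", 1)[0].split(b" ", 2)
    return version.decode(), int(code), phrase.decode()
```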
Client to server (1):
GET /simple.html HTTP/1.1
Host: 192.168.0.3
User-Agent: Mozilla/5.0 (...) Gecko/20020530
Accept: text/html (...)
Accept-Language: en-us, en;q=0.50
Accept-Encoding: gzip, deflate, compress;q=0.9
Accept-Charset: ISO-8859-1, utf-8;q=0.66, *;q=0.66
Keep-Alive: 300
Connection: keep-alive

Server to client (2):
HTTP/1.1 200 OK
Date: Sun, 29 Dec 2002 15:21:13 GMT
Server: Apache/2.0.39 (Win32)
Last-Modified: Sun, 29 Dec 2002 15:05:13 GMT
ETag: "524d-fa-4921304a"
Accept-Ranges: bytes
Content-Length: 250
Keep-Alive: timeout=15, max=100
Connection: Keep-Alive
Content-Type: text/html; charset=ISO-8859-1

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>Test page</title>
<meta http-equiv="content-type" content="text/html; charset=ISO-8859-1">
</head>
<body>
This is a simple HTML page
</body>
</html>

FIGURE 45.A1 Example of an HTTP message exchange (each message consists of a header; the server's reply additionally carries the HTML document as its entity).
Using Figure 45.A1, the key aspects of HTTP are briefly explained. An HTTP message consists of a header and, in most cases, an entity. Note that, while all data in the header is transferred as ASCII text, the entity might contain non-ASCII data, for example, .jpg files.
Header
A header is composed of a start-line and header-fields.
Start-line
There are two types of start-lines:
A request start-line is of the form Method URI HTTP-Version, for example,
GET /simple.html HTTP/1.1
A status start-line has the form HTTP-Version Status-Code Phrase, for example,
HTTP/1.1 200 OK
Header-fields
Header-fields give various information, including date, language, and security information. In this document, the security-relevant header fields were discussed in detail. Examples:
Host: 192.168.0.3
Date: Sun, 29 Dec 2002 15:21:29 GMT
WWW-Authenticate: Digest realm="abc",...
Method
The most relevant methods are GET and POST.
GET: The most common method to request a document. The GET request can also include
arguments in the URI. This is widely used to transmit commands from the client to the server.
POST: Method used to send information to the server, usually from a web page form.
URI
Uniform Resource Identifiers [12]. URIs in HTTP can be represented in absolute form or relative to
some known base URI. Example:
http://www.test.ch/simple.html or /simple.html.
Entity
The rest of the message, for example, an HTML document.
Acknowledgment
The authors would like to thank Emanuel Corthay from the Swiss Federal Institute of Technology Lausanne for his valuable contribution in the examination of the individual implementations.
References
[1] J. Slein, F. Vitali, E. Whitehead, and D. Durand, Requirements for a Distributed Authoring and Versioning Protocol for the World Wide Web, RFC 2291, 1998.
[2] E. Whitehead, A. Faizi, S. Carter, and D. Jensen, HTTP Extensions for Distributed Authoring, WebDAV, RFC 2518, 1999.
[3] J. Franks, P. Hallam-Baker, J. Hostetler, S. Lawrence, P. Leach, A. Luotonen, and L. Stewart, HTTP Authentication: Basic and Digest Access Authentication, RFC 2617, June 1999.
[4] M. Naedele, IT Security for Automation Systems: Motivations and Mechanisms, atp, Vol. 45, pp. 84–91, 2003.
[5] T. von Hoff and M. Crevatin, HTTP digest authentication in embedded automation systems. In Proceedings of the IEEE International Conference on Emerging Technologies for Factory Automation (ETFA'03), Vol. 1, pp. 390–397, 2003.
[6] W. Stallings, Network Security Essentials: Applications and Standards, Prentice-Hall, New York, 2000.
[7] B. Schneier, Applied Cryptography: Protocols, Algorithms, and Source Code in C, John Wiley & Sons, New York, 1996.
[8] RomPager Secure Programming Reference, Version 4.20, Allegro Software Development Corporation, Boxborough, MA, 2002.
[9] E. Cole, Hackers Beware, New Riders, 2002.
[10] RomPager Web Server Engine Porting & Configuration, Allegro Software Development Corporation, Boxborough, MA, 2000.
[11] R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, P. Leach, and T. Berners-Lee, Hypertext Transfer Protocol HTTP/1.1, RFC 2616, June 1999.
[12] T. Berners-Lee, R. Fielding, and L. Masinter, Uniform Resource Identifier (URI), RFC 2396, August 1998.
[13] R.J. Anderson, Security Engineering: A Guide to Building Dependable Distributed Systems, John Wiley & Sons, New York, 2001.
Intelligent Sensors
46 Intelligent Sensors: Analysis and Design
Eric Dekneuvel
46
Intelligent Sensors: Analysis and Design

Eric Dekneuvel
University of Nice at Sophia Antipolis

46.1 Introduction
46.2 Designing an Intelligent Sensor
    Analysis • The External Model • Functional Decomposition of a Service • Sensor Architectural Design
46.3 The CAP Language
    Description • Illustration • Implementation
46.4 Conclusion
Acknowledgment
References
46.1 Introduction
Today, thanks to advances in numerical processing and communications, more and more functionalities are embedded into distributed components charged with providing the right access to these services. Complex systems are then seen as a collection of interacting subsystems embedding control and estimation algorithms. The inherent modularity concept behind this approach is the key answer to the increasing complexity of the systems, and this has led to the definition of new models and languages for the formal specification of the components [1]. In this chapter, we are more particularly interested in intelligent sensors, components associating computing and communication devices with sensing functions [2]. In order to reduce the complexity, the design of an intelligent sensor requires a model of the sensor at a high level of abstraction from the implementation. The disparity of the knowledge encapsulated inside the instrument renders the modeling process very sensitive to the modeling strategy adopted and to the models used. A real-life component like the intelligent instrument usually involves the cooperation of three kinds of programs [3]:
- A level of data management to perform transformational tasks.
- One or more reactive kernels to compute the outputs from the logical inputs, selecting the suitable reaction (computations and output emissions) to incoming inputs.
- Some interfaces with the environment to acquire the inputs and process the outputs. This level includes interrupt management, input reading from sensors, and conversion between logical and physical inputs/outputs. Communication with the other components of the system will also be managed at this level.
Data management covers research fields such as probability theory, possibility theory, measurement theory, and uncertainty management. Unlike a numeric sensor that provides an objective quantitative description of objects, a symbolic sensor provides a subjective qualitative description of objects [4]. This qualitative description, adapted to the sensor measurement, can be used in Knowledge-Based Systems (KBS), checking the validity of a measurement or improving the relevance of a result [5]. The reactive part is probably the most difficult part of the design of the intelligent sensor. Like all reactive systems, the intelligent sensor must continuously react to its environment at a speed determined by this environment. This often involves the ability to exhibit a deterministic behavior, to allow concurrency, and to satisfy strict real-time requirements.
A generic intelligent sensor model has been developed to help during the specification step of the sensor functionalities [6]. The purpose of the intelligent sensor generic model is to provide a high level of abstraction of the implementation of the sensor, focusing on the fundamental characteristics that the sensor must exhibit. For this, the generic model uses the point of view of the user to describe the services and the operating modes in which the services are available [2]. Then, by using a language to compute the formal description, we are in a position to evaluate the component from a static and/or dynamic point of view. Once the component is validated, a prototyping step can be launched in order to obtain an operational system prototype. This prototyping step being usually expensive in time and resources, the final implementation should be made as much as possible using automatic synthesis from the high-level description, to ensure implementations that are correct by construction [7].
In this chapter, after reviewing the main characteristics of the generic intelligent sensor formal model, we present an implementation of the model provided by the CAP language, a language specifically developed for the design of intelligent sensors.
46.2 Designing an Intelligent Sensor
46.2.1 Analysis
As stated earlier, the diversity of the embedded functions, the flexibility, and the reuse argue for a distribution of the functionalities inside a complex system into areas of responsibility [8]. From an external viewpoint, an intelligent sensor will be considered as a modular unit behaving as a server. As such, it will be designed to offer its customers (the operator, other instruments, or other modules) access to the various functionalities encapsulated inside the sensor.
Let us consider a simple example. A navigation system has to be designed using a closed loop on a surface like a wall to control the locomotion. As can be seen in Figure 46.1, the environment of the system to be designed exhibits various entities or actors, such as the axes, the operator, and the obstacles. At every occurrence of a start_moving request, a closed loop is activated until a new request like stop_moving is emitted by the operator. Once the links between the system and the environment are defined, the functional specifications can be established. For this, a dataflow diagram (see Figure 46.2) can be easily defined by identifying the data necessary for the navigation goals: a position measurement value useful for the closed loop to compute the values of the speed that are to be applied on the various axes.
Suppose we now decide to include another activity that will enable the system to follow a predefined trajectory. In this way, the operator (or a high-level decisional system) is provided with the possibility to choose between two methods according to the current context of the execution. This trajectory execution activity can be interested in knowing whether unexpected obstacles are met along this trajectory, in order to stop the system before striking one of the obstacles. If both functionalities, the obstacle detection and the computing of the position, use the same physical resource (a set of ultrasonic sensors, for example), it is better to encapsulate them for homogeneity inside a subsystem in charge of the physical resource.
The intelligent sensor module interacts with the environment through several messages. This is the interface of the module. The structure of a message (its signature) is generally limited at this level to a list of parameters such as the sender identity, the communication medium used, and the contents. We
FIGURE 46.1 Context diagram of the application (actors: Operator, Axes, Obstacles; flows around the navigation system: Start_moving, Stop_moving, Excitation, Speed measurements, Speed values, Easy data).
FIGURE 46.2 A global dataflow diagram (activities: Trajectory execution, Obstacle detection, Position counting, Wall following; flows: Control, Unexpected obstacle, Position, Speed instruction).
usually make a difference between messages for data communication and messages for control. Control messages enable the customers to communicate with the sensor in a bidirectional way, using a client-server protocol (see Figure 46.3). The customer requests the launching of an activity through the request link. The customer receives an identification number for its demand and can be informed about the status of the request (activity launched, terminated, etc.) by means of another message of control (reply).
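This request/reply protocol can be sketched as follows; the class and status names are invented for illustration and are not prescribed by the generic model:

```python
import itertools

class SensorServer:
    """Sketch of the client-server control protocol: a request returns an
    identification number for the demand, and reply messages report the
    status of that activity (launched, terminated, ...)."""

    def __init__(self):
        self._ids = itertools.count(1)   # identification numbers for demands
        self._status = {}

    def request(self, service: str) -> int:
        """Customer asks for an activity; the id of the demand is returned."""
        req_id = next(self._ids)
        self._status[req_id] = "launched"
        return req_id

    def reply(self, req_id: int) -> str:
        """Control message informing the customer about the request status."""
        return self._status.get(req_id, "unknown")

    def terminate(self, req_id: int):
        self._status[req_id] = "terminated"
```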
To be effective, the intelligent sensor interface description must be complemented by the behavioral description of the module. While the structural viewpoint describes the internal organization of a complex system, the behavioral viewpoint will express all the information that characterizes the module to be designed from an external viewpoint [9]. A generic model of an intelligent instrument has been developed for this purpose, using the concept of external services to qualify the set of operations offered to the outer entities. Reference 10 gives the following definition:
Definition 46.1 From an external point of view, a service is the result of the execution of a treatment, or a set of treatments, for which one can provide a functional interpretation.
In other terms, the execution of a service typically results in the production of output values according to input values consumed by the execution of a processing. The services are not limited to measurement aspects. The set of services covers a large spectrum of functionalities that we can expect from intelligent sensors. Intelligent sensors must be configured, calibrated, and enabled, so that they can provide their measurements to the rest of the system. Selecting a particular sensor, a power supply, a reference voltage, and a sampling frequency are common examples of configuration services that can be used to
FIGURE 46.3 The intelligent sensor interface (the intelligent sensor component acts as a server offering position computing and obstacle detection; the wall following and pilot components are customers connected through request/reply links and input/output data).
set the value of these parameters. Processing embedded inside services can be as simple as the acquisition of a value but, usually, it involves more complex treatments such as signal processing (to improve, for example, the resolution of a given value), data processing, validation of the measurement, and so on.
In the generic model of an intelligent sensor, a service will consequently be modeled by two sets of
parameters:
External: That is, how the service communicates with other services. The services are gathered into
User Operating Modes.
Internal: That is, how the external service is decomposed into internal basic processing units.
Let us examine in detail both aspects successively.
46.2.2 The External Model
Figure 46.4 depicts the external model of a service in use. A service is mainly described by the input/output data and is triggered by an external event. The data and events used can be organized into classes, with the description of a set of characteristics such as the format, the accuracy, the refresh period, etc., for each data class. The description of the input and output behaviors exhibits the possible interconnections between the service and those that precede or follow this particular service. This is then a data-driven representation of the service relationships, equivalent to an explicit representation, with the advantage of being more efficient and general (new services can be added without being obliged to physically interconnect the entire system).
Control parameters are received from the parent activity that requests the service. The control parameters affect the modalities of processing and the modalities of the underlying sensor the service might encapsulate. They are usually passed in conjunction with the service activation request. For example, one can easily imagine the obstacle detection service running in a cyclic shooting modality or in a single shooting modality, depending on the nature of the activity that requests the service.
The launching of a service can be conditioned by the verification of its activation conditions. The distinction between a request and a condition is that the request for a service is emitted by the user, while the condition is processed by the system. Those conditions are often related to the access rights or to the security aspects and induce the verification of the origin of the request, the mode of transmission used, etc.
FIGURE 46.4 Graphic model for representing a service (inputs: request, conditions, control parameters, input data, and resources; output: output data).
Resources include both hardware (sensor, CPU, memory, etc.) and software (extended Kalman filter, etc.). Input and output data can also be considered as resources, with the problem of data obsolescence.
Other properties can be added, such as time and complexity measures, that can help to select between different methods.
The set of services can easily be assimilated to the set of instructions we find in a regular computer. To cope with the various states of the sensor that can occur during its life (out of order, in configuration, in manual exploitation, in automatic exploitation, etc.), the different external services are organized into coherent subsets called User Operating Modes (USOMs). In the model, a sensor service can be requested, and thus accepted, only if the current active USOM includes this service. This prevents the request of services when they cannot be available. According to Reference 10:
Definition 46.2 An external mode is a subset of the set of external services included in the intelligent instrument. An external mode includes at least one external service and each service is included in at least one external mode.
The operating modes can easily be described with respect to a labeled transition system, where the label is the external event matching a request for commuting the current mode (see Figure 46.5). Moreover, in each USOM, a notion of context may exist, where the context is the subset of the services that are implicitly requested as long as the system remains in the given USOM. The external services included inside a USOM are supposed to behave independently. We say that they belong to orthogonal regions of a state, sometimes termed constellations. As an example, the external services inside the intelligent sensor of the navigation system could be structured into a wall following or an execution trajectory mode. The position computation and the obstacle detection services would be implicitly executed when entering the corresponding mode. If the sensor reveals a complex state space, it can be decomposed into nonoverlapping substates to reduce the complexity. For example, the wall following and execution trajectory states can belong to a more general measuring state, often called a macro-state or a super-state, belonging to an active macro-state, etc. Some properties that the design must satisfy, and that can be checked against the functional specifications, have been elaborated. They complement the formal model. For example, properties may express some axioms such as:
1. An external mode is a nonempty set of external services.
2. Each external service belongs to at least one external mode.
3. In an intelligent instrument, the set of disconnected vertexes in the state-transition diagram is empty, that is, there is no disconnected external mode.
4. A transition between two modes is unique in the graph.
5. Each external mode must be reachable and each external mode can be left.
6. etc.
FIGURE 46.5 The concept of USOMs (two modes, USOM1 and USOM2, each grouping external services x and y, connected by the transitions T12 and T21).
These properties must be verified to guarantee a safe production of the intelligent instrument. For example, the verification of property 5 can easily be done by checking that the fan-in and fan-out degrees of every vertex of the graph are not equal to zero. An incidence matrix can help in doing this.
The external viewpoint of the intelligent instrument will usually be complemented with a second level of description, to capture the algorithmic flow. This level must exhibit the treatments that concur to the global functionality, usually called internal services.
46.2.3 Functional Decomposition of a Service
Complex operations often need to be decomposed into multiple primitive operations in order to produce the overall behavior. For example, an external measurement service can induce a very complex treatment, probably following a step of initialization and, for self-terminated services, followed by a step of termination. So, Definition 46.1 is usually complemented with the following definition [11]:
Definition 46.3 An external service is the result of the execution of internal services.
From the viewpoint of the designer, an internal service is an elementary operation, possibly extracted from a library of components, for which no further decomposition is needed. Its I/O behavior can be easily described through an algorithm. Depending on the area of application, such a conceptual unit can be known under various appellations, such as a module, a codel [12], etc.
The functional decomposition of a complex external service into internal services clearly has the following advantages:
- Structured programming helps the designer to describe the different steps of the treatment without using internal state variables.
- Transitions from one step to the other explicitly define the possible interruption points of the service. Between these points, the operation is considered to be an atomic transaction. This protects the functionality from coherency problems (loss of data and so on).
- Reuse of common units of programming: they are common to different services or are the result of previous developments. For example, an obstacle detection service and a position computing service can share a lot of common portions of code: the signal emission, the signal acquisition, and so on. These units can be part of a library of Intellectual Property (IP) modules.
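As an illustration of such a decomposition, the obstacle detection service could be written as a sequence of internal services separated by interruption points. All names, readings, and thresholds below are invented for the example, and the scheduling style is an assumption rather than the CAP language described later:

```python
def obstacle_detection_steps():
    """Return the internal services of the example service, in order.
    Each function is atomic between two interruption points."""
    def signal_emission(state):
        state["emitted"] = True                  # drive the ultrasonic transducer
        return state
    def signal_acquisition(state):
        state["echo"] = 42                       # invented echo delay reading
        return state
    def decision(state):
        state["obstacle"] = state["echo"] < 100  # invented detection threshold
        return state
    return [signal_emission, signal_acquisition, decision]

def run_service(steps, stop_requested=lambda: False):
    """Run internal services in sequence; between two steps the service
    may be interrupted (e.g., by a stop_moving request), leaving a
    coherent partial state behind."""
    state = {}
    for step in steps:
        if stop_requested():
            return state                         # interruption point reached
        state = step(state)
    return state
```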
As the reader can see, there is no mention at this point of the nature of the realization. Design units can be implemented on customized or on software processors. The hardware units can also be freely implemented in the discrete domain using FPGA (Field Programmable Gate Array) or DSP (Digital Signal Processor) components, or by using analog components in the continuous domain. In order to reduce the complexity of the design, the definition of the executive architecture (a problem known as the partitioning problem)
2006 by Taylor & Francis Group, LLC
Intelligent Sensors: Analysis and Design 46-7
[Figure: an input data acquisition step, a processing step, and an output data emission step, activated by external and internal events.]
FIGURE 46.6 Example of a functional decomposition.
must be postponed until a detailed design step, taking into account various design constraints such as
economic and real-time constraints. The internal description of a complex service can be expressed using
an activity diagram. Activity diagrams are well suited to showing algorithmic details or the procedural flow
inside the service. They are often compared with flowcharts but are more expressive. Such a diagram
describes the internal operations to be performed on the incoming flow and their temporal dependencies.
Depending on the complexity of the service, the detailed refinement of an activity can be performed
over several successive hierarchical levels. The processing step in Figure 46.6 can, for example, be
refined into a feature extraction step, followed by a data classification step and a final decision step in
sequential order. Elementary operations, those for which no refinement is needed, will be described by
their internal behavior, usually through an algorithm. The activation of the internal services is controlled
by internal events.
Definition 46.4 An internal event in an intelligent instrument is an event that is produced and consumed by
the instrument itself.
The producer gives birth to the event. Consumers can react to this event in order to start their processing.
The activation of an operation will often depend on the completion of the previous operation, but more
complex temporal dependencies will also frequently occur, such as the activation of a signal processing
operation conditioned on the end of an external conversion operation. In this way, an external service
can itself be in the position of a client of another component, dynamically starting an external activity
to request data necessary to the achievement of its mission. While an external event is associated with
a unique external service, an internal event can be associated with several internal services, leading to
an n-producer/m-consumer relationship. In this, activity diagrams differ from conventional flowcharts
in their capacity to represent concurrency. In Figure 46.7, the execution of the internal service V
is followed by the simultaneous execution of the internal services W and Y.
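A minimal C sketch of such a producer/consumer internal event might look like the following; firing the event activates every internal service subscribed to it, as when the end of service V starts services W and Y. The structure and names are illustrative, not part of any actual CAP runtime:

```c
#include <stddef.h>

#define MAX_CONSUMERS 8

typedef void (*internal_service_t)(void);

/* An internal event with its list of consumers. */
typedef struct {
    internal_service_t consumers[MAX_CONSUMERS];
    size_t n_consumers;
} internal_event_t;

int subscribe(internal_event_t *ev, internal_service_t s)
{
    if (ev->n_consumers >= MAX_CONSUMERS)
        return -1;                         /* subscription table full */
    ev->consumers[ev->n_consumers++] = s;
    return 0;
}

/* Producing the event activates all consumers (sequentially here; a real
 * executive could schedule them concurrently). */
void produce(const internal_event_t *ev)
{
    for (size_t i = 0; i < ev->n_consumers; i++)
        ev->consumers[i]();
}

/* Services W and Y simply record that they were activated. */
int w_ran, y_ran;
void service_w(void) { w_ran = 1; }
void service_y(void) { y_ran = 1; }
```

Subscribing both service_w and service_y to the event produced at the end of V yields exactly the 1-producer/2-consumer pattern of Figure 46.7.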
Expressing sequential and parallel compositions of treatments is not always sufficient. The execution
of a service can be affected by the state of the resources. The concept of version has been created with
the aim of providing alternative versions of treatments that enable the service to operate under
nonnominal conditions. This is a means of taking the fault tolerance problem into account. All the versions
of a given service share the same request and produce the same output, but the inputs, procedures,
and resources differ from one version to another. For example, a measurement service uses two
transducers in its nominal mode to compute a data value using a sophisticated data analysis
method. If a defect is detected on one of the transducers, the measurement service can continue to operate
using a subset of the features extracted from the input data; of course, the quality of the result will
decrease. The versions are typically ranked and classified into internal modes, such as the nominal mode
and the degraded mode. The management of the versions of a service can be straightforward: at time t
[Figure: internal services U, V, W, Y, and Z composed along a nominal path and a degraded path.]
FIGURE 46.7 Version and internal mode concept.
when the request for service is emitted, the version to be carried out will be the one with the lowest rank
whose resources are all nonfaulty [13]. As for the USOMs, the description of the internal modes can be
done using a state diagram.
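This lowest-rank selection rule can be sketched in C as follows; the version table and resource flags are hypothetical, with rank 0 standing for the nominal mode and higher ranks for degraded modes:

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_RES 4

/* A version of a service: its rank and the resources it needs. */
typedef struct {
    int rank;                 /* 0 = nominal, higher = more degraded */
    int resources[MAX_RES];   /* ids of the resources this version uses */
    size_t n_resources;
} version_t;

/* faulty[i] != 0 means resource i is currently detected as faulty. */
bool version_runnable(const version_t *v, const int *faulty)
{
    for (size_t i = 0; i < v->n_resources; i++)
        if (faulty[v->resources[i]])
            return false;
    return true;
}

/* Assumes versions[] is sorted by increasing rank; returns the index of
 * the selected version, or -1 if no version can run at all. */
int select_version(const version_t *versions, size_t n, const int *faulty)
{
    for (size_t i = 0; i < n; i++)
        if (version_runnable(&versions[i], faulty))
            return (int)i;
    return -1;
}
```

With a two-transducer measurement service, a fault on the second transducer would make select_version skip the nominal version and fall back to the degraded one.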
Having reviewed the generic formal model of the intelligent sensor, let us now turn to some validation
aspects of the sensor.
46.2.4 Sensor Architectural Design
We have seen the mathematical properties underlying the intelligent sensor's generic model of computation.
These properties can be efficiently used to answer questions about system behavior without carrying
out expensive verification tasks. The formal validation generally uses an automata-theoretic approach,
modeling the formal description by a Finite-State Machine (FSM) and the language of the automaton
[14]. As stated earlier, the final implementation of the intelligent sensor should be produced, as much as
possible, by automatic generation from the generic model, to ensure implementations that are correct
by construction. For example, the protocol defined at the high level of abstraction uses the concept
of message passing, where a message is an abstraction of data and/or control information passed from
one component to another. Various mechanisms (the message signature) can be envisioned at a lower
level of definition, including a function call, an interruption, an event using a Real-Time Operating
System (RTOS), an Ada rendezvous, or a Remote Procedure Call (RPC) in a distributed implementation.
Consequently, a prototype of the sensor is also a useful means to validate the specification in the presence
of the real-time inputs, with physical characteristics similar to those of the final implementation,
which will be produced by the synthesis stage. Rapid prototyping aims at analyzing the performance of
an implementation, validating its capability of satisfying hard real-time constraints, etc. To do so, the key
technologies are the use of software synthesis, hardware synthesis, and the synthesis of interfaces between
software and hardware using programmable components.
The prototype to be generated will be highly dependent on the physical architecture selected and on the
physical communication links. The targeted architecture is strongly dependent on the cost of components
and production. Consequently, as shown in Figure 46.8, there are some nontrivial aspects to be
analyzed in order to be in a position to produce a prototype:
- The definition of the hardware/software architecture (partitioning, mapping).
- The sequencing of the software on each software processor (scheduling).
[Figure: design flow from the formal model of the intelligent sensor, through formal validation, then partitioning, mapping, and scheduling, then hardware/software synthesis, to the prototype of the intelligent sensor.]
FIGURE 46.8 Design flow of an intelligent sensor.
[Figure: an input interface (transducers T1-T3 through amplifiers, a multiplexer, and an ADC), an output interface (a DAC, a multiplexer, and outputs I1-I3), a software processor (microprocessor with EPROM and RAM), and a communication interface, all connected by a system bus.]
FIGURE 46.9 Typical model of an architecture for an intelligent sensor system.
Figure 46.9 shows a typical hardware/software architecture for an intelligent sensor. We can observe the
hybrid character of intelligent sensors, which mix analog and digital components. Each
component description can be refined to exhibit the detailed architecture of a component. In the
figure, we can see an architecture organized around a microprocessor that processes the functionalities
for which it is in charge. This processor can be a DSP, a processor whose CPU is customized for
data-intensive operations such as digital filtering. Bidirectional communication is ensured through various
means: a serial link, a CAN (Controller Area Network) bus, an Ethernet link, etc. [15]. Finally,
memories (ROM, RAM, etc.) store the information located inside the sensor. In
the future, the hardware architecture will tend to combine more and more customized hardware with
embedded software. The definition of a hardware/software architecture involves checking whether the sensor
is schedulable, that is, whether all the performance requirements can be guaranteed. A deadline (a point in
time, or a delta-interval, by which a system action must occur) is an example of a requirement that,
when missed, constitutes an erroneous computation. Consequently, the definition of hardware/software
architectures generally requires more complex modeling, by defining the external timing requirements of
the messages. The Quality-of-Service (QoS) requirement of a message can be expressed using different
means. For example, the response timing can be defined in terms of timeliness requirements (typically,
deadlines) [16].
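As a trivial illustration, a timeliness requirement of this kind can be checked by comparing a response's completion time against its release time plus its relative deadline; this is a hypothetical sketch, not an actual QoS API:

```c
/* A timeliness requirement: the response must complete no later than
 * release_ms + deadline_ms, or the computation is treated as erroneous. */
typedef struct {
    long release_ms;    /* instant at which the request was emitted */
    long deadline_ms;   /* relative deadline of the response */
} timing_req_t;

/* Returns 1 when the deadline is met, 0 otherwise. */
int deadline_met(const timing_req_t *r, long completion_ms)
{
    return completion_ms <= r->release_ms + r->deadline_ms;
}
```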
Assigning an execution order to concurrent modules and finding a sequence of instructions implementing
a functional module are the primary challenges in software organization. These can be nontrivial
issues to deal with, particularly when one must consider the performance requirements as well as the
functional requirements of the system. The software implementation can be facilitated by the use of real-time languages
and their underlying executive kernel. Such languages provide a style of programming that enables the
manipulation of events and/or state changes, with constructs expressing the behavior through parallelism,
synchronization, etc. Selecting a language to express the model is not straightforward, with
several possibilities: developing a dedicated language or using an existing one, using
a graphical or a textual input form, etc. The domain of applicability of a language must also be carefully
studied, with basically two approaches. The synchronous approach states that time is a sequence of instants
between which nothing interesting occurs [17]. In each instant, some events occur in the environment
and a reaction is computed instantly by the modeled design. This means that computation and internal
communication take no time. This hypothesis is very convenient, as it allows modeling the complete system
as a single FSM with a completely predictable behavior. ESTEREL [18,19] and its graphical expression form,
SYNCCHARTS [20,21], are representatives of synchronous imperative programming languages. As in all
languages specialized for control-dominated systems programming, data manipulation cannot be done
very naturally. In the asynchronous approach, as in ELECTRE [22], events are observed and processed
immediately. This approach enhances the expressive power of the language. Moreover, the design can be
more efficiently implemented on heterogeneous hardware/software architectures. On the other hand, timing
constraints are difficult to check.
46.3 The CAP Language
46.3.1 Description
The generic intelligent sensor model gave birth to a new language, the CAP language [23], which belongs
to the category of asynchronous languages. Its ability to provide a rapid prototype of the
intelligent sensor model on common, commercially available microcontrollers has been one of the main
reasons for developing this language. The developers have consequently limited the implementation to
monoprocessors and to sequential operation.
The CAP language is an incomplete language in the sense that it specifies only the interaction between
computational modules (internal services), and not the computation performed by the modules. An
interface with a host language specifies the behavioral contents of such units through C instructions.
Like every conventional language, its grammar can be described using a Backus-Naur Form (BNF), also
known as a metalanguage. Figure 46.10 shows the formal grammar defined by Reference 10. Reference 11
complemented the language with the possibility of declaring internal services. As can be seen, the two parts
of the model are normally expressed in succession. The interface of the instrument follows the metavariable
instrument with:
- A number referring to the instrument as a node in the network.
- Variables that can be exported or updated on the network.
- Communication links imported or exported to implement the command protocol.
The expression of the graph of the external modes is achieved through the set of vertices and the set of
transitions. Each transition is then described by declaring the input vertex, the output vertex, and the
communication link that is the source of the transition event. Variables of the imported or exported lists are defined
according to a C type. Finally, the expression of a list of external modes in the definition of a service
FIGURE 46.10 An overview of the formal grammar. (From J. Tailland, PhD thesis. With permission.)
can be noted. These modes are the only ones in which the service can be launched. The description of
the internal model is close to the external one and can be easily understood. As shown in Figure 46.11,
the principle of operation of the compiler is relatively straightforward: after the lexical and syntactical
analyses, a set of lists reflecting the formal model is produced by the code generator. After being generated,
the code is compiled and linked with the user-defined libraries and the processor-dependent startup code.
The generated data structures are split into two sets of lists: the first set contains the names of the objects and
is subdivided into eight lists: external modes, internal modes, external services, internal services, variables,
export links, import links, and internal events. The second set contains the detailed description of each
transition. The conformity of the description to the formal model will be analyzed using these lists. They
also contribute to providing a level of abstraction between the software synthesizer and the real
machine.
[Figure: the CAP compiler translates the source of the intelligent sensor into intermediate data structures; a C compiler then links them with the input/output libraries and the kernel to produce the executable.]
FIGURE 46.11 The CAP design flow. (From J. Tailland, PhD thesis. With permission.)
Let us illustrate these principles with an intelligent instrument designed to process ultrasonic sounds in
order to produce a distance measurement value.
46.3.2 Illustration
To illustrate the approach, let us take the following example [11]: a measurement system is composed of
one ultrasonic transmitter able to emit a signal toward a target in order to compute the distance between
this target and the sensor with the help of two receivers. Figure 46.12 shows the basic configuration of the
measurement system.
The intelligent instrument delivers the two measurements of distance d_1 and d_2, each of them with a
validation degree d_v1 and d_v2, respectively. As shown in Figure 46.13, the basic principle of the measurement
is the following: the ultrasonic emitter sends a sinusoidal waveform linearly modulated in frequency over the
ΔF = (f_max − f_min) interval, with a rate of variation of the frequency ΔF/T_r. The instantaneous
frequency F_received then undergoes an offset relative to the emitted frequency F_emitted of Δf = |F_emitted − F_received|.
Reference 24 has shown that the distance d can be determined by measuring the offset only inside the
interval [t_0, T_r], where its value is f_a. For that, the measurement is inhibited during the time t_a, where the
offset of the frequency has the value f_b.
The state space of the sensor is decomposed into two USOMs: the configuration mode (which is the
default USOM) and the measurement mode (see Figure 46.14). The transition between the modes is
triggered on the cantcp_in event.
A number of internal services have been specifically developed for this application:
- Initialization and configuration services enable the modification of the parameter values used for
the computation of d: duration of the slope T_r, slope p, time t_dead, and voltage u_min. They take
account of particular conditions of measurement (nature of the obstacle, environment, etc.).
- Internal services such as those depicted in Figure 46.15 contribute to the measurement elaboration;
for example, the internal service for the slope generation has the responsibility of sending the signal in
the direction of the obstacle; suppression of aberrant measurements filters and keeps the impulses close
to the median f_amoy (average of the frequencies measured over T_r), while the others are discarded.
- The Dynamic Packet Transport (DPT) service computes the pseudo-triangular distribution of the
possibilities.
[Figure: the sensor comprises a central transmitter flanked by two receivers; distances d_1, d_0, and d_2 separate them from the target.]
FIGURE 46.12 Configuration of the measurement system. (From J. Tailland, PhD thesis. With permission.)
[Figure: the emitted and received signals as frequency ramps of slope a between f_min and f_max over the period T_r; the frequency offset takes the value f_b before t_0 and f_a inside [t_0, T_r]; the dead time t_dead precedes the useful measurement interval.]
FIGURE 46.13 Principle for the measurement of f_a. (From J. Tailland, PhD thesis. With permission.)
Internal calculations are not very complex. For example, the distance computation service computes the
distance on every f_a input. By doing this, there are N_exp measurements of d carried out during the
interval T_r, using the following formula:

D = [f_a (T_r − t_dead) / (2 ΔF)] V   (46.1)

The internal service named validity estimation computes a value in the [0, 1] interval according to the
following rule:

D_v = N_exp / N_th   (46.2)

where N_th is the theoretical number of periods of f_a that are supposed to be observed by the instrument.
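Assuming Equation (46.1) takes the form D = f_a (T_r − t_dead) V / (2 ΔF), with V the wave speed, and Equation (46.2) the form D_v = N_exp/N_th, the two internal computations reduce to one line each; this is a numerical sketch with illustrative argument values, not production code:

```c
/* Eq. (46.1): distance from the measured frequency offset f_a.
 * t_r is the slope duration, t_dead the dead time, d_f the swept
 * frequency interval, and v the wave speed. */
double distance(double f_a, double t_r, double t_dead, double d_f, double v)
{
    return f_a * (t_r - t_dead) * v / (2.0 * d_f);
}

/* Eq. (46.2): validity degree in [0, 1] as the ratio of the number of
 * measurements actually carried out to the theoretical number. */
double validity(double n_exp, double n_th)
{
    return n_exp / n_th;
}
```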
[Figure: graph of the two USOMs (configuration, measure) with transitions on cantcp_in; the configuration mode offers services such as import_slope_duration, import_umin, import_wave_speed, import_slope, law_selection, and reset, while the measure mode offers distance_measurement and standard deviation. The corresponding CAP declarations are:

Mode configuration, measure.
Transition configuration to measure on cantcp_in.
Transition measure to configuration on cantcp_in.
Default mode configuration.]

FIGURE 46.14 Graph of the USOMs and the corresponding CAP declarations. (From J. Tailland, PhD thesis. With
permission.)
As shown, the declaration of the distance_measurement external service includes the request (on
cantcp_in) and the USOM (in measure) where it is available. The measurement functionality can be
carried out according to four internal modes: nominal, degraded1, degraded2, and critical. For example,
when the validity of the measurement d_1 (see Equation [46.2]) goes below a given threshold in the
nominal mode, the instrument can automatically switch to the degraded1 mode.
Figure 46.16 presents an excerpt of the synthesized code. As stated earlier, these symbolic data structures
will be used by the verification tool to check the properties of the formal model. The data structures (array
of services, etc.) will typically be stored inside a volatile memory (RAM or SRAM) of the hardware
architecture, while the automaton, not reachable by the user, will be set in a long-term memory (EPROM,
FLASH, etc.). Depending on the operating mode (development or exploitation), the user program can be
stored inside the volatile memory or set in long-term memory.
46.3.3 Implementation
The execution of the various services is handled by an automaton that has the responsibility of interpreting the
formal model. As shown in Figure 46.17, the execution machine runs a cyclic program that processes the
inputs, updating a FIFO (First In, First Out) queue storing the pending events. Then, depending on the nature
of the event, a transition is fired or a call to a procedure is made. Permanent functions are then
executed. The loop ends with the emission of the output messages. For each step, the detailed behavior is
the following:
1. Reading of a message on each communication link: if the intelligent sensor is concerned by the
message, it will be processed according to its type:
- In case of a data message, the corresponding variable is updated and the associated event is
triggered.
[Figure: the measurement functionality decomposes into a slope generation service followed, for each of the two receiving channels, by acquisition, suppression of aberrant measurements, translation in accumalized units, distance computation, standard deviation, DPT, and validity estimation services, ending with an average computation; get_* services export the results. The corresponding CAP declarations include:

Service distance_measurement on cantcp_in in measure
{uses slope_generation, acquisition 1, distance_computation 1, ... }
Iservice acquisition 1 on end (slope_generation)
{/* C code of the internal service */}]

FIGURE 46.15 Functional decomposition of the measurement functionality. (From J. Tailland, PhD thesis. With
permission.)
- In case of an event message, the corresponding event is inserted into the event FIFO for later
processing.
- In case of a query message, the current mode is exported on the output links.
- In case of a variable message, the variable is linked to import an external variable.
2. Event queue processing: three situations can arise and are analyzed in the following order:
- The event is a request for a change of the external mode. Provided this change is authorized, the
internal variable is updated and the new mode is exported on the output links.
- The same applies to an event corresponding to a request for a change of the internal mode.
- The event is an execution request for an external service. Provided the service(s) is (are) enabled
in the current mode, these services are executed using the procedural entry point.
- The same applies to an event corresponding to a request for executing an internal service. If
defined, an event signaling the end of execution is inserted into the event FIFO.
3. Execution of the EVER services: these are services that run permanently.
4. Diffusion of the messages on the communication links (exported variables, change of mode, etc.).
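The cyclic behavior of steps 1 to 4 can be sketched in C as follows; the FIFO size, the callback signatures, and all names are illustrative rather than taken from the CAP kernel:

```c
#include <stddef.h>

#define FIFO_SIZE 16

/* A bounded circular FIFO of pending events. */
typedef struct {
    int events[FIFO_SIZE];
    size_t head, tail, count;
} event_fifo_t;

int fifo_push(event_fifo_t *f, int ev)
{
    if (f->count == FIFO_SIZE) return -1;   /* queue full: event refused */
    f->events[f->tail] = ev;
    f->tail = (f->tail + 1) % FIFO_SIZE;
    f->count++;
    return 0;
}

int fifo_pop(event_fifo_t *f, int *ev)
{
    if (f->count == 0) return -1;           /* no pending event */
    *ev = f->events[f->head];
    f->head = (f->head + 1) % FIFO_SIZE;
    f->count--;
    return 0;
}

/* One cycle of the execution machine, mirroring steps 1-4. */
void automaton_cycle(event_fifo_t *pending,
                     void (*read_inputs)(event_fifo_t *),
                     void (*dispatch)(int),
                     void (*run_ever_services)(void),
                     void (*emit_outputs)(void))
{
    int ev;
    read_inputs(pending);                   /* 1. read the communication links */
    while (fifo_pop(pending, &ev) == 0)     /* 2. process the pending events   */
        dispatch(ev);
    run_ever_services();                    /* 3. run the permanent services   */
    emit_outputs();                         /* 4. diffuse the output messages  */
}
```

In this sketch, dispatch() is where mode changes and service activations of step 2 would be decided; the loop simply calls automaton_cycle() forever.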
FIGURE 46.16 Part of the synthesized code. (From J. Tailland, PhD thesis. With permission.)
[Figure: the automaton reads and processes the messages from the input link FIFO, reads and processes the events of the event FIFO, triggers the EVER services, and emits messages on the output link FIFO.]
FIGURE 46.17 Principle of operation of the automaton.
Note the reduced portability of the application, because of the tight link that exists between the application
and the hardware. Open software platforms (EmbeddedJava, JavaCard, etc. [25]) can be considered
attractive solutions to this problem, since they interpret the applications via a virtual machine. However,
the designer has to be aware of the substantial performance penalty that may be paid by adopting this solution.
46.4 Conclusion
In this chapter, a generic model for the design of intelligent instruments has been discussed. In the formal
model of the sensor, the specification of the external services is given according to a user's point of view
of the functionalities available inside the sensor. The external model uses the concept of the USOMs to
prevent the activation of services that are not available during the current external mode. Internal services
define the basic units that will be assembled to describe the complex behavior of an external service,
taking into account various temporal dependencies. Like the external services, the internal services can
be gathered into internal modes. Internal modes define the internal states of a service according to its
ability to operate under various contextual situations. Some of the basic principles underlying
the implementation of the model on hardware architectures have also been exposed. The advantages of
digital instrumentation over conventional instrumentation for the processing are well
established today. We have finally discussed an implementation that can be generated automatically
through the use of a language like CAP. The automatic generation of the implementation is conditioned on
the formal verification of the properties underlying the generic model.
As stated earlier, the intelligent sensors to come will be composed of more complex and heterogeneous
components. This trend will change the industrial landscape, making the trade and assembly of IPs embodied
in layouts, RTL (Register Transfer Level) designs, and software programs indispensable [17]. This aspect,
not specific to the design of intelligent sensors, is a great challenge. For example, hardware/software
cosimulation is often performed with separate simulation models [26]. This makes trade-off evaluation
difficult because the models must be recompiled whenever a change in the architecture mapping
is made. The CoFluent system-level design tool [27] introduces an intermediate level of abstraction, the
functional level, between the specification and the architectural model of the sensor [28]. As shown in
Figure 46.18, this level defines the logical architecture of the system in terms of functional components
[Figure: the MCSE flow proceeds from customer needs through requirement definition (requirements document), system specification (specifications document), functional design (functional model), architecture design (architectural model), and prototyping (prototypes); capitalization of IP feeds back into the flow and leads to the product.]
FIGURE 46.18 The MCSE design methodology for complex sensor hardware architectures.
(simply called functions) and the relations between them (ports, shared variables, or events, depending on the
kind of relationship). As in the Vulcan system [29], the use of a control/data flow graph for the behavioral
model facilitates the partitioning at the operation level. This environment has been used successfully for
the design of an intelligent sensor for pattern recognition [30]. The functional model provides an environment
for behavioral and performance analysis in a technology- and language-independent manner
that allows implementation of the same functionality on diverse physical architectures [31]. Automatic
synthesis can be achieved on hardware (VHDL descriptions) or RTOS primitives (VxWorks). A SystemC
simulation engine and code generator is also available for system-on-chip (SoC) design.
Acknowledgment
All the figures related to the CAP language are reprinted with permission from Dr. L. Foulloy,
University of Savoy, France.
References
[1] N. Medvidovic and R.N. Taylor. A classification and comparison framework for software
architecture description languages. IEEE Transactions on Software Engineering, 26: 70-93, 2000.
[2] M. Staroswiecki and M. Bayart. Models and languages for the interoperability of smart
instruments. Automatica, 32: 859-873, 1996.
[3] N. Halbwachs. Synchronous Programming of Reactive Systems. Kluwer Academic Publishers,
Dordrecht, 1993.
[4] E. Benoit, R. Dapoigny, and L. Foulloy. Fuzzy-based intelligent sensors: modelling, design,
applications. In Proceedings of the 8th IEEE International Conference on Emerging Technologies
(ETFA 2001), Antibes, France, October 2001.
[5] E. Dekneuvel, M. Ghallab, and J.P. Thibault. Hypotheses management in a multi-sensory per-
ception machine. In Proceedings of the 10th European Conference on Artificial Intelligence (ECAI),
Vienna, Austria, August 1992.
[6] J.M. Riviere, M. Bayart, J.M. Thiriet, A. Bouras, and M. Robert. Intelligent instruments: some
modeling approaches. Measurement and Control, 29: 179-186, 1996.
[7] S. Edwards, L. Lavagno, E.A. Lee, and A. Sangiovanni-Vincentelli. Design of embedded systems:
formal models, validation and synthesis. Proceedings of the IEEE, 85: 366-390, 1997.
[8] D. Harel, H. Lachover, A. Naamad, A. Pnueli et al. STATEMATE: a working environment for
the development of complex reactive systems. IEEE Transactions on Software Engineering, 16:
403-414, 1990.
[9] J.P. Calvez. Embedded Real-Time Systems: A Specification and Design Methodology. John Wiley &
Sons, New York, 1993.
[10] A. Bouras and M. Staroswiecki. Building distributed architectures by the interconnection of
intelligent instruments. In Proceedings of IFAC INCOM'98, Nancy, France, June 1998.
[11] J. Tailland, L. Foulloy, and E. Benoit. Automatic generation of intelligent instruments from internal
model. In Proceedings of the International Conference SICICA 2000, Argentina, September 2000.
[12] S. Fleury, M. Herrb, and R. Chatila. Design of a modular architecture for autonomous robot. In
Proceedings of the IEEE International Conference on Robotics and Automation, San Diego, CA, 1994.
[13] M. Staroswiecki, G. Hoblos, and A. Aitouche. Fault tolerance analysis of sensor systems. In
Proceedings of the 38th IEEE Conference on Decision and Control, Phoenix, AZ, 1999.
[14] R.P. Kurshan. Computer-Aided Verification of Coordinating Processes: The Automata-Theoretic
Approach. Princeton University Press, Princeton, NJ, 1994.
[15] J. Warrior. Smart sensor networks of the future. Sensors Magazine, March 1997, pp. 40-45.
[16] B.P. Douglass. Doing Hard Time: Developing Real-Time Systems with UML, Objects, Frameworks,
and Patterns. Addison-Wesley, Reading, MA, 1999.
[17] F. Balarin, M. Chiodo, P. Giusto, H. Hsieh, A. Jurecska, L. Lavagno, C. Passerone, A. Sangiovanni-
Vincentelli, E. Sentovich, K. Suzuki, and B. Tabbara. Hardware-Software Co-Design of Embedded
Systems. Kluwer Academic Publishers, Dordrecht, 1997.
[18] G. Berry and G. Gonthier. The ESTEREL synchronous programming language: design, semantics,
implementation. Science of Computer Programming, 19: 87-152, 1992.
[19] Esterel Studio. http://www.esterel-technologies.com
[20] C. André. Representation and analysis of reactive behaviors: a synchronous approach. In
Proceedings of CESA'96, Lille, France, 1996, pp. 19-29.
[21] C. André, F. Boulanger, and A. Girault. Software implementation of synchronous programs.
In Proceedings of the 2nd International Conference on Application of Concurrency to System Design
(ICACSD 2001), Newcastle upon Tyne, UK, June 25-29, 2001, pp. 133-142.
[22] F. Cassez and O. Roux. Compilation of the ELECTRE reactive language into finite transition
systems. Theoretical Computer Science, 146: 109-143, 1995.
[23] E. Benoit, J. Tailland, L. Foulloy, and G. Mauris. A software tool for designing intelligent sensors.
In Proceedings of the IEEE Instrumentation and Measurement Technology Conference (IMTC/2000),
Baltimore, MD, May 2000.
[24] G. Mauris, E. Benoit, and L. Foulloy. Ultrasonic smart sensors: the importance of the measurement
principle. In Proceedings of the IEEE/SMC International Conference on Systems Engineering in the
Service of Humans, Le Touquet, France, October 1993.
[25] EmbeddedJava and JavaCard. http://java.sun.com
[26] J. Rowson. Hardware/software co-simulation. In Proceedings of the Design Automation Conference,
1994, pp. 439-440.
[27] CoFluent Studio. http://www.cofluentdesign.com
[28] J.P. Calvez. A co-design case study with the MCSE methodology. Design Automation of Embedded
Systems, Special Issue on Embedded Systems Case Studies, 1: 183-211, 1996.
[29] R.K. Gupta, C.N. Coelho, and G. De Micheli. Program implementation schemes for
hardware/software systems. IEEE Computer, 27: 48-55, 1994.
[30] E. Dekneuvel, F. Muller, and T. Pitarque. Ultrasonic smart sensor design for a distributed percep-
tion system. In Proceedings of the 8th IEEE International Conference on Emerging Technologies and
Factory Automation (ETFA), Antibes Juan-les-Pins, France, 15-18 October 2001.
[31] J.P. Calvez and O. Pasquier. Performance assessment of embedded HW/SW systems. In Proceedings
of the International Conference on Computer Design, Austin, TX, October 1995.