
Performance driven FPGA design

with an ASIC perspective

Andreas Ehliar

Linköping, 2009
Performance driven FPGA design with an ASIC perspective
Andreas Ehliar
Dissertations, No. 1237

Copyright © 2008-2009 Andreas Ehliar (unless otherwise noted)
ISBN: 978-91-7393-702-3
ISSN: 0345-7524
Printed by LiU-Tryck, Linköping 2009

Front cover: Pipeline of an FPGA optimized processor (see Chapter 7)

Back cover: Die photo of a DSP processor optimized for audio decoding (see Chapter 6)

URL for online version: http://urn.kb.se/resolve?urn=urn:nbn:se:liu:diva-16732
Errata lists will also be published at this location if necessary.

Parts of this thesis are reprinted with permission from IET, IEEE, and FPGAworld.com.
The following notice applies to material which is copyrighted by IEEE:
This material is posted here with permission of the IEEE. Such permission of the IEEE
does not in any way imply IEEE endorsement of any of Linköping universitet’s products
or services. Internal or personal use of this material is permitted. However, permission to
reprint/republish this material for advertising or promotional purposes or for creating new
collective works for resale or redistribution must be obtained from the IEEE by writing to
pubs-permissions@ieee.org. By choosing to view this material, you agree to all provisions
of the copyright laws protecting it.
Abstract

FPGAs are an important component in many modern devices.
This means that it is important that VLSI designers have a thorough
knowledge of how to optimize designs for FPGAs. While the design
flows for ASICs and FPGAs are similar, there are also many differences
due to the limitations inherent in FPGA devices. To use
an FPGA efficiently it is important to be aware of both the strengths and
weaknesses of FPGAs. If an FPGA design is to be ported to an ASIC
at a later stage, it is also important to take this into account early in the
design cycle so that the ASIC port will be efficient.
This thesis investigates how to optimize a design for an FPGA through
a number of case studies of important SoC components. One of these
case studies discusses high speed processors and the tradeoffs that are
necessary when constructing very high speed processors in FPGAs. The
processor has a maximum clock frequency of 357 MHz in a Xilinx Virtex-4
device of the fastest speed grade, which is significantly higher than
Xilinx' own processor in the same FPGA.
Another case study investigates floating point datapaths and describes
how a floating point adder and multiplier can be efficiently implemented
in an FPGA.
The final case study investigates Network-on-Chip architectures and
how these can be optimized for FPGAs. The main focus is on packet
switched architectures, but a circuit switched architecture optimized for
FPGAs is also investigated.
All of these case studies also contain information about potential pitfalls
when porting designs optimized for an FPGA to an ASIC. The focus
in this case is on systems where initial low volume production will be
using FPGAs while still keeping the option open to port the design to
an ASIC if the demand is high. This information will also be useful for
designers who want to create IP cores that can be efficiently mapped to
both FPGAs and ASICs.
Finally, a framework is also presented which allows for the creation
of custom backend tools for the Xilinx design flow. The framework is
already useful for some tasks, but the main reason for including it is to
inspire researchers and developers to use this powerful ability in their
own design tools.

Popular Science Summary
(Populärvetenskaplig Sammanfattning)

A field-programmable gate array (FPGA) is often an important component
in many modern devices. This means that it is important that people
working with VLSI design know how to optimize circuits for these devices.
The design flows for an FPGA and an application-specific integrated
circuit (ASIC) are similar, but there are also many differences that stem
from the limitations inherent in an FPGA. To use an FPGA
efficiently it is necessary to know both its weaknesses and its strengths.
If an FPGA based design needs to be converted to an ASIC at a later
stage, it is also important to take this into account early on so that the
conversion can be done as efficiently as possible.
This thesis investigates how a design can be optimized for an
FPGA through a number of case studies of important components in a system
on chip (SoC). One of these case studies discusses a processor with a high
clock frequency and the compromises that are necessary when such a processor
is constructed for an FPGA. In a Virtex-4 of the highest speed grade this
processor can run at a clock frequency of 357 MHz, which is
considerably faster than Xilinx' own processor in the same FPGA.
Another case study investigates datapaths for floating point numbers and
describes how a floating point adder and multiplier can be implemented
efficiently in an FPGA.
The final case study investigates architectures for networks on chip and
how these can be optimized for FPGAs. The main focus in this part is on
packet switched networks, but a circuit switched network optimized for
FPGAs is also investigated.
All case studies also contain information about potential pitfalls
when the circuits are to be converted from an FPGA to an ASIC. In this
case the focus is mainly on systems where low volume production uses
FPGAs and where it is important to keep the option open for an ASIC
conversion if the demand for the product turns out to be high. This section
is also of interest to developers who want to create IP cores that are
efficient in both FPGAs and ASICs.
Finally, a framework is presented which can be used to create
customized backend tools for the design flow used by Xilinx.
This framework is already useful for some tasks, but the
main reason for including it is to inspire other researchers
and developers to use this powerful capability in their own
development tools.

Abbreviations

• ASIC: Application Specific Integrated Circuit

• CLB: Configurable Logic Block

• DSP: Digital Signal Processing

• DSP48, DSP48E: A primitive optimized for DSP operations in some Xilinx FPGAs

• FD, FDR, FDE: Various flip-flop primitives in Xilinx FPGAs

• FIR: Finite Impulse Response

• FFT: Fast Fourier Transform

• FPGA: Field Programmable Gate Array

• HDL: Hardware Description Language

• IIR: Infinite Impulse Response

• IP: Intellectual Property

• kbit: Kilobit (1000 bits)

• kB: Kilobyte (1000 bytes)

• KiB: Kibibyte (1024 bytes)

• LUT: Look-Up Table

• LUT1, LUT2, . . . , LUT6: Lookup-tables with 1 to 6 inputs

• MAC: Multiply and Accumulate

• MDCT: Modified Discrete Cosine Transform

• NoC: Network on Chip

• NRE: Non Recurring Engineering

• OCN: On Chip Network

• PCB: Printed Circuit Board

• RTL: Register Transfer Level

• SRL16: A 16-bit shift register in Xilinx FPGAs

• VLSI: Very Large Scale Integration

• XDL: Xilinx Design Language

Acknowledgments

There are many people who have made this thesis possible. First of all,
without the support of my supervisor, Prof. Dake Liu, this thesis would
never have been written. Thanks for taking me on as your Ph.D. student!
I would also like to thank my fiancée, Helene Karlsson, for her patience
with my working hours during the last year. Thanks for your understanding!
I’ve also had the honor of co-authoring publications with Johan Eilert,
Per Karlström, Daniel Wiklund, Mikael Olausson, and Di Wu.
Additionally, in no particular order1 I would like to acknowledge the
following:

• The community on the comp.arch.fpga newsgroup for serving as a
great inspiration regarding FPGA optimizations.

• Göran Bilski from Xilinx for an interesting discussion about soft
core processors.

• All present and former Ph.D. students at the division of Computer
Engineering.

• Ylva Jernling for taking care of administrative tasks of a bureaucratic
nature and Anders Nilsson (Sr) for taking care of administrative
tasks of a technical nature.

• Pat Mead from Altera for an interesting discussion about Altera's
Hardcopy program.

1 Ensured by entropy gathered from /dev/random.

• All the teaching staff at Datorteknik, especially Lennart Bengtsson
who offered much valuable advice when I was given the responsi-
bility of giving the lectures in basic switching theory.

Finally, my parents have always supported me in both good and bad
times. Thank you.
Andreas Ehliar, 2009

Contributions

My main contributions are:

• An investigation of the design tradeoffs for the data path and con-
trol path of a 32-bit microprocessor with DSP extensions optimized
for the Virtex-4 FPGA. The microprocessor is optimized for very
high clock frequencies (around 70% higher than Xilinx’ own Mi-
croblaze processor). Extra care was taken to keep the pipeline as
short as possible while still retaining as much flexibility as possible
at these frequencies. The processor should be very good for stream-
ing signal processing tasks and adequate for general purpose tasks
when compared with other FPGA optimized processors. Finally, it
is also possible to port the processor to an ASIC with high perfor-
mance.

• A network-on-chip architecture optimized for very high clock frequencies
in FPGAs. The focus of this work was to take a simple
packet switched NoC architecture and push the performance as
high as possible in an FPGA. When published this was probably
the fastest packet switched NoC for FPGAs and it is still very com-
petitive when compared with all types of FPGA based NoCs. This
NoC architecture has also been released as open source to allow
other researchers to access a high performance NoC architecture
for FPGAs and improve on it if desired.

• A high performance floating point adder and multiplier with performance
comparable to commercially available floating point modules
for Xilinx FPGAs.

• A library for analysis and manipulation of netlists in the backend
part of Xilinx' design flow. This library and some supporting utilities,
most notably a logic analyzer core inserter, have also been released
as open source to serve as an inspiration for other researchers
interested in this subject.

• An investigation of how various kinds of FPGA optimizations will
impact the performance and area of an ASIC port.

Preface

This thesis presents my research from October 2003 to January 2009. The
following papers are included in the thesis:

Paper I: Using low precision floating point numbers to reduce memory cost for MP3 decoding

The first paper, written in collaboration with Johan Eilert, describes a
DSP processor optimized for MP3 decoding. By using floating point
arithmetic it is possible to lower the memory demands of MP3 decod-
ing and also simplify firmware development. It was published at the
International Workshop on Multimedia Signal Processing, 2004.
Contributions: The contributions in this paper from Johan Eilert and
me are roughly equal.

Paper II: An FPGA based Open Source Network-on-chip Architecture

The second paper presents an open source packet switched NoC archi-
tecture optimized for Xilinx FPGAs. It was published at FPL 2007. The
source code for this NoC is also available under an open source license
to allow other researchers to build on this work.

Paper III: Thinking outside the flow: Creating
customized backend tools for Xilinx based de-
signs
The third paper presents the PyXDL tool which allows XDL files to be
analyzed and edited from Python. It was published at FPGAWorld 2007.
The PyXDL tool is available as open source.

Paper IV: A High Performance Microprocessor with DSP Extensions Optimized for the Virtex-4 FPGA
The fourth paper, written in collaboration with Per Karlström, presents
a high performance microprocessor which is heavily optimized for the
Virtex-4 FPGA through both manual instantiation of FPGA primitives
and floorplanning. It was published at Field Programmable Logic and
Applications, 2008.
Contributions: I designed most of the architecture of the processor,
Per Karlström helped me with reviewing the architecture of the proces-
sor and evaluated whether it was possible to add floating point units to
the processor.

Paper V: High performance, low-latency field-programmable gate array-based floating-point adder and multiplier units in a Virtex 4
The fifth paper, written in collaboration with Per Karlström, studies float-
ing point numbers and how to efficiently create a floating point adder
and multiplier in an FPGA. It was published by IET Computers & Digi-
tal Techniques, Vol. 2, No. 4, 2008.
Contributions: Per Karlström is responsible for the IEEE compliant
rounding modes and the test suite. The remaining contributions in this
paper are roughly equal.

Paper VI: An ASIC Perspective on High Performance FPGA Design
The final paper is a study of how various FPGA optimizations will
impact an ASIC port of an FPGA based design. It has been submitted for
possible publication to the IEEE conference on Field Programmable Logic
and Applications, 2009.

Licentiate Thesis
The content of this thesis is also heavily based on my licentiate thesis:

• Aspects of System-on-Chip Design for FPGAs, Andreas Ehliar, Linköping
Studies in Science and Technology, Thesis No. 1371, Linköping,
Sweden, June 2008

Other research interests

Besides the papers included in this thesis, my research interests also
include hardware for video codecs and network processors.

Other Publications
• Flexible route lookup using range search, Andreas Ehliar, Dake Liu;
Proc of the The Third IASTED International Conference on Com-
munications and Computer Networks (CCN), 2005

• High Performance, Low Latency FPGA based Floating Point Adder and
Multiplier Units in a Virtex 4, P. Karlström, A. Ehliar, D. Liu; 24th
Norchip Conference, 2006.

Contents

1 Introduction 1
1.1 Scope of this Thesis . . . . . . . . . . . . . . . . . . . . . . . 2
1.2 Organization . . . . . . . . . . . . . . . . . . . . . . . . . . . 3

I Background 5

2 Introduction to FPGAs 7
2.1 Special Blocks . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.2 Xilinx FPGA Design Flow . . . . . . . . . . . . . . . . . . . 9
2.3 Optimizing a Design for FPGAs . . . . . . . . . . . . . . . . 10
2.3.1 High-Level Optimization . . . . . . . . . . . . . . . 10
2.3.2 Low-level Logic Optimizations . . . . . . . . . . . . 11
2.3.3 Placement Optimizations . . . . . . . . . . . . . . . 12
2.3.4 Optimizing for Reconfigurability . . . . . . . . . . . 13
2.4 Speed Grades, Supply Voltage, and Temperature . . . . . . 14

3 Methods and Assumptions 17


3.1 General HDL Code Guidelines . . . . . . . . . . . . . . . . 18
3.2 Finding Fmax for FPGA Designs . . . . . . . . . . . . . . 19
3.2.1 Timing Constraints . . . . . . . . . . . . . . . . . . . 19
3.2.2 Other Synthesis Options . . . . . . . . . . . . . . . . 20
3.3 Possible Error Sources . . . . . . . . . . . . . . . . . . . . . 21
3.3.1 Bugs in the CAD Tools . . . . . . . . . . . . . . . . . 21

3.3.2 Guarding Against Bugs in the Designs . . . . . . . . 23
3.3.3 A Possible Bias Towards Xilinx FPGAs . . . . . . . 23
3.3.4 Online Errata . . . . . . . . . . . . . . . . . . . . . . 24
3.4 Method Summary . . . . . . . . . . . . . . . . . . . . . . . . 24

4 ASIC vs FPGA 27
4.1 Advantages of an ASIC Based System . . . . . . . . . . . . 27
4.1.1 Unit Cost . . . . . . . . . . . . . . . . . . . . . . . . . 27
4.1.2 Higher Performance . . . . . . . . . . . . . . . . . . 28
4.1.3 Power Consumption . . . . . . . . . . . . . . . . . . 28
4.1.4 Flexibility . . . . . . . . . . . . . . . . . . . . . . . . 29
4.2 Advantages of an FPGA Based System . . . . . . . . . . . . 30
4.2.1 Rapid Prototyping . . . . . . . . . . . . . . . . . . . 30
4.2.2 Setup Costs . . . . . . . . . . . . . . . . . . . . . . . 30
4.2.3 Configurability . . . . . . . . . . . . . . . . . . . . . 31
4.3 Other Solutions . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.4 ASIC and FPGA Tool Flow . . . . . . . . . . . . . . . . . . . 33

5 FPGA Optimizations and ASICs 37


5.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 39
5.2 ASIC Port Method . . . . . . . . . . . . . . . . . . . . . . . 39
5.3 Finding Fmax for ASIC Designs . . . . . . . . . . . . . . . . 40
5.4 Relative Cost Metrics . . . . . . . . . . . . . . . . . . . . . . 41
5.5 Adders . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.6 Multiplexers . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
5.7 Datapath Structures with Adders and Multiplexers . . . . 47
5.8 Multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
5.9 Memories . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.9.1 Dual Port Memories . . . . . . . . . . . . . . . . . . 55
5.9.2 Multiport Memories . . . . . . . . . . . . . . . . . . 56
5.9.3 Read-Only Memories . . . . . . . . . . . . . . . . . . 58
5.9.4 Memory Initialization . . . . . . . . . . . . . . . . . 59
5.9.5 Other Memory Issues . . . . . . . . . . . . . . . . . 59
5.10 Manually Instantiating FPGA Primitives . . . . . . . . . . . 61

5.11 Manual Floorplanning and Routing . . . . . . . . . . . . . 62
5.12 Pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
5.13 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

II Data Paths and Processors 65

6 An FPGA Friendly Processor for Audio Decoding 67


6.1 Why Develop Yet Another FPGA Based Processor? . . . . 68
6.2 An Example of an FPGA Friendly Processor . . . . . . . . . 69
6.2.1 Processor Architecture . . . . . . . . . . . . . . . . . 69
6.2.2 Pipeline . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.2.3 Register File . . . . . . . . . . . . . . . . . . . . . . . 71
6.2.4 Performance and Area . . . . . . . . . . . . . . . . . 71
6.2.5 What Went Right . . . . . . . . . . . . . . . . . . . . 73
6.2.6 What Could Be Improved . . . . . . . . . . . . . . . 74
6.2.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . 75

7 A Soft Microprocessor Optimized for the Virtex-4 77


7.1 Arithmetic Logic Unit . . . . . . . . . . . . . . . . . . . . . 78
7.2 Result Forwarding . . . . . . . . . . . . . . . . . . . . . . . 82
7.3 Address Generator . . . . . . . . . . . . . . . . . . . . . . . 84
7.4 Pipeline Stall Generation . . . . . . . . . . . . . . . . . . . . 87
7.5 Shifter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
7.6 Other Issues . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
7.6.1 Register File . . . . . . . . . . . . . . . . . . . . . . . 91
7.6.2 Input/Output . . . . . . . . . . . . . . . . . . . . . . 91
7.6.3 Flag Generation . . . . . . . . . . . . . . . . . . . . . 91
7.6.4 Branches . . . . . . . . . . . . . . . . . . . . . . . . . 92
7.6.5 Immediate Data . . . . . . . . . . . . . . . . . . . . . 92
7.6.6 Memories and the MAC Unit . . . . . . . . . . . . . 92
7.7 Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
7.7.1 Porting the Processor to an ASIC . . . . . . . . . . . 94
7.8 Comparison with Related Work . . . . . . . . . . . . . . . . 95

7.8.1 MicroBlaze . . . . . . . . . . . . . . . . . . . . . . . . 96
7.8.2 OpenRisc . . . . . . . . . . . . . . . . . . . . . . . . 96
7.9 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
7.10 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 98

8 Floating point modules 99


8.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 100
8.2 Designing Floating Point Modules . . . . . . . . . . . . . . 100
8.3 Unoptimized Floating Point Hardware . . . . . . . . . . . . 102
8.4 Optimizing the Multiplier . . . . . . . . . . . . . . . . . . . 103
8.5 Optimizing the Adder . . . . . . . . . . . . . . . . . . . . . 103
8.6 Comparison with Related Work . . . . . . . . . . . . . . . . 104
8.7 ASIC Considerations . . . . . . . . . . . . . . . . . . . . . . 106
8.8 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

III On-Chip Networks 109

9 On-chip Interconnects 111


9.1 Buses . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 111
9.1.1 Bus Performance . . . . . . . . . . . . . . . . . . . . 112
9.1.2 Bus Protocols . . . . . . . . . . . . . . . . . . . . . . 113
9.1.3 Arbitration . . . . . . . . . . . . . . . . . . . . . . . . 114
9.1.4 Buses and Bridges . . . . . . . . . . . . . . . . . . . 114
9.1.5 Crossbars . . . . . . . . . . . . . . . . . . . . . . . . 116
9.2 On Chip Networks . . . . . . . . . . . . . . . . . . . . . . . 117
9.2.1 Network Protocols . . . . . . . . . . . . . . . . . . . 117
9.2.2 Deadlocks . . . . . . . . . . . . . . . . . . . . . . . . 118
9.2.3 Livelocks . . . . . . . . . . . . . . . . . . . . . . . . . 120

10 Network-on-Chip Architectures for FPGAs 121


10.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
10.2 Buses and Crossbars in an FPGA . . . . . . . . . . . . . . . 123
10.3 Typical IP Core Frequencies . . . . . . . . . . . . . . . . . . 124
10.4 Choosing a NoC Configuration . . . . . . . . . . . . . . . . 126

10.4.1 Hybrid Routing Mechanism . . . . . . . . . . . . . . 126
10.4.2 Packet Switched . . . . . . . . . . . . . . . . . . . . 128
10.4.3 Circuit Switched NoC . . . . . . . . . . . . . . . . . 131
10.4.4 Minimal NoC . . . . . . . . . . . . . . . . . . . . . . 131
10.4.5 Comparing the NoC Architectures . . . . . . . . . . 132
10.5 Wishbone to NoC Bridge . . . . . . . . . . . . . . . . . . . . 133
10.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 135
10.7 Availability . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
10.8 ASIC Ports . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136
10.9 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

IV Custom FPGA Backend Tools 141

11 FPGA Backend Tools 143


11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
11.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.3 PyXDL . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
11.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

V Conclusions and Future Work 147

12 Conclusions 149
12.1 Successful Case Studies . . . . . . . . . . . . . . . . . . . . . 149
12.2 Porting FPGA Optimized Designs to ASICs . . . . . . . . . 150

13 Future Work 151


13.1 FPGA Optimized DSP . . . . . . . . . . . . . . . . . . . . . 151
13.2 Floating Point Arithmetic on FPGAs . . . . . . . . . . . . . 152
13.3 Network-on-Chip . . . . . . . . . . . . . . . . . . . . . . . . 153
13.4 Backend Tools . . . . . . . . . . . . . . . . . . . . . . . . . . 153
13.5 ASIC Friendly FPGA Designs . . . . . . . . . . . . . . . . . 153

Chapter 1
Introduction

Field programmable logic has developed from being small devices used
mainly as glue logic to capable devices which are able to replace ASICs
in many applications. Today, FPGAs are used in areas as diverse as flat
panel televisions, network routers, space probes and cars. FPGAs are
also popular in universities and other educational settings as their
configurability makes them an ideal platform when teaching digital design,
since students can actually implement and test their designs instead of
merely simulating them. In fact, the availability of cheap FPGA boards
means that even amateurs can get into the area of digital design.
As a measure of the success that FPGAs enjoy, there are circa 7000
ASIC design starts per year whereas the number of FPGA design starts
is roughly 100,000 [1]. However, most of the FPGA design starts are
likely to be for fairly low volume products as the unit price of FPGAs
make them unattractive for high volume production. Similarly, most of
the ASIC design starts are probably only intended for high volume prod-
ucts due to the high setup cost and low unit cost of ASICs. Even so, the
ASIC designs are likely to be prototyped in FPGAs. And if a low volume
FPGA product is successful it may have to be converted to an ASIC.
One of the motivations behind this thesis is to investigate a scenario
where an FPGA based product has been so successful that it makes sense
to convert it into an ASIC. However, there are many ways that an ASIC
and FPGA design can be optimized and not every ASIC optimization


can be used in an FPGA and vice versa. If the FPGA design was not
designed with an ASIC in mind from the beginning, it may be hard to
create such a port. This thesis will classify and investigate various FPGA
optimizations to determine whether they make sense to use in a product
that may have to be ported to an ASIC. This part of the thesis should
also be of interest to engineers who are tasked with creating IP cores for
FPGAs if the IP cores may have to be used in ASICs.
Another motivation is simply the fact that the large success of FPGAs
of course also means that there is a large need for information about how
to optimize designs for these devices. Or, to put it another way, a de-
sire to advance the state of the art in creating designs that are optimized
for FPGAs. This effort has focused on areas where we believed that the
current state of the art could be substantially improved or substantially
better documented.
A more personal motivation is the fact that relatively little research
on FPGA optimized design is happening in Sweden. After all, it is more
likely that a freshly graduated student from a university will be involved
in VLSI design for FPGAs rather than ASICs. My hope is that this thesis
can serve as an inspiration for these students and perhaps even inspire
other researchers to look further into this interesting field.
The results in this thesis should be of interest for engineers tasked
with the creation of FPGA based stand alone systems, accelerators, and
soft processor cores.

1.1 Scope of this Thesis


This thesis is mainly based on case studies where important SoC compo-
nents were optimized for FPGAs. The main case studies are:

• Microprocessors

• Floating point datapath components

• Networks-on-Chip

These were selected as they are representative of a variety of inter-


esting and varied architectural choices where we believed that we could
improve the state of the art. For example, when we began the micro-
processor research project there were no credible DSP processors opti-
mized for FPGAs. The NoC situation was similar in that most NoC re-
search had been done on ASICs and very few NoCs had been optimized
for FPGAs in any way. The floating point datapath is slightly different as
there were already a few floating point adders and multipliers with good
performance available. However, all of these were proprietary cores
without any documentation of how the high performance was reached.
These case studies are also interesting because they cover a fairly
wide area of interesting optimization problems. Microprocessors consist
of many small but latency critical datapaths. In contrast, when floating
point components are used to create datapath based architectures,
high throughput is required, but the latency is usually not as important.
NoCs are interesting because the datapaths in a NoC are intended mainly
to transport data as fast as possible instead of transforming data.
The opportunities and pitfalls when porting a design which has been
heavily optimized for an FPGA are also discussed for all of these case
studies.
Finally, a framework is presented which allows a designer to create
backend tools for the Xilinx design flow, either to analyze or modify a
design after it has been placed and routed.

1.2 Organization
The first part of this thesis contains important background information
about FPGAs, FPGA optimizations, design flow, and methods. This part
also contains a comparison of the performance and area cost for different
components in both FPGAs and ASICs.
Part II contains an investigation of two microprocessors (one FPGA
friendly processor and one FPGA optimized processor). This part also
contains a description of the floating point adder and multiplier. Part III

contains both a brief overview of Networks-on-Chip and a description


and comparison of FPGA optimized packet switched, circuit switched,
and statically scheduled NoCs. Part IV describes a way to create custom
tools to analyze and manipulate already created designs which will be
interesting for engineers wanting to create their own backend tools. Part
V contains conclusions and also a discussion about possible future work.
This section also contains a list of all ASIC porting guidelines that are
scattered throughout the thesis. Finally, Part VI contains the publications
that are relevant for this thesis.1

1 The electronic version of this thesis does not contain Part VI.
Part I

Background

Chapter 2
Introduction to FPGAs

An FPGA is a device that is optimized for configurability. As long as
the FPGA is large enough, it is able to mimic the functionality
of any digital design. When using an FPGA it is common to use an HDL
like VHDL or Verilog to describe the functionality of the FPGA. Specialized
software tools are used to translate the HDL source code into a
configuration bitstream for the FPGA that instructs the many configurable
elements in the FPGA how to behave.
Traditionally, an FPGA consisted of two main parts: routing and con-
figurable logic blocks (CLB). A CLB typically contains a small amount of
logic that can be configured to perform boolean operations on the inputs
to the CLB block. The logic can be constructed by using a small memory
that is used as a lookup table. This is often referred to as a LUT.
The logic in the CLB block is connected to a small number of flip-flops
in the CLB block. The CLBs are also connected to switch matrices that in
turn are connected to each other using a network of wires. A schematic
view of a traditional FPGA is shown in Figure 2.1.
In reality, today's FPGAs are much more complex devices and a number
of optimizations have been done to improve the performance of important
design components. For example, in Xilinx FPGAs, a CLB has
been further divided into slices. A slice in most Xilinx devices, for example,
consists of two LUTs and two flip-flops. There is also special logic in
the slice to simplify common operations like combining two LUTs into a
larger LUT and creating efficient adders.

Figure 2.1: Schematic view of an FPGA. (a) FPGA overview; (b) CLB and
switch matrix (lookup tables and flip-flops reached via local connections,
with non-local connections between switch matrices).

2.1 Special Blocks


The basic architecture in Figure 2.1 is not very suitable when a memory
is needed. To improve the performance of memory dense designs, modern
FPGAs have embedded memory blocks capable of operating at high
speed. In a Virtex-4 FPGA, an embedded memory, referred to as a block
RAM, contains 512 words of 36 bits each. (It is also possible to configure
half of the LUTs in the CLBs as a small memory containing 16 bits; this
is referred to as a distributed RAM.) In contrast, a Stratix-3 from Altera
has embedded memory blocks of different sizes. There are many blocks
that contain 256 36-bit words and a few blocks with 2048 72-bit words.
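As an illustration (my own sketch, not taken from the thesis), the behavior of a Virtex-4 style block RAM can be modeled as a synchronous 512 × 36-bit memory whose read output appears after a clock edge:

```python
class BlockRAM:
    """Behavioral model of a 512-word x 36-bit synchronous memory.

    Write-first read-during-write behavior is assumed here for
    simplicity; real block RAMs offer several configurable modes."""
    DEPTH, WIDTH = 512, 36

    def __init__(self):
        self.mem = [0] * self.DEPTH
        self._read_reg = 0  # output register: data appears after a clock edge

    def clock(self, addr, write_data=None, write_enable=False):
        addr &= self.DEPTH - 1
        if write_enable:
            # Mask the data to the 36-bit word width before storing it
            self.mem[addr] = write_data & ((1 << self.WIDTH) - 1)
        self._read_reg = self.mem[addr]
        return self._read_reg
```

A dual port memory would simply expose two independent `clock` interfaces operating on the same `mem` array.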
To improve the performance of arithmetic operations like addition
and subtraction, there are special connections available that allow a LUT
to function as an efficient full adder. This is referred to as a carry chain. A
carry chain is also connected to adjacent slices to allow larger adders
to be created.
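The role of the carry chain can be made concrete with a small behavioral sketch (my own illustration, not from the thesis): each stage corresponds to one LUT configured as a full adder, with the carry passed to the next stage along the dedicated chain:

```python
def ripple_carry_add(a, b, width, carry_in=0):
    """Add two width-bit numbers one full-adder stage at a time.

    In an FPGA each stage would be one LUT plus the dedicated carry
    logic; the carry ripples along the fast carry chain to the
    adjacent slice when the adder is wider than one slice."""
    result, carry = 0, carry_in
    for i in range(width):
        abit = (a >> i) & 1
        bbit = (b >> i) & 1
        s = abit ^ bbit ^ carry                            # sum bit of this stage
        carry = (abit & bbit) | (carry & (abit ^ bbit))    # carry out to next stage
        result |= s << i
    return result, carry
```

The final carry out would be available for building an even wider adder in the slice above.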
To improve the performance of multiplication, hardwired multiplier
blocks are also available in most FPGAs, sometimes combined with other
logic such as an accumulator. In a Virtex-4, a block consisting of a multiplier
and an accumulator is called a DSP48 block. The multiplier is 18 × 18 bits
and the accumulator is 48 bits wide. There are also special connections
that make it easy to cascade several DSP48 blocks, which can be
used to build efficient FIR filters or larger multipliers, for example.
In some FPGAs there are also more specialized blocks like processor
cores, Ethernet controllers, and high speed serial links.

2.2 Xilinx FPGA Design Flow


A typical FPGA design flow consists of the following steps (in more ad-
vanced flows some of these steps may be combined):

• Synthesis: Translate RTL code into LUTs, flip-flops, memories, etc.

• Mapping: Map LUTs and flip-flops into slices

• Place and route: First decide where all slices, memory blocks, etc.
should be placed in the FPGA and then route all signals that con-
nect these components

• Bitfile generation: Convert the netlist produced by the place and
route step into a bitstream that can be used to configure the FPGA

• FPGA Configuration: Download the bitstream into the FPGA

There are also other steps that are optional but can be used in some
cases. A static timing analyzer, for example, can be used to determine
the critical path of a certain design. It can also be used to make sure that
a design is meeting the timing constraints, but this is seldom necessary
as the place and route tool will usually print a warning if the timing
constraints are not met.
There are special tools available to inspect and modify the design.
A floorplanning tool allows a designer to investigate the placement of
all components in a design and change the placement if necessary. An
FPGA editing tool can be used to view and edit the exact configuration
of a CLB and other components in terms of logic equations for LUTs,
flip-flop configuration, etc. It also shows how signals are routed in
the FPGA and allows the routing to be changed if necessary.
2.3 Optimizing a Design for FPGAs


Optimizing an algorithm for an FPGA uses the same general ideas as
optimizing for an ASIC: employ as much parallelism as is needed to
reach the required performance. However, as described below, the details
are not quite the same.

2.3.1 High-Level Optimization


Adding pipeline stages, if possible, is a simple way to increase
performance in both FPGAs and ASICs. It is usually especially area
efficient in FPGAs, since most FPGA designs are not flip-flop limited:
plenty of flip-flops are available, and an unused flip-flop
is a wasted flip-flop. Although pipelining is a general technique, some
designs cannot easily tolerate extra pipeline stages (e.g. microprocessors),
and other methods are required in those cases.
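As a rough back-of-the-envelope illustration (not a model used in the
thesis), the effect of pipelining on clock frequency can be sketched by
splitting a combinational delay into equal stages, each paying a fixed
register overhead; the 0.5 ns overhead below is an invented figure, not
data for any particular FPGA:

```python
def fmax_mhz(logic_delay_ns, stages, reg_overhead_ns=0.5):
    """Estimated clock frequency when a combinational path of
    `logic_delay_ns` is split into `stages` equal pipeline stages,
    each paying a fixed flip-flop clock-to-q plus setup overhead.
    The 0.5 ns default is an invented illustrative figure."""
    period_ns = logic_delay_ns / stages + reg_overhead_ns
    return 1000.0 / period_ns

# A 10 ns combinational path: pipelining it into four stages more
# than triples the achievable clock rate in this simple model.
print(round(fmax_mhz(10.0, 1), 1))  # -> 95.2
print(round(fmax_mhz(10.0, 4), 1))  # -> 333.3
```

The model also shows why pipelining eventually saturates: as the number
of stages grows, the fixed register overhead dominates the clock period.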
Another way to improve the performance of an FPGA is by utilizing
all capabilities of the embedded memories. In ASICs, dual port memo-
ries are more expensive than single port memories. Therefore it makes
sense to avoid dual port memories in many situations. However, in
FPGAs, the basic memory block primitive is usually dual-ported by de-
fault. Therefore it makes sense to use the memories in dual-ported mode
if it will simplify an algorithm. Similarly, each memory block in an FPGA
has a fixed size. Therefore it can make sense to decrease logic usage at
a cost of increased memory usage as long as the memory usage for that
part of the design will still fit into a certain block RAM.
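The block RAM fitting argument above can be sketched numerically. The
following is an illustrative calculation only, using the Virtex-4 figure
of 512 words of 36 bits quoted earlier and a naive tiling; real block
RAMs support several aspect ratios, so the tools can sometimes map a
memory more cleverly than this:

```python
import math

def blockrams_needed(words, width, bram_depth=512, bram_width=36):
    """Pessimistic count of Virtex-4-style block RAMs for a
    `words` x `width` memory, tiling depth and width independently.
    Real block RAMs also support other aspect ratios, so this is a
    rough upper bound rather than an exact rule."""
    return math.ceil(words / bram_depth) * math.ceil(width / bram_width)

print(blockrams_needed(512, 36))   # -> 1 (fits exactly)
print(blockrams_needed(520, 36))   # -> 2 (just past a block boundary)
```

A design needing 520 words thus pays for a whole extra block RAM, which
is why trading logic for memory only pays off while the memory still
fits into the blocks already used.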
Similarly, the multipliers in an FPGA have a fixed size (e.g. 18 × 18
bits in a Virtex-4). Unlike in an ASIC, where it is easy to
generate a multiplier of another size, it is worthwhile to make sure that
the algorithm does not need larger multipliers than those provided in the
FPGA. This works the other way around as well: coming up with a way to
reduce a multiplier from 16 × 16 bits to a mere 13 × 13 bits at the cost
of additional logic is not going to help in terms of resource utilization
(although it may improve the timing slightly).
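The 16 × 16 versus 13 × 13 example can be made concrete with a rough
block-count estimate. The decomposition below (schoolbook multiplication
with 17-bit limbs, since an 18 × 18 signed multiplier offers 17 usable
unsigned bits) is the author's illustration, not a rule taken from any
FPGA documentation:

```python
import math

def dsp_blocks_for_mult(n_bits, limb_bits=17):
    """Upper-bound number of 18x18 signed multiplier blocks for an
    n-bit unsigned multiply, assuming a simple schoolbook
    decomposition into 17-bit limbs.  Illustrative only; synthesis
    tools may do better with smarter decompositions."""
    limbs = math.ceil(n_bits / limb_bits)
    return limbs * limbs

# Both a 16x16 and a 13x13 multiply occupy a single block, so
# shrinking 16 bits to 13 saves no multiplier resources at all.
print(dsp_blocks_for_mult(16), dsp_blocks_for_mult(13))  # -> 1 1
print(dsp_blocks_for_mult(35))  # -> 9
```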
2.3.2 Low-level Logic Optimizations

In many cases there is no need to go further than the optimizations
mentioned in the previous section. However, if the performance
reached by those optimizations is not satisfactory, it is possible
to fine-tune the architecture for a certain FPGA. Some examples of how
to do this are:

• Modify the critical path to take advantage of the LUT structure.


For example, if an 8-to-1 multiplexer is required it will probably
be synthesized as shown in Figure 2.2(a) when synthesized to a
Virtex-4, utilizing a total of 4 LUTs distributed over two slices and
taking advantage of the built-in MUXF5 and MUXF6 primitives.
However, if it is possible to rearrange the logic so that each input
to the mux is zero whenever it is not selected, the
mux can be rearranged using a combination of or gates and muxes.
In Figure 2.2(b), the zero is arranged by using the reset input of a
flip-flop directly connected to the mux.

Other ways in which the design can be fine tuned is to make sure
that the algorithms are mapped to the FPGA in such a way that
adders can be efficiently combined with other components such as
muxes while keeping the number of logic levels low.

• In a Virtex-4 some LUTs can be configured as small shift registers.


This makes it very efficient to add small delay lines and FIFOs to a
design.

• Bit serial arithmetic can be a great way to maximize the through-
put of a design by minimizing the logic delays, at a cost of increased
complexity. To be worthwhile, a large degree of parallelism must
be available in the application. Bit (or digit) serial algorithms can
also be a very useful way to minimize the area cost of modules that
are required in a system but have low performance requirements,
such as a real time clock.
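The rearranged mux of Figure 2.2(b) depends on the precondition that
every non-selected input is forced to zero. A small software model (an
illustration by the editor, not thesis code) can check that OR-combining
zero-gated inputs is equivalent to a plain 8-to-1 mux:

```python
import random

def mux8(inputs, sel):
    """Plain behavioral 8-to-1 multiplexer."""
    return inputs[sel]

def or_mux8(inputs, sel):
    """The Figure 2.2(b) trick: every non-selected input is forced
    to zero (in hardware, e.g. via the reset input of the flip-flop
    feeding the mux), after which the selected value can be
    recovered by simply OR-ing everything together."""
    gated = [x if i == sel else 0 for i, x in enumerate(inputs)]
    result = 0
    for g in gated:
        result |= g
    return result

# Randomized check: both forms agree on arbitrary 8-bit inputs.
for _ in range(1000):
    ins = [random.getrandbits(8) for _ in range(8)]
    s = random.randrange(8)
    assert mux8(ins, s) == or_mux8(ins, s)
print("or-gate mux matches plain mux")
```

The hardware win is that the OR stage needs no select logic at all; the
selection work has been pushed back to where the inputs are produced.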
Figure 2.2: Example of low level logic optimization: 8-to-1 mux.
(a) Using four LUTs configured as 2-to-1 muxes; (b) using two LUTs
configured as or gates.

2.3.3 Placement Optimizations

If the required performance is not reached through either high or low
level optimizations, it is usually possible to gain a little more
performance by floorplanning. There are two kinds of floorplanning available
in an FPGA flow. The easiest is to tell the backend tools to place certain
modules in certain regions of the FPGA. This is rather coarse grained but
can be a good way to ensure that the timing characteristics of the design
do not vary too much over multiple place and route runs. The other
way is to describe manually (either in the HDL source code or through a
graphical user interface) how the FPGA primitives should be
placed. For example, if the critical path is long (several levels of LUTs), it
makes sense to make sure that all parts of it are closely packed, preferably
inside a single CLB due to the fast routing available inside a CLB. If the
design consists of a complicated data path, the entire data path could be
designed using RLOC attributes to ensure that the data path will always
be placed in a good way.
The advantages of floorplanning were investigated in [2], where
performance improvements of 30% to 50% were reported. However,
since that was published in 2000, a lot of development has happened in
automatic place and route. Today, the performance increase
that can be gained from floorplanning is closer to 10%, and it is often
enough to floorplan only the critical parts of the design [3]. It should also
be noted that it is very easy to reduce the performance of a design through
a slight mistake in the floorplanning.
Finally, if it is still not possible to meet timing even though floorplan-
ning has been explored, it might be possible to gain a little more perfor-
mance by manually routing some critical paths. The author is not aware
of any investigation into how much this will improve the performance,
but the general consensus seems to be that the performance gains are not
worth the source code maintenance nightmare that manual routing leads
to.

2.3.4 Optimizing for Reconfigurability

The ability to reconfigure an FPGA can be a powerful feature, especially
in the FPGA families where parts of the FPGA can be reconfigured
dynamically without impacting the operation of other parts of the FPGA.
This is very useful in a system that has to handle a wide
variety of tasks, under the assumption that it does not have to handle all
kinds of tasks simultaneously. In that case it may be possible to use
reconfiguration similarly to how an operating system for a computer
uses virtual memory: swap in hardware accelerators for the
current workload and swap out unused logic. This can lead to significant
unit cost reductions, as a smaller FPGA can be used without any loss
of functionality.
While this ability is powerful, it is only supported for a few FPGAs
and the support from the design tools is rather limited. But the config-
urability of an FPGA can still be useful, even if it is not possible to re-
configure the FPGA dynamically. One example is to use a special FPGA
bitstream for diagnostic testing purposes (e.g. testing the PCB that the
FPGA is located on). While such functionality could be included in the
main design it may be better from a performance and area perspective to
use a dedicated FPGA configuration for this purpose.

2.4 Speed Grades, Supply Voltage, and Temperature
Due to differences in manufacturing, the actual performance of a certain
FPGA family can vary by a significant amount between various speci-
mens. Faster devices are marked with a higher speed grade than slower
devices and can be sold at a premium by the FPGA manufacturers. There
is no exact definition of what a speedgrade means, but according to the
author’s experience of Xilinx’ devices, going up one speedgrade means
that the maximum clock frequency will increase around 15% depending
on the design and the FPGA. An example of the impact of the speedgrade
on a design is shown in Table 2.1 (the unit tested is a small
microcontroller with a bus interface, serial port, and parallel port). While
an upgrade in speed grade is an easy way to improve the performance
of a design, it is not cheap. For example, an XC4VLX80-10-FFG1148 had
a cost of $1103 in quantities of one unit on the 15th of October 2008 on
NuHorizon's webshop. The same device in speed grade 11 had a cost of
$1358 and speed grade 12 a cost of $1901. It is clearly a good idea to use
the slowest speedgrade possible.
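Using the October 2008 prices quoted above together with the Fmax
figures of Table 2.1, a quick calculation (author's illustration, not data
from the thesis itself) shows how steeply the cost per MHz rises with
speed grade:

```python
# Price (USD, quantity 1, October 2008) and Fmax (MHz) per
# speedgrade, taken from the text and Table 2.1.
data = {10: (1103, 210), 11: (1358, 246), 12: (1901, 277)}

p10, f10 = data[10]
for grade, (price, fmax) in sorted(data.items()):
    print(f"speedgrade -{grade}: {price / fmax:5.2f} $/MHz, "
          f"+{price / p10 - 1:4.0%} cost for +{fmax / f10 - 1:4.0%} Fmax")
```

Going from speedgrade 10 to 12 costs roughly 72% more for roughly 32%
higher Fmax, which is why the slowest grade that meets the requirements
is usually the right choice.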
Another factor that is seldom mentioned in FPGA related publications
is the supply voltage and temperature. By default, the static timing
analysis tools use the values for the worst corner (highest temperature
and lowest voltage). For some applications this is not necessary. If good
voltage regulation is available, which guarantees that the supply voltage
Design: small microcontroller    Device: Virtex-4

  Speedgrade    Fmax [MHz]
  10            210
  11            246
  12            277

Table 2.1: Impact of speedgrade on a sample FPGA design

Voltage    85 °C    65 °C    45 °C    25 °C    0 °C
1.14 V     323.2    324.1    325.3    326.4    329.6
1.18 V     334.0    335.1    336.2    337.4    340.8
1.22 V     344.5    345.7    346.9    347.9    351.4
1.26 V     354.5    355.6    356.8    357.8    361.4

Table 2.2: Maximum frequency [MHz] from static timing analysis using
different values for supply voltage and temperature

will not approach the worst case, we can specify a higher minimum volt-
age to the timing analyzer. Similarly, if good cooling is available, we can
specify that the FPGA will not exceed a certain temperature.
In Table 2.2, we can see the impact of these changes on a micro-
processor design in a Virtex-4 (speedgrade 12). In the upper left corner
the worst case with minimum supply voltage and maximum tempera-
ture is shown. The design will work at 323.2 MHz in all temperature and
voltage situations that the FPGA is specified for. On the other hand, if an
extremely good power and cooling solution is used, we could clock the
design at 361.4 MHz with absolutely no margin for error. This is a differ-
ence of over 10% without having to change anything in the design! It can
therefore be worthwhile to think about these values when synthesizing
a design for a certain application. Many real life designs will not need
to use the worst case values. However, results in publications are rarely,
if ever, based on other than worst case values. Therefore Table 2.2 is the
only place in this thesis where results are reported that are not based on
worst case temperature and supply voltage conditions.
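The "over 10%" figure can be read directly from the corner values of
Table 2.2:

```python
# Corner values from Table 2.2 (MHz).
worst = 323.2   # 1.14 V supply, 85 degrees C
best = 361.4    # 1.26 V supply, 0 degrees C

headroom = best / worst - 1
print(f"headroom between corners: {headroom:.1%}")  # -> 11.8%
```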
Chapter 3
Methods and Assumptions

Normally, the design flow for an FPGA based system will go through the
following design steps:

1. Idea

2. Design specification

3. HDL code development

4. Verification of HDL code

5. Synthesis/Place and route/bitstream generation

6. Manufacturing

7. (Debug of post manufacturing problems if necessary)

This is of course a simplified view. In practice, the process of writing a


design specification is a science in and of itself. Likewise with HDL code
implementation and verification, not to mention manufacturing. There
is usually some overlap between the phases as well, especially between
the verification and development phases.
The method used for the majority of designs described in this thesis
is based on the method described above. The most prominent idea in
our method is the fact that a rigid design specification is an obstacle to a
high performance VLSI design. There is therefore a considerable overlap


between the design specification phase and the HDL code development
phase. In fact, it is necessary to quickly identify areas that are likely to
cause performance problems and prototype these to gain the knowledge
that is necessary to continue with the design specification.
Another difference between a normal design flow and the design
flow employed in this thesis (and many other research projects) is that
a lot of effort was spent on low level optimizations with the intention of
reaching the very highest performance. This is uncommon in the indus-
try where performance that is “good enough” is generally accepted. To
know where the low level optimizations are required it is necessary to
study the output from the synthesis tool and the output from the place
and route tool. If there is something clearly suboptimal in the final netlist
it may be fixed either by rewriting the HDL code (possibly by instantiat-
ing low level FPGA primitives) or by manual floorplanning. This method
is described as “construct by correction” in [4] (which also contain a good
overview of the entire design process). Another description of the design
flow (with a focus on ASIP development) can be found in [5].

3.1 General HDL Code Guidelines


This thesis assumes that FPGA friendly rules are used when writing the
HDL code. Some of the more important guidelines are:

• Use clock enable signals instead of gating the clock

• Do not use latches

• Use only one clock domain if at all possible

• Do not use three-state drivers inside the chip

A thorough list of important guidelines for FPGA design can be found
in, for example, [6]. It should also be noted that many of the guidelines for
FPGA design are also useful for ASIC designs. For an in-depth discussion
of guidelines for VLSI design, see for example [4].
3.2 Finding Fmax for FPGA Designs


While the parameters mentioned in Section 2.4 are easy to understand,
there are other parameters that impact the maximum clock frequency.
Perhaps the most important are the synthesis and place and route
tools: depending on the tools that are used, different results will be
obtained. It is probably a good idea to request evaluation versions of the
various synthesis tools that are available from time to time, to see if there
is a reason to change tool. It should also be noted that it is not always a
good idea to upgrade the tools; it is not uncommon to find that an older
version produces better results for a certain design than the upgraded
version. The author has seen an older tool perform more than 10% better
than a newer tool on a certain design. In some cases it is even possible
that the best results will be achieved when combining tools from various
versions.
All tools in the FPGA design flow have many options that will impact
the maximum frequency, area, power usage, and sometimes even the
correctness of the final design. Many of these choices can also be made
on a module-by-module basis or even on a line-by-line basis in the HDL
source code. Finding the optimal choices for a certain design is not an
easy task. It is also not uncommon that the logical choice is not the best
solution (e.g. sometimes a design will synthesize to a higher speed if it is
optimized for area instead of speed).

3.2.1 Timing Constraints


Perhaps the most important of these options are the timing constraints
given to the tools. The tools will typically not optimize a design further
when it has reached the user specified timing constraints. If the timing
constraint cannot be achieved, different tools behave in different ways.
Xilinx' tools will spend a lot of time trying to meet a goal that cannot be
achieved. It is also not uncommon that an impossible timing constraint
results in a slower circuit than if a hard but achievable timing constraint
had been specified. Altera's tools, on the other hand, do not seem to be
plagued by this particular problem. If a very hard timing constraint is
set, Altera's place and route tool will give a design with roughly the same
Fmax as can be found when sweeping the timing constraint over a wide
region. (This behavior has been tested with ISE 10.1 and Quartus II 8.1.
The same behavior has also been reported in [7].)
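The constraint sweep described above can be automated. The sketch below
binary-searches for the tightest achievable period; `meets_timing` is a
placeholder callback standing in for a real place and route run (which
this sketch does not perform), and the search assumes timing closure is
roughly monotonic in the constraint, which, as noted above, is not
guaranteed for all tools:

```python
def tightest_period(meets_timing, lo_ns=1.0, hi_ns=20.0, tol_ns=0.05):
    """Binary-search for the shortest clock period constraint that
    still passes place and route.  `meets_timing(period_ns)` stands
    in for running the real tools with that constraint and checking
    the timing report.  Assumes closure is monotonic in the
    constraint, which is only approximately true in practice."""
    assert meets_timing(hi_ns), "design must meet the loose constraint"
    while hi_ns - lo_ns > tol_ns:
        mid = (lo_ns + hi_ns) / 2
        if meets_timing(mid):
            hi_ns = mid   # still passes: tighten further
        else:
            lo_ns = mid   # fails: back off
    return hi_ns

# Toy stand-in: pretend the design closes timing down to 3.2 ns.
best = tightest_period(lambda p: p >= 3.2)
print(f"about {1000 / best:.0f} MHz")
```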
Another important thing to consider is clock jitter. As clock frequen-
cies increase, jitter is becoming a significant issue that designers need to
be aware of. It is possible to specify the jitter of the incoming clock sig-
nals in the timing constraints. The use of modules like DCMs and DLLs
will also add to the jitter (this jitter value is usually added automatically
by the backend tools). This is important since the jitter will probably
account for a significant part of the clock period in a high speed de-
sign. However, since it seems to be very unusual to specify any sort of
clock jitter when publishing maximum frequencies for FPGA designs,
the numbers presented in this thesis will also ignore the effects of jitter1 .
The careful designer will therefore compensate for the lack of jitter when
evaluating the maximum frequency of different solutions for use in his
or her system.
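Such compensation can be approximated, to first order, by adding the
expected clock jitter to the minimum period behind the published figure;
the 350 MHz and 200 ps numbers below are invented for illustration:

```python
def derate_for_jitter(fmax_mhz, jitter_ps):
    """First-order correction: add the expected peak-to-peak clock
    jitter to the minimum period behind a published zero-jitter
    Fmax figure.  A rough screening estimate, not a substitute for
    proper timing constraints."""
    period_ns = 1000.0 / fmax_mhz + jitter_ps / 1000.0
    return 1000.0 / period_ns

# A design published at 350 MHz, re-evaluated with 200 ps of
# assumed system clock jitter (both numbers are invented):
print(f"{derate_for_jitter(350.0, 200.0):.0f} MHz usable")  # -> 327 MHz
```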

3.2.2 Other Synthesis Options


Other issues that will have an impact on the maximum frequency of a
certain design are the settings of the synthesis and backend tools. The
following is a list of some of the more important options:

• Overall optimization levels: If the design can relatively easily meet


the timing requirements there is no need to spend a lot of time on
optimizations.

• Retiming: The tools are allowed to move flip flops to try to balance
the pipeline for maximum speed

• Optimization goal: Area or speed


1 The author has yet to see an FPGA related publication where the authors specifically
state that they have used anything but 0 for the clock jitter.
• Should the hierarchy of the HDL design be kept or flattened to al-


low optimizations over module boundaries?

• Resource sharing: Allows resources to be shared if they are not


used at the same time.

In this thesis many of these options have been tweaked to produce


the best results for the case studies. As the HDL code itself gets more
and more optimized it is common that many optimizations are turned off
since they will interfere with the manual optimization that has already
been done.

3.3 Possible Error Sources


In a work such as this there are a wide variety of possible error sources.
Perhaps the most insidious source of error is bugs in the
CAD tools. The author has encountered serious bugs of various kinds
in many CAD tools during his time as a Ph.D. student. Some bugs are
easy to detect by the fact that the tool simply crashes with a cryptic error
message. Other bugs are harder to detect, such as when the wrong logic
is synthesized without any warning or error message to indicate this.
This is not intended as criticism towards any vendor but rather as an
observation of fact. Almost anyone who has used a program as complex
as a CAD tool for a longer period of time will discover bugs in it. And
anyone who has tried to develop a program as complex as a CAD tool
knows how hard it is to completely eliminate all bugs. Overall, the ven-
dors have been very responsive to bug reports as they are of course also
interested in removing bugs.

3.3.1 Bugs in the CAD Tools

Sometimes a synthesis bug is easy to detect, for example, if the area of the
design is significantly smaller than expected it is possible that a bug in
the optimization phase has removed logic that is actually used in the de-
sign. Sometimes bugs introduced by the synthesis or backend tools will
not have a dramatic effect on the area of a design and must be detected
by actually using the design in an FPGA. To guard against this possibil-
ity, all major designs in this thesis have been tested on at least one FPGA.
While minor bugs caused by the backend tools could still be present,
they are unlikely to ruin the conclusions of this thesis as they would be
present in fairly minor functionality of the designs that would only be
triggered under special circumstances. It should also be noted that it is
possible to simulate the synthesized netlist, which is yet another way to
detect whether the synthesis tool has done something wrong. (Bugs in
the backend tools are harder to detect.)

Another source of error that is even harder to detect is bugs in the


static timing analysis where a certain path is reported as being faster than
it actually is. This kind of error could mean that the maximum frequency
of a design will not be as high as the value reported by the tool. This is
harder to detect without testing the design on a wide variety of FPGAs
(ideally FPGAs that are known to just barely pass timing tests for the
speedgrade under test). Since this is clearly impractical, the only choice
is to trust the values reported by the static timing analysis tool (unless
the values that are reported are very suspicious).

Yet another possible problem is when the HDL simulator
does not simulate the hardware correctly. The most likely way to find
such bugs is to observe them during simulation. Another way is to ob-
serve that the FPGA does not behave as the simulation predicts (although
this can also mean that the synthesis tool is doing something wrong).

This situation is even worse for ASIC based design flows, as it is not
practical to manufacture a small test design just to see if it works. In sum-
mary, we have little choice but to rely on the tools. Yet it is important to
stay on guard and not trust the tools completely, especially when they
report odd results.
3.3.2 Guarding Against Bugs in the Designs


While tool bugs are very dangerous they are also quite rare. Another
more common source of bugs is simply the designer himself2 . The tra-
ditional way to guard against this is to write comprehensive testbenches
and test suites. All major designs in this thesis have testbenches that
are fairly comprehensive. Extra care has been taken to verify the most
important details and the details that are thought most likely to contain
bugs. For example, when writing the test suite for the arithmetic unit
described in Section 7.1, care was taken to exercise all valid forwarding
paths. However, only a few different values were tested. That is, not
all possible combinations of input values were tested for addition and
subtraction due to the huge amount of time this would take and the low
likelihood that there would be a bug in the adder itself.
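The strategy of covering every forwarding path while sampling only a few
operand values can be sketched as a cross product. The path names and
corner values below are hypothetical placeholders, not the actual signals
of the design in Section 7.1:

```python
import itertools

# Hypothetical stand-ins: the real forwarding paths and operand
# widths belong to the arithmetic unit of Section 7.1 and are not
# reproduced here.
FORWARD_SOURCES = ["regfile", "alu_out", "mem_out", "writeback"]
CORNER_VALUES = [0, 1, 0x7FFF, 0x8000, 0xFFFF]

def test_vectors():
    """Every (src_a, src_b) forwarding combination, but only a few
    operand values: full value coverage of even a 16-bit adder
    would already mean 2**32 additions, which the text argues is
    unnecessary given how unlikely a bug in the adder itself is."""
    for src_a, src_b in itertools.product(FORWARD_SOURCES, repeat=2):
        for a, b in itertools.product(CORNER_VALUES, repeat=2):
            yield (src_a, src_b, a, b)

vectors = list(test_vectors())
print(len(vectors))  # 4*4 path pairs x 5*5 value pairs = 400 vectors
```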
There are no known bugs in the current version of the designs de-
scribed in this thesis, but it is possible that there are unknown bugs.
However, since care has been taken to exercise the most important parts
of the designs thoroughly, it is very likely that the remaining bugs will
be minor issues that will have no or little effect on the conclusions drawn
in this thesis.
Finally, it should also be noted that testbenches were not written for
many of the simple test designs in Chapter 5. It was felt that the correct-
ness of the source code of for example a simple adder could be ensured
merely by inspecting the source code and by looking at the synthesis
report, mapping report, and in some cases the actual logic that was syn-
thesized. However, the more tricky designs described in Chapter 5, such
as the MAC unit, do have testbenches.

3.3.3 A Possible Bias Towards Xilinx FPGAs


Due to the author’s extensive experience with Xilinx FPGAs, much of
this thesis has been written with Xilinx FPGAs in mind. All of the case
2 At this point honesty compels the author to admit that he has been responsible for
more than one bug in his life. . .
studies discussed in this thesis were optimized for Xilinx FPGAs, and
often a particular Xilinx FPGA family as well. Care has been taken to
avoid an unfair bias towards Xilinx in the parts that discuss other FPGA
families but it is nevertheless possible that some bias may still be present
and it is only fair to warn the reader about this.
There is also a clear bias towards SRAM based FPGAs in this text as
FPGA families manufactured using flash and antifuse technologies are
not typically designed for high performance.

3.3.4 Online Errata


As described above, there are many possible error sources. While care
has been taken to minimize these, few works of this magnitude are ever
completely free of minor errors. The reader is encouraged to visit either
the author's homepage at http://www.da.isy.liu.se/~ehliar/
or the page for the thesis at http://urn.kb.se/resolve?urn=urn:
nbn:se:liu:diva-16732 to see if any errata have been published.
(The latter URL is guaranteed by Linköping University to be available
for a very long period of time.) Likewise, if the reader encounters some-
thing that seems suspicious in the thesis, the author would very much
like to know about it.

3.4 Method Summary


The method used in this thesis to optimize FPGA designs can be sum-
marized as follows:

• Do not fix the design specification until a prototype has shown


where the performance problems are located and a reasonable plan
on how to deal with the performance problems has been finalized.
To reach the highest performance it may be necessary to implement
a prototype with much of the functionality required of the final sys-
tem before the design specification can be finalized.
• Use synchronous design methods and avoid using techniques such


as latches and clock gating

• Investigate whether the synthesis tool has used suboptimal constructs.
If so, rewrite the HDL code to infer or instantiate better logic.

• Investigate if floorplanning can help the performance as well.

• Vary synthesis and backend options to determine which options


lead to the highest performance.

• Manage the timing constraints appropriately for the tool that is


used for place and route (e.g. increase the timing constraints it-
eratively until it is no longer possible to meet timing when using
Xilinx devices)

• When setting timing constraints, assume a clock with no jitter
and worst case parameters for temperature and supply voltage

• Be wary of bugs in both the design and the CAD tools. Always
check that the reported area and performance are reasonable.
Chapter 4
ASIC vs FPGA

There are many similarities when designing a product for use with either
an FPGA or an ASIC. There are also many differences in the capabilities
of an FPGA and an ASIC. This chapter will concentrate on the most im-
portant differences.

4.1 Advantages of an ASIC Based System


The advantages of an ASIC can be divided into four major areas: unit
cost, performance, power consumption, and flexibility.

4.1.1 Unit Cost


One of the biggest advantages which an ASIC based product enjoys over
an FPGA based product is a significantly lower unit cost once a certain
volume has been reached. Unfortunately the volume required to off-
set the high NRE costs of an ASIC is very high which means that many
projects are never a candidate for ASICs. For example, in [8], the authors
show an example where the total design cost for a standard cell based
ASIC is $5.5M whereas the design cost for the FPGA based product is
$165K. It is clear that the volume has to be quite high before an ASIC
can be considered. It should also be noted that this comparison is for a
0.13 µm process. More modern technology nodes have even higher NRE


costs and therefore even higher volumes are necessary before it makes
sense to consider an ASIC.
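The break-even point implied by these numbers can be estimated with a
simple linear cost model. Only the NRE figures come from the comparison
in [8]; the per-unit prices below are invented purely for illustration:

```python
def breakeven_units(nre_asic, nre_fpga, unit_asic, unit_fpga):
    """Volume at which total ASIC cost (NRE + units) drops below
    total FPGA cost, assuming constant per-unit prices."""
    assert unit_fpga > unit_asic
    return (nre_asic - nre_fpga) / (unit_fpga - unit_asic)

# NRE figures from the comparison cited in the text; the $15 ASIC
# and $120 FPGA unit prices are invented for illustration only.
volume = breakeven_units(5_500_000, 165_000, 15, 120)
print(f"break-even at about {volume:,.0f} units")
```

Even with a very favorable per-unit spread, tens of thousands of units
are needed before the ASIC pays off, which matches the text's point that
many projects are never candidates for an ASIC.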

4.1.2 Higher Performance


Another reason for using an ASIC is the higher performance which can
be gained by using a modern ASIC process. During a comparison of over
20 different designs it was found that an ASIC design was on average 3.2
times faster than an FPGA manufactured on the same technology node
[9]. This is slightly misleading though, as FPGAs are often manufactured
using the latest technologies whereas an ASIC could be manufactured
using an older technology for cost reasons. In this case the performance
gap will be lower.

4.1.3 Power Consumption


An ASIC based system usually has significantly lower power consump-
tion than a comparable FPGA based system. While some FPGAs specifically
target low power users, such as the iCE65 from SiliconBlue, most FPGAs
do not. The main reason for the higher power consumption is of course
the reconfigurability of the FPGA. There is a lot of logic in an FPGA
which is used only for configuration. While the dynamic power
consumption of the reconfiguration logic is practically zero, all of the
configuration logic contributes to the leakage power.
Another reason why ASICs are better from a power consumption per-
spective is that it is easier to implement power reduction techniques like
clock gating and power gating.
While it is possible to perform clock gating in an FPGA, it is seldom
used in practice. One reason is that FPGAs have a limited number
of signals optimized for clock distribution. While flip-flops in an FPGA
can also be fed from a local connection, this will complicate static timing
analysis. FPGA vendors strongly recommend users to avoid clock gating
and to use the clock enable signal of the flip-flops instead.
While clock gating is possible but hard to do in an FPGA today, selec-
tive power gating is not possible in modern FPGAs. However, in Actel's
Igloo [10] FPGA it is possible to freeze the entire FPGA by using a special
Flash*Freeze pin. While Actel does not say exactly how this is implemented,
it is reasonable to assume that some sort of power gating is involved.
Spartan 3A FPGAs have a similar mode, activated by a suspend pin, which
allows the device to retain its state while in a low power mode.
True selective power gating has also been investigated in a modified
Spartan 3 architecture [11], but the authors state that there is not enough
commercial value in such features yet due to the performance and area
penalty of the power gating features.

4.1.4 Flexibility

The final main reason for using an ASIC instead of an FPGA is the flex-
ibility you gain with an ASIC. An ASIC allows the designer to imple-
ment many circuits which are either impossible or impractical to create
in the programmable logic of an FPGA. This includes for example A/D
converters, D/A converters, high density SRAM and DRAM memories,
non volatile memories, PLLs, multipliers, serializers/deserializers, and
a wide variety of sensors.
Many FPGAs do contain some specialized blocks, but these blocks are
selected to be quite general so that they are usable in a wide variety of
contexts. This also means that the blocks are far from optimal for many
users. In contrast, an ASIC designer can use a block which has been
configured with optimal parameters for the application the designer is
envisioning. This allows an ASIC designer to both save area and increase
the performance.
The ultimate in flexibility is the ability of an ASIC designer to design
either part of the circuit or the entire circuit using full custom methods.
This allows the designer to create specialized blocks which have no par-
allel in FPGAs. For example, if a designer wanted to create an image
processor with integrated image sensor, this would not be possible to do
with the FPGAs currently available.


Full custom techniques are also able to reduce the power and area or
increase the performance. For a more thorough discussion about this, see
for example [12].

4.2 Advantages of an FPGA Based System


While there are many advantages to an ASIC, there are also many advan-
tages to be had when using an FPGA.

4.2.1 Rapid Prototyping


As there is no manufacturing turn around time for an FPGA based sys-
tem, a design can quickly be tested and evaluated even though parts of
the design are not yet completed. In contrast, most companies would
not be able to afford to manufacture a partially functioning ASIC just for
testing purposes. This means that developers can start developing the
firmware on a partially working prototype when using FPGAs instead
of using a much slower simulation model.
A hybrid approach is to use an FPGA for prototyping and an ASIC for
production. This is a good and easy solution in some cases, but in other
cases it can be tricky. If the system will connect to external interfaces
which have to run at high speed, the FPGA might not be able to run at
this speed which means that some compromises have to be made. For
example, while prototyping an ASIC with a PCI interface, the PCI bus
might have to be underclocked as described in [13].

4.2.2 Setup Costs


The setup cost for using a low end FPGA is practically zero. Major FPGA
vendors have a low end version of their design tool available for free
download. The full versions of the vendor tools are also available for
a relatively low fee. It is also possible to buy a low end version of an
HDL simulator from the FPGA vendors cheaply. There are also a large
number of low cost prototype boards available for various FPGAs. All of
this means that anyone, even hobbyists, can start using an FPGA without
having to buy any expensive tools. This is certainly not true for ASICs as
the tool cost alone can be prohibitive in many cases.
The other reason for the low setup cost is that the use of an FPGA
means that the mask costs associated with an ASIC are avoided which
can be a significant saving for a modern technology.

4.2.3 Configurability

There are two main reasons why the configurability of an FPGA is impor-
tant. The cost reason has already been briefly mentioned in Section 4.2.2.
The other reason is that it is possible to deploy bug-fixes and/or up-
grades to customers if a reconfigurable FPGA is used.
If a one time programmable FPGA, such as a member of Actel’s anti-
fuse based FPGA family, is used, this is of course not possible. It will
still be possible to change the configuration of newly produced prod-
ucts without incurring the large NRE cost associated with an ASIC mask
change though.
Another interesting possibility is the ability to reconfigure only parts
of an FPGA while the FPGA is still running, so called partial dynamic
reconfiguration. This capability is present in the Xilinx Virtex series from
Virtex-II and up. This could for example mean that a video decoding
application could have a wide variety of optimized decoding modules
stored in flash memory. As soon as the user wants to play a specific video
stream, a decoding module optimized for that particular video format is
loaded into the FPGA.
The advantage is of course that a smaller FPGA could be used. The
disadvantage is that the tool support for dynamic reconfiguration is lim-
ited at the moment. Simulating and verifying such a design is also con-
siderably more difficult. Although a large number of research publica-
tions have studied partial reconfiguration, it is seldom used in real
applications yet.

Finally, it should also be mentioned that by handling the configuration
of an FPGA yourself you don't need to hand over your design files
to an outside party. If an ASIC would be used instead of an FPGA, you
will have to trust that the foundry will employ strict security measures
to keep your design secret. This is not a big problem in most cases, but
it could potentially be troublesome when very sensitive information is
contained in the design files such as cryptographic keys.

4.3 Other Solutions


There are a few other solutions available which are worth mentioning
even though they are mostly outside the scope of this thesis. If the price
of a large FPGA is a concern, Xilinx has a product called Easypath [14].
The idea behind this product is that a certain design will only utilize a
small amount of the available routing resources in an FPGA. If an un-
used part of the routing is not working correctly, this doesn’t matter to
that particular design. In fact, only the routing which is actually used has
to be tested which will reduce the testing cost significantly. Xilinx guar-
antees that the FPGAs they sell under this program will work correctly
for up to two customer specific bitstreams. The advantage is of course
that the customer will get a substantial discount for a product that works
identically to a fully tested FPGA. The disadvantage is that the configura-
bility of the FPGA can no longer be fully used. While Xilinx guarantees
that all LUTs have been tested, which allows some bugs to be fixed, it
is no longer possible to reconfigure the FPGA with an arbitrary design
with any guarantees of success.
Altera’s alternative to Easypath is called Hardcopy [15]. This is a
structured ASIC based product with a similar architecture to Altera’s
FPGA families. The advantage is that a Hardcopy based design will be
faster than an FPGA based design while the unit cost will also be lower.
Another advantage is that the footprint of the HardCopy device is the
same as a regular FPGA. It is therefore easy to migrate a PCB from a
Stratix FPGA to a Hardcopy device. The breakeven point where it makes
sense to use a HardCopy device varies from design to design, but it is
probably somewhere between 1k units for a large design and 20k units
for a small design [16]. The major disadvantage is of course the NRE cost.
Since Altera is using a structured ASIC approach, custom masks have to
be created for some metal layers. This also means that bugs which could
have been fixable with a LUT change in a Xilinx Easypath device will not
be fixable here.
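The breakeven reasoning above can be sketched as a one-line calculation. All cost figures below are purely hypothetical and are not taken from the thesis:

```python
def breakeven_units(nre_cost, fpga_unit_cost, asic_unit_cost):
    """Return the production volume above which a (structured) ASIC with
    a one-time NRE cost becomes cheaper than shipping FPGAs."""
    if fpga_unit_cost <= asic_unit_cost:
        raise ValueError("no breakeven: FPGA unit cost must exceed ASIC unit cost")
    saving_per_unit = fpga_unit_cost - asic_unit_cost
    # Total costs are equal when nre_cost == units * saving_per_unit
    return nre_cost / saving_per_unit

# Hypothetical figures: $200k NRE, $500 per FPGA, $100 per structured ASIC
print(breakeven_units(200_000, 500, 100))  # -> 500.0
```

With a small design the per-unit saving shrinks while the NRE stays fixed, which is why the breakeven volume quoted above grows toward the 20k range.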
Another interesting solution is eASIC’s Nextreme product family [17].
This is a cross between a traditional gate array and an SRAM based
FPGA. SRAM based lookup-tables are used for the logic while a cus-
tom created via layer is used to program the routing. This means the die
area of a Nextreme solution when compared to a similar FPGA solution
will be smaller, which leads to a reduction in cost. The major advantage
of Nextreme when compared to a traditional gate array is that only one
via layer has to be customized. In theory there is no NRE cost for low
volume production since this via customization can be done by using
eBeam technology. For high volume production it is more cost effective
to create a custom mask for the via layer though. Another hybrid be-
tween an SRAM based FPGA and structured ASICs is Lattice’s MACO
blocks [18]. This is basically an FPGA which is simply combined with a
structured ASIC. The FPGA works just as a regular SRAM based FPGA
and the MACO blocks work just as a structured ASIC. Unfortunately, it
seems that the significance of gate arrays will be reduced in the future as
FPGAs take over their role [19].

4.4 ASIC and FPGA Tool Flow


A comparison of the design flow for an ASIC and an FPGA (exemplified
using Xilinx tools and terminology) is shown in Table 4.1. (The ASIC
design flow has been adapted from [20]). There are some steps that are
enclosed in parentheses on the FPGA side. These can be done but are not
required. Simulating the post synthesis netlist or post place and route
netlist could be done when it is suspected that a bug is present in the
synthesis tool or the backend tools. Physical synthesis can be
used if the tool supports it.
As can be seen, even if all optional parts of the FPGA flow are used,
the ASIC flow is considerably more complicated than the FPGA flow
and requires a considerable level of expertise to fully utilize it.
ASIC flow Xilinx Design flow


Specification Specification
HDL Code HDL Code
Behavioral Simulation Behavioral Simulation
Preliminary RTL Floorplan
Synthesis Synthesis
Static Timing Analysis
Floorplanning (Floorplanning)
Physical Synthesis (Physical Synthesis)
Scan Chain Ordering and Routing
Verification
Static Timing Analysis
Gate level full timing simulation (Post synthesis simulation)
Functional / Formal Verification
Clock Tree synthesis
Routing - Optimization Place and Route
Parasitic Extraction
Final Verification
Static Timing Analysis Static Timing Analysis
Gate Level Full Timing Simulation (Post place and route simulation)
Functional / Formal Verification
Generate Configuration Bitstream
Design Rule Checks Design Rule Checks
Layout Versus Schematic Checks
Tapeout Configure FPGA with bitstream

Table 4.1: Comparison of the ASIC and FPGA design flow (not including
power optimizations)
Chapter 5
FPGA Optimizations and ASICs

In this chapter the performance of logic implemented in an FPGA and an
ASIC will be compared. At first, this looks like a relatively easy task:

1. Select a design to test

2. Synthesize the design for an FPGA

3. Synthesize the design for an ASIC

4. Compare the performance of these designs

In practice, this is a decidedly non-trivial problem if a completely fair
comparison of the capabilities of FPGAs and ASICs is desired. The best
way to do this would be to compare an FPGA where the contents of all
look-up tables are optimal and the routing is optimal to an ASIC design
where the placement, sizing, and routing are all optimal. This is unfor-
tunately an optimization problem of extreme complexity, even for small
designs. For anything but the smallest toy designs it is intractable.
A more realistic comparison would be to use powerful methods such
as full custom ASIC design that can produce very efficient designs even
though an optimal solution is not guaranteed. At the same time, the
same design should be implemented on an FPGA where every LUT has
been manually optimized and extensive floorplanning has been used to
optimize timing. While this would certainly be an interesting research
project, it would also require a large amount of time, making it impracti-
cal to do for anything but the smallest designs.

The approach used in this thesis is more practical and intends to
highlight a scenario where FPGAs are used for relatively low
volume production and where the design is later on ported to an ASIC
for high volume production, primarily for cost reduction. Although a
lower total cost could be achieved by immediately designing for an ASIC,
this is a risky move for several reasons. For example, if the market for a
product is uncertain, it can be a good idea to avoid the high NRE costs of
an ASIC since it is not certain that these costs can be recouped. Even if
there actually is a huge market for a product it can still be a good idea to
use an FPGA in the beginning due to its short time to market.

What this part of the thesis intends to highlight is the impact of var-
ious FPGA optimized constructs when the design is ported to an ASIC.
The intention is that the reader should know what the impact is on an
ASIC port when using different kinds of FPGA optimizations.

Finally, it is important to point out that the intention of this chapter is
not to recommend a certain FPGA family or vendor. This is why relative
performance numbers are used instead of absolute numbers. This chap-
ter is rather intended to show the strengths and weaknesses of modern
FPGAs when compared to ASICs. To a lesser degree it will also show
that different FPGA families have different strengths which may be of
interest when trying to optimize a system for a specific FPGA.

ASIC Porting Hint: Important design hints are marked like this. All
guidelines are also present in Appendix A.
5.1 Related Work


There is surprisingly little information available on converting designs
optimized for FPGAs to ASICs. A brief introduction is given in for ex-
ample [21], but few, if any, decent in-depth guides are publicly available.
There is much more material available on how to port an ASIC design
to an FPGA however [22] [23] [24]. These resources may still be of
interest when creating an FPGA design that will later be ported to an
ASIC though. Finally, there are also a number of guides available that
discuss how to port a design to a Structured ASIC [25] [26] [27].
One noteworthy publication that discusses the performance differ-
ence of ASICs and FPGAs is [9]. In this publication the area, power, and
performance of over 20 designs are compared. However, the designs in
this study were not optimized for a specific FPGA or ASIC process [28].

5.2 ASIC Port Method


The method investigated in this thesis for the ASIC ports is based on
porting FPGA designs directly to an ASIC with a minimum of effort. To
support RTL code with FPGA primitives instantiated, a small compat-
ibility library has been written. This library has synthesizable models
of FPGA primitives such as LUTs, flip-flops and even a limited version
of the DSP48 block. This means that it is easy to retarget even an ex-
tremely FPGA optimized design to an ASIC. The primary advantage of
this kind of porting method is that the time spent in verification should
be minimized due to the minimal amount of logic changes necessary.
(Assuming of course that the verification for the original FPGA product
was thorough.)
Unless otherwise noted in the text, no floorplanning was done and
the designs were synthesized without any hierarchy. This kind of low
effort porting is compatible with the scenario outlined above, where the
primary reason for an ASIC port is cost reduction instead of performance.
The toolchain used for the ASIC ports in this thesis is based on Synopsys
Design Compiler (Version A-2007.12-SP5-1) and Cadence SOC Encounter
(Version 05.20-s230_1). All performance numbers quoted in this
thesis are based on the reports from the static timing analysis. Unless
otherwise noted, the area for power rings and I/O pads is not included
in any figures.
It could also be noted that this compatibility method could be used
to port the synthesized FPGA netlist as well. This would give stronger
guarantees that the ASIC will have the same functionality as the FPGA
(for example, if don’t cares have been used incorrectly in the source code
the behavior of the design could differ depending on the optimizations
performed by the synthesis tool). Unfortunately this is probably not pos-
sible to do without violating the end user license agreement of the FPGA
design tools so this will not be investigated further in this thesis.

5.3 Finding Fmax for ASIC Designs


As has already been discussed in Section 3.2, there are many things that
impact the maximum performance of an FPGA design. Most of these are
also valid for ASIC designs. One of the largest differences compared to
finding the maximum frequency of an FPGA design is that there are many
more ways to improve the performance of an ASIC design by trading area for
when the frequency requirements are low and a large but fast unit when
a higher maximum frequency is requested.
For most designs such large variations are not common as there are
typically large parts of a design that do not require any extensive opti-
mizations to meet timing. Only the critical paths have to be optimized
by using expensive structures like logic duplication, advanced adder
schemes, etc. As an experiment, the OpenRisc processor with 8 KiB data
cache and 8 KiB instruction cache was synthesized to a 130 nm process.
When optimized for area, the processor had an area of 2.16 mm^2 and a
maximum frequency of 36 MHz. When optimized for speed, the area
was 2.20 mm^2 and the maximum frequency was 178 MHz. In this case the
choice of speed or area did not make a large difference in the area but a
huge difference in the speed.
While it is possible to trade area for frequency in an FPGA as well,
it is seldom possible to do so for primitive constructs like adders (under
the assumption that reasonable sized adders are used).

5.4 Relative Cost Metrics


In this chapter, the relative cost of various design elements in terms of
area and frequency are mentioned. The intention is that this chapter will
show whether a certain construct is a good idea to use in a certain FPGA.
Another important factor that will be discussed in this chapter is if it is a
good idea to use a certain architecture in an ASIC.
The relative area and performance for a 32-bit adder were set to 1 and
all other costs were derived from this. The ASIC area cost was measured
by looking at the size of the block when synthesized, placed, and routed
(excluding the power ring). No I/O pads were added to the ASIC
designs. Figure 5.1 shows an example of what a report in this thesis can
look like. Since the area and Fmax of a 32-bit adder are used as the reference,
almost all values are 1 in this table. The exception is an ASIC optimized
for area where the values are relative to an ASIC optimized for speed.
In Figure 5.1, this means that when optimized for speed in an ASIC, the
adder is roughly 9 times faster than the same adder optimized for area.
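The relative numbers in this chapter's tables boil down to a simple normalization against the reference adder. The absolute figures below are hypothetical, chosen only so that the ratios match the ASIC (Area) row of Figure 5.1:

```python
def relative_metrics(fmax, area, ref_fmax, ref_area):
    """Normalize a circuit's maximum frequency and area against the
    reference circuit (the speed-optimized 32-bit adder)."""
    return fmax / ref_fmax, area / ref_area

# Hypothetical numbers: speed-optimized adder at 400 MHz with area 1.0,
# area-optimized adder at 44 MHz with area 0.21
rel_perf, rel_area = relative_metrics(44.0, 0.21, 400.0, 1.0)
print(rel_perf, rel_area)  # -> 0.11 0.21
```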
Determining the area of a circuit in an ASIC is relatively straightfor-
ward. Doing the same for an FPGA design is more difficult as the area
usage of a certain FPGA design is a multidimensional value. That is,
there is no obviously correct way to map the number of used LUTs, flip-
flops, DSP blocks, memory blocks, IO pads, and other FPGA primitives
to one single area number. One solution is to measure the silicon area
of the various components in the FPGA and calculate the active silicon
area for a certain design. This is basically the path that is used in [9].
This metric is certainly interesting from an academic point of view,
especially when discussing FPGA architectures and how to make FPGAs
more silicon efficient.

Device        Relative      Relative
              performance   area cost
Spartan 3A    1             1
Virtex 4      1             1
Virtex 5      1             1
Cyclone III   1             1
Stratix III   1             1
ASIC (Speed)  1             1
ASIC (Area)   0.11          0.21

The performance of this circuit is used as a reference for all other
performance comparisons in this chapter!

Figure 5.1: 32-bit adder

Knowing the exact silicon area of a LUT or a flip-flop is not going to
help a VLSI designer however. Therefore another metric will be used in
this thesis. If we assume that the designer considers all component types
in the FPGA to be of equal total monetary value, it could be said that,
for example, all memory blocks and all slices have an equal combined area
cost. (E.g. if there are 10000 slices and 20 memories in an FPGA, the
designer will value 1 memory and 500 slices equally. The total area cost
of a design with for example 520 slices and 5 memories would in this
example be 520 + 5 · 500 = 3020 pseudo slices.)
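The pseudo slice example above can be written as a small helper function; the dictionary keys are just illustrative resource names:

```python
def pseudo_slice_cost(used, totals, reference="slices"):
    """Area cost in 'pseudo slices': every resource type in the FPGA is
    assumed to carry the same total monetary value as all slices together."""
    cost = 0.0
    for resource, count in used.items():
        # One unit of a resource is worth totals[reference]/totals[resource] slices
        cost += count * totals[reference] / totals[resource]
    return cost

# The example from the text: an FPGA with 10000 slices and 20 memories,
# and a design using 520 slices and 5 memories
totals = {"slices": 10000, "memories": 20}
used = {"slices": 520, "memories": 5}
print(pseudo_slice_cost(used, totals))  # -> 3020.0
```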

While it is unlikely that a designer will consider all components to be
exactly equal in value, it is also likely that the designer will use an FPGA
where the ratio of memory blocks or DSP blocks to slices is appropriate
to his needs. However, the author acknowledges that this metric may
be controversial to some readers and has therefore marked all area cost
figures that include converted memory or DSP blocks with a ∗ in the
tables in this chapter.

Both Fmax and area cost values are relative to
the values for the 32-bit adder in Figure 5.1.
Device        Relative      Relative
              performance   area cost
Spartan 3A    0.97          1
Virtex 4      0.98          1
Virtex 5      0.9           1
Cyclone III   0.82          2
Stratix III   0.81          1
ASIC (Speed)  0.89          1.9
ASIC (Area)   0.13          0.25

Figure 5.2: 32-bit adder/subtracter

5.5 Adders

An adder is a very commonly used component in many designs. It is also
one of the components for which FPGAs have been optimized. The carry
chains in FPGAs are usually one of the fastest parts of the FPGA if not the
fastest. However, in an ASIC, the area cost of an adder can vary wildly
depending on the timing constraints as can be seen in Figure 5.1. This
was even more pronounced when creating an adder/subtracter as seen
in Figure 5.2. Although the performance is still high, the area of the speed
optimized circuit is almost twice as large as a plain adder. (Although
experiments indicate that the area will quickly decrease if slightly less
than the absolutely highest performance is required).
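The reason an adder/subtracter fits in roughly the same FPGA resources as a plain adder is the standard two's-complement trick (background knowledge, not something the thesis spells out): invert the second operand and set the carry-in. A bit-level model:

```python
MASK = 0xFFFFFFFF  # model 32-bit wrap-around arithmetic

def addsub(a, b, subtract):
    """Shared-adder adder/subtracter: a - b is computed as a + ~b + 1,
    so subtraction reuses the single adder with an inverted operand
    and the carry-in set."""
    b_eff = (~b & MASK) if subtract else b
    carry_in = 1 if subtract else 0
    return (a + b_eff + carry_in) & MASK

print(addsub(10, 3, True))   # -> 7
print(addsub(10, 3, False))  # -> 13
```

The per-bit inversion is a simple XOR with the subtract control, which is why it can often be folded into the same LUT that feeds the carry chain.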
Another area that is interesting to investigate is the case where sev-
eral adders are used after each other. Figure 5.3 and Figure 5.4 show the
relative performance of a 32-bit adder with 3 and 4 operands. It should
be noted that the area for the Xilinx families does not quite grow linearly
due to some minor optimizations by the synthesis tool. Another interest-
ing fact is that the relative area for a 3 input adder in a Stratix III is the
same as the area for a 2 input adder. This is due to the slice architecture
of the Stratix III which has a separate adder instead of implementing the
adder partly in a LUT (there is also a separate carry chain for the LUT
that allows two adders to be implemented using the same amount of
slices as only one adder). This architecture is also the explanation why
the performance of the Stratix III based adders doesn't drop quite as
much as for the other architectures.

Both Fmax and area cost values are relative to
the values for the 32-bit adder in Figure 5.1.
Device        Relative      Relative
              performance   area cost
Spartan 3A    0.86          1.9
Virtex 4      0.83          1.9
Virtex 5      0.82          1.9
Cyclone III   0.77          2
Stratix III   0.89          1
ASIC (Speed)  0.74          1.3
ASIC (Area)   0.12          0.4

Figure 5.3: 32-bit 3 operand adder

Both Fmax and area cost values are relative to
the values for the 32-bit adder in Figure 5.1.
Device        Relative      Relative
              performance   area cost
Spartan 3A    0.62          2.9
Virtex 4      0.58          2.9
Virtex 5      0.58          2.9
Cyclone III   0.77          3
Stratix III   0.82          2
ASIC (Speed)  0.69          1.7
ASIC (Area)   0.1           0.55

Figure 5.4: 32-bit 4 operand adder

ASIC Porting Hint: Adders with more than two inputs are typically more
area efficient in ASICs than in FPGAs.
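One reason multi-operand adders map well to ASICs is that synthesis tools can reduce them with carry-save trees, where extra operands are absorbed without any carry propagation. This explanation is standard background rather than something the text states; a single 3:2 compressor layer can be modeled bit-level as:

```python
MASK = 0xFFFFFFFF  # model 32-bit arithmetic

def carry_save(a, b, c):
    """One 3:2 compressor layer: three operands are reduced to a sum word
    and a carry word using only independent full adders per bit position,
    so there is no carry chain until the final two-operand addition."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s & MASK, carry & MASK

s, c = carry_save(5, 7, 9)
print((s + c) & MASK)  # -> 21
```

Only the final sum + carry addition needs a real carry-propagate adder, which is why adding a third or fourth operand in an ASIC costs much less than a full extra adder.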

Both Fmax and area cost values are relative to
the values for the 32-bit adder in Figure 5.1.
Device        Relative      Relative
              performance   area cost
Spartan 3A    2.3†          1
Virtex 4      2†            1
Virtex 5      1.7†          1
Cyclone III   2.2†          1
Stratix III   2.1†          1
ASIC (Speed)  1.8           0.12
ASIC (Area)   1.8           0.12
† Exceeds Fmax for clock net.

Figure 5.5: 32-bit 2-to-1 mux

Both Fmax and area cost values are relative to
the values for the 32-bit adder in Figure 5.1.
Device        Relative      Relative
              performance   area cost
Spartan 3A    1.5           8
Virtex 4      1.2           8
Virtex 5      0.92          5
Cyclone III   1.3           10
Stratix III   1.7†          5
ASIC (Speed)  0.9           0.57
ASIC (Area)   0.31          0.48
† Exceeds Fmax for clock net.

Figure 5.6: 32-bit 16-to-1 mux

5.6 Multiplexers

A very important part of many datapaths is the multiplexer (mux). An
FPGA based solely on 4-input LUTs can implement a 2^n-to-1 mux using
2^n − 1 LUTs. However, modern FPGAs usually have some sort of hard-
wired muxes in the slices and CLBs to optimize the implementation of
muxes. In this case it is possible to implement a 2^n-to-1 mux using only
2^(n−1) LUTs. In the Virtex-4 architecture, muxes of sizes up to 32-to-1 can
be optimized in this way.

Both Fmax and area cost values are relative to
the values for the 32-bit adder in Figure 5.1.
Device        Relative      Relative
              performance   area cost
Spartan 3A    0.85          9.4
Virtex 4      0.83          9.5
Virtex 5      0.78          9
Cyclone III   1.1           9.4
Stratix III   1.1           7.3
ASIC (Speed)  0.85          1.5
ASIC (Area)   0.25          1.2

Figure 5.7: 32-bit bus with 8 master ports and 8 slave ports

Both Fmax and area cost values are relative to
the values for the 32-bit adder in Figure 5.1.
Device        Relative      Relative
              performance   area cost
Spartan 3A    0.77          74
Virtex 4      0.7           75
Virtex 5      0.6           59
Cyclone III   0.75          80
Stratix III   0.73          61
ASIC (Speed)  0.74          5.5
ASIC (Area)   0.24          3.4

Figure 5.8: 32-bit crossbar with 8 master ports and 8 slave ports
(implemented with muxes)

Figures 5.5 - 5.6 show the relative performance and area of 2-to-1
and 16-to-1 32-bit multiplexers. Note that the relative area of the ASIC
multiplexers when compared to the FPGA multiplexers is very low.
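The LUT counts behind these figures follow the formulas given at the start of this section (2^n − 1 LUTs for a pure 4-input LUT tree, 2^(n−1) LUTs when the hard-wired slice muxes are used) and can be written down directly:

```python
def mux_luts(n_inputs, hardwired_muxes=True):
    """LUTs needed for an n-to-1 mux (n a power of two, n >= 2) built
    from 4-input LUTs, with or without the hard-wired slice muxes."""
    assert n_inputs >= 2 and n_inputs & (n_inputs - 1) == 0
    if hardwired_muxes:
        # Each LUT implements one 2-to-1 mux; the hard-wired muxes
        # (e.g. MUXF5/MUXF7/MUXF8) combine the LUT outputs for free.
        return n_inputs // 2
    # A pure LUT tree needs n-1 2-to-1 muxes, one LUT each.
    return n_inputs - 1

print(mux_luts(16, hardwired_muxes=False))  # -> 15
print(mux_luts(16))                         # -> 8
```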
It is clear that muxes are not likely to be a big problem when porting
a design to an ASIC. However, it may be possible to improve the design
of the ASIC port by inserting extra muxes in the design. A cheap way to
do this could be to replace a mux based bus with a full crossbar. If the
crossbar is using the same bus protocol as the bus, relatively little would
have to be redesigned and reverified. This is shown in Figure 5.7 and
Figure 5.8, which illustrate a bus and crossbar with 8 master ports and
8 slave ports. There are other ways that muxes could improve the per-
formance of a design as well, but they are likely to be more complex to
implement and verify (e.g. adding more result forwarding to a proces-
sor).
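As a first-order sanity check, counting 2-to-1 mux equivalents reproduces the roughly 8x area gap between the bus in Figure 5.7 and the crossbar in Figure 5.8. This mux-counting model is an assumption of the sketch, not a model taken from the thesis:

```python
def mux2_count(n):
    """Number of 2-to-1 muxes needed to build an n-to-1 mux."""
    return n - 1

def bus_cost(masters, slaves, width=32):
    """Shared bus: one masters-to-1 mux towards the slaves and one
    slaves-to-1 mux for the return path, per bit of bus width."""
    return width * (mux2_count(masters) + mux2_count(slaves))

def crossbar_cost(masters, slaves, width=32):
    """Full crossbar: every slave has its own master mux and every
    master has its own return mux."""
    return width * (slaves * mux2_count(masters) + masters * mux2_count(slaves))

print(crossbar_cost(8, 8) / bus_cost(8, 8))  # -> 8.0
```

The predicted 8x ratio is close to the measured relative area cost ratio in the figures (74 / 9.4, roughly 7.9 on Spartan 3A), suggesting the mux logic dominates both structures.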
ASIC Porting Hint: While multiplexers are very costly in an FPGA, they
are quite cheap in an ASIC. Optimizing the mux structure in the FPGA
based design will have little impact on an ASIC port.

ASIC Porting Hint: When porting an FPGA optimized design to an ASIC
it may be possible to increase the performance of the ASIC by adding
muxes in strategic locations, for example by replacing a bus with a
crossbar.

5.7 Datapath Structures with Adders and Multiplexers
Just investigating standalone components is not very interesting. What
is interesting is to look at a datapath that contains muxes in combination
with other elements. The tradeoffs when creating a design with many
muxes for an FPGA is not the same as when creating a design for an
ASIC. In [29], Paul Metzgen describes the relative tradeoffs off different
FPGA building blocks and comes to the conclusion that “The Key to Op-
timizing Designs for an FPGA . . . is to Optimize the Multiplexers”.
A simple example of how muxes can be optimized is shown in Fig-
ure 5.9. In this figure, a 2-to-1 mux is used for one of the operands to
a 32 bit adder. A naive area cost estimate for this construct in an FPGA
would take the cost of a 32 bit 2-to-1 mux and add it to the cost of a 32
bit adder. However, this doesn’t take into account that the LUTs are not
fully utilized in a 32-bit adder in all contemporary Xilinx devices. This
means that it is possible to merge the mux into the LUTs used for the
adder yielding the same LUT cost as for a regular 32-bit adder. The
performance of this kind of solution will also be very good, almost the
same as for just the 32-bit adder.

Both Fmax and area cost values are relative to
the values for the 32-bit adder in Figure 5.1.
Device        Relative      Relative
              performance   area cost
Spartan 3A    0.99          1
Virtex 4      0.97          1
Virtex 5      0.91          1
Cyclone III   0.81          2
Stratix III   0.75          1
ASIC (Speed)  0.69          1.1
ASIC (Area)   0.14          0.25

Figure 5.9: 32-bit adder with 2-to-1 mux on one operand

Both Fmax and area cost values are relative to
the values for the 32-bit adder in Figure 5.1.
Device        Relative      Relative
              performance   area cost
Spartan 3A    0.85          2
Virtex 4      0.83          2
Virtex 5      0.8           2
Cyclone III   0.79          3
Stratix III   0.79          2
ASIC (Speed)  0.67          1.2
ASIC (Area)   0.14          0.3

Figure 5.10: 32-bit adder with 2-to-1 mux on both operands

However, if two 2-to-1 muxes are used, one for each operand, the area
cost will double since it is not possible to put the second mux into the
same LUTs, as seen in Figure 5.10. Finally, if larger muxes like 4-to-1
muxes are used for both operands of the adder, the area cost of the
FPGA designs will go up significantly and the performance will drop
considerably, as can be seen in Figure 5.11.
There are two final examples that are interesting to mention. The first
is a rather special case which may be good to know about; the case where
a two-input and gate is used as the input for both adder operands, as
can be seen in Figure 5.12. Note that this has the same area cost in Xilinx
FPGAs as a plain adder. This is because the MULT_AND primitive can
be used for one of the and gates. A similar structure was also used in
the ALU in Microblaze, where the result can be either OpB + OpA, OpB
- OpA, OpB, or OpA [30]. (Without using the MULT_AND primitive it
would not be possible to get the last operation here (OpA).) Stratix III
also allows this structure to be implemented using the same amount of
resources as a plain adder because it has a dedicated adder inside each
slice and doesn't need to use LUTs to create the adder itself. The LUTs
in the slice can therefore be used separately to create these (and other)
functions.

Both Fmax and area cost values are relative to
the values for the 32-bit adder in Figure 5.1.
Device        Relative      Relative
              performance   area cost
Spartan 3A    0.77          5
Virtex 4      0.69          5
Virtex 5      0.71          3
Cyclone III   0.77          5
Stratix III   0.88          3
ASIC (Speed)  0.63          1.8
ASIC (Area)   0.12          0.4

Figure 5.11: 32-bit adder with 4-to-1 mux on both operands

The other case is when an adder/subtracter is used instead of just an
adder. Figure 5.13 and Figure 5.14 show the properties of an adder/
subtracter without and with a 2-to-1 input mux on one of the inputs. The
plain adder/subtracter can be implemented with the same area as an
adder in all FPGAs. However, it is not possible to combine an adder/
subtracter with a mux in most FPGAs.
Overall, it is hard to say anything definitive about the cost of these
components in an ASIC since there is a large gap between the area and
frequency of the slowest and fastest variation. What can be seen in these
50 FPGA Optimizations and ASICs

Both Fmax and area cost values are relative to


the values for the 32-bit adder in Figure 5.1.
Device Relative Relative
performance area cost
Spartan 3A 0.99 1
Virtex 4 0.98 1
Virtex 5 0.78 1
Cyclone III 0.84 3
Stratix III 0.85 1
ASIC (Speed) 0.89 0.89
ASIC (Area) 0.11 0.27

Figure 5.12: 32-bit adder with 2-input bitwise and on both operands

Both Fmax and area cost values are relative to the values for the 32-bit adder in Figure 5.1.

Device        Relative performance   Relative area cost
Spartan 3A    1                      1
Virtex 4      0.98                   1
Virtex 5      0.89                   1
Cyclone III   0.83                   2
Stratix III   0.82                   1
ASIC (Speed)  0.86                   1.7
ASIC (Area)   0.14                   0.25

Figure 5.13: 32-bit add/sub

Both Fmax and area cost values are relative to the values for the 32-bit adder in Figure 5.1.

Device        Relative performance   Relative area cost
Spartan 3A    0.83                   2
Virtex 4      0.84                   2
Virtex 5      0.83                   1
Cyclone III   0.82                   2
Stratix III   0.89                   2
ASIC (Speed)  0.62                   2.4
ASIC (Area)   0.14                   0.31
Note that it is important that this is implemented as a +/- b, where b is the output of the mux.

Figure 5.14: 32-bit add/sub with 2-to-1 mux for one operand

figures, especially in the area optimized figures, is that this
particular kind of FPGA optimization is not very helpful for ASIC
performance or ASIC area. The area when optimizing for area is (not
surprisingly) about the same as the combined area of the individual
components.
ASIC Porting Hint: While a lot of performance and area can be gained in
an FPGA by merging as much functionality into one LUT as possible, this
will typically not decrease the area cost or increase the performance
of an ASIC port.

5.8 Multipliers
The most common method for multiplication in FPGAs today is to use
one of the built-in multiplier blocks that are present in almost all
modern FPGAs. While some FPGAs such as the Virtex-II and Spartan-3 have
a standalone multiplier as a separate block, other FPGAs also integrate
accumulators into the same block. The latter are commonly called DSP
blocks.
In many cases there are optional pipeline stages built into these multi-
pliers and the maximum performance can only be reached if these pipeline
stages are utilized. In the Spartan 3A for example, there are optional
pipeline stages before and after the combinational logic of the multiplier
whereas a DSP48 block in a Virtex-4 can contain up to 4 pipeline reg-
isters. (Although the fourth register will not increase the performance
as it is only present to easily allow for construction of large multipliers
without having to store intermediate results in flip-flops in the FPGA
fabric [31]).
As an example of how to optimize a circuit for the DSP blocks we will
study a typical MAC unit in a simple DSP processor when implemented
on a Virtex-4 or 5. The MAC unit contains a 16× 16 multiplier and four
48-bit accumulator registers. This could be implemented as seen in Fig-
ure 5.15. Unfortunately, the FPGA performance is not very good in this
case since it is not possible to use the built-in accumulation register of the
DSP48 block.

Both Fmax and area cost values are relative to the values for the 32-bit adder in Figure 5.1.

Device        Relative performance   Relative area cost
Virtex 4      0.45                   30*
Virtex 5      0.41                   35*
ASIC (Speed)  0.42                   5.3
ASIC (Area)   0.084                  2.4
* Area cost includes DSP blocks and/or memory blocks as described in Section 5.4. Parts in grey are mapped to the DSP48 block.

Figure 5.15: MAC unit with 4 accumulator registers mapped to DSP48 block
Both Fmax and area cost values are relative to the values for the 32-bit adder in Figure 5.1.

Device        Relative performance   Relative area cost
Virtex 4      1.2                    31*
Virtex 5      0.82                   35*
ASIC (Speed)  0.49                   5.4
ASIC (Area)   0.11                   2.7
* Area cost includes DSP blocks and/or memory blocks as described in Section 5.4. Parts in grey are mapped to the DSP48 block.

Figure 5.16: MAC unit with 4 accumulator registers mapped to DSP48 block with pipelining

The adder could also be pipelined, as shown in Figure 5.16. This al-
lows a high performance to be reached, but the circuit is no longer iden-
tical to the circuit in Figure 5.15. In this case it is no longer possible to
perform continuous accumulation to the same accumulation register due
to the data dependency problems introduced by the pipelined adder.
Finally, a circuit that still allows for high speed operation while hav-
ing the same capabilities as the circuit in Figure 5.15 is shown in Fig-
ure 5.17. In this circuit result forwarding is used to bypass the register
file. The performance of this circuit is much higher than the naive im-
plementation and clearly demonstrates how important it is to make sure
that the DSP blocks are utilized efficiently in an FPGA.

Both Fmax and area cost values are relative to the values for the 32-bit adder in Figure 5.1.

Device        Relative performance   Relative area cost
Virtex 4      0.92                   33*
Virtex 5      0.72                   36*
ASIC (Speed)  0.49                   4.9
ASIC (Area)   0.093                  2.7
* Area cost includes DSP blocks and/or memory blocks as described in Section 5.4. Parts in grey are mapped to the DSP48 block.

Figure 5.17: MAC unit with 4 accumulator registers mapped to DSP48 block with pipelining and forwarding
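The hazard that the pipelined adder introduces, and the way forwarding
resolves it, can be illustrated with a small cycle-level toy model.
This model is an illustration of the principle only; it does not
correspond to the actual DSP48 configuration, and the scheduling
details and names are assumptions.

```python
def accumulate(products, adder_latency, forward):
    """Toy cycle model: one product is issued to the accumulator adder
    per cycle. With adder_latency > 1 and no forwarding, each addition
    reads a stale accumulator value and earlier results are lost."""
    acc = 0
    in_flight = []  # list of (cycles_left, value_when_done), FIFO order
    for p in products:
        # advance and retire finished additions
        in_flight = [(c - 1, v) for c, v in in_flight]
        while in_flight and in_flight[0][0] == 0:
            acc = in_flight.pop(0)[1]
        # operand read: newest in-flight result if forwarding, else
        # the (possibly stale) architectural accumulator
        src = in_flight[-1][1] if (forward and in_flight) else acc
        in_flight.append((adder_latency, src + p))
    while in_flight:                     # drain the pipeline
        acc = in_flight.pop(0)[1]
    return acc

assert accumulate([1, 2, 3], adder_latency=1, forward=False) == 6
assert accumulate([1, 2, 3], adder_latency=2, forward=False) == 4  # wrong sum
assert accumulate([1, 2, 3], adder_latency=2, forward=True)  == 6
```

With a single-cycle accumulate loop (Figure 5.15) back-to-back
accumulation works but the clock frequency suffers; with the two-stage
adder (Figure 5.16) results are dropped unless forwarding (Figure 5.17)
supplies the newest in-flight value.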


However, note that the performance and area cost in the ASIC doesn't
differ very much from case to case. Since the multiplier structures in
an ASIC are not fixed, it is simply not necessary to adhere to a certain
coding style. An unoptimized multiplier structure will therefore port
well to an ASIC.
However, if the multipliers have been optimized for the FPGA, an ASIC
port will probably have performance problems due to the large relative
performance difference between adders and multipliers in ASICs. This
could mean that the datapaths that contain multipliers may have to be
redesigned when ported to an ASIC.
It should also be noted that in all of these cases the RTL code was
first written using generic addition/multiplication. The synthesis tool
did not infer the desired logic however, so it was necessary to rewrite
the RTL code to instantiate a DSP48 in the desired configuration. The
Virtex-4 and Virtex-5 values are based on this rewritten design whereas
the ASIC values are based on the initial generic design. The performance
of the final two designs in the FPGAs was significantly improved by this
rewrite.
ASIC Porting Hint: If a design is specifically optimized for the DSP
blocks in an FPGA, the ASIC port is likely to have performance problems.
The datapaths with multipliers may have to be completely rewritten to
correct this.

There are a few alternative ways to design multipliers inside FPGAs.
One way is simply to create a multiplier using the FPGA fabric itself.
This approach is resource intensive and requires a decent amount of
pipelining if high throughput is required. This is probably not a good
idea unless targeting devices without any built-in multipliers. (While
most modern FPGAs do have dedicated multipliers there are a few
exceptions, like the LatticeSC.)
Another method that can be used when high performance is required
but a hardware multiplier is not available is to use a look-up table
based approach. While a direct look-up table for the function
f(a, b) = a · b is only useful for very small bit-widths, a hybrid
approach can be used where the look-up table implements the function
g(x) = x²/4. In this case a · b is calculated as g(a + b) − g(a − b).
This allows multiplication of modest bit-widths such as 8 bits to be
efficiently calculated even in FPGAs without any support for hardware
multiplication, assuming enough memory resources can be dedicated to
the look-up table.
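The quarter-square identity can be sketched in a few lines. The 8-bit
operand width and table size below are chosen to match the example in
the text; the integer division works out exactly because a + b and
a − b always have the same parity.

```python
# Quarter-square table g(x) = floor(x^2 / 4). For two 8-bit operands
# the table needs entries for x in 0..510, i.e. one 511-word memory,
# instead of a full 2^16-entry product table.
G = [x * x // 4 for x in range(511)]

def lut_mul8(a, b):
    """8-bit multiply via the identity a*b = g(a+b) - g(a-b)."""
    if a < b:                # keep a - b non-negative for the lookup
        a, b = b, a
    return G[a + b] - G[a - b]

assert lut_mul8(255, 255) == 65025
assert lut_mul8(17, 3) == 51
assert lut_mul8(0, 200) == 0
```

In an FPGA this maps naturally to one adder, one subtracter, a block
RAM holding g(x) (dual ported, so both lookups can happen in the same
cycle), and a final subtracter.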
A third method that can be used is a bit-serial approach. While
commonly used in designs for older FPGAs, it is still a valid design
method for situations where low resource usage is more important than
high throughput and low latency.
This is by no means an exhaustive survey of multiplication algorithms;
there is a wide variety of alternative methods, such as methods based
on logarithmic number systems. There is also a wide variety of ways to
optimize a multiplier if one of the operands is constant (see for
example [32] or [33]).

5.9 Memories
When synthesizing a design for an ASIC that contains memories, it is
very important that optimized memory blocks are used for large
memories. As an experiment, an 8 KiB memory block was synthesized with
standard cells for an ASIC process (and optimized for area). The area
was almost 10 times as large as that of a custom made memory block of
the same size.
There are a few disadvantages to using specialized memories as well.
The design cannot be ported to a new process by merely resynthesizing
the HDL code. Simulation is also more difficult because special simu-
lation models have to be used for the memories. These disadvantages
mean that it might not make sense to use memory compilers for small
memories as the increase in design and verification time will outweigh
the area/speed advantage of these memories.
A common way to handle customized memory blocks is to write
wrappers around the memories so that it is possible to port the design to
a new technology by changing the wrappers instead of having to rewrite
the HDL code of the design itself.
ASIC Porting Hint: Create wrapper modules for memories so that only the
wrappers have to be changed when porting the design to a new
technology.

5.9.1 Dual Port Memories


Another very important factor to take into account when designing for
both FPGAs and ASICs is the number of ports on the memory. The RAM
blocks in an FPGA usually have two independent read/write ports, which
means that many FPGA designs use more than one memory port even
though similar functionality could be achieved using only one memory
port.
It is hard to find publicly available information about the performance
and size of custom memory blocks. One datasheet which is available
without any restrictions shows that a 1 KiB dual port memory with 8-
bit width is around 63% larger than a single port memory of the same
size [34]. This datasheet is for a rather old process though (0.35 µm)
and the author has seen dual port memories for newer processes that are
more than twice as large as a single port memory.
Regardless of the exact proportions, it is certainly more expensive to

use a dual port memory than a single port memory. There are many cases
where a dual port memory is natural to use but not strictly necessary. A
good example of this is a synchronous FIFO. While it is convenient with
a dual port memory here, it is not strictly necessary as it is possible to
use for example two single port memories and design the FIFO so that
one memory is read while the other is written and vice versa. This is
described in for example [35].
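One possible shape of such a two-bank FIFO is sketched below. Entries
alternate between the banks, and a one-entry write buffer resolves the
cycle where a read and a write target the same bank. This is only an
assumed arrangement for illustration (the scheme in [35] may differ in
detail), and the sketch assumes the FIFO holds at least a couple of
entries so a read never catches up with a deferred write.

```python
class PingPongFIFO:
    """Synchronous FIFO built from two single-port RAM banks."""
    def __init__(self, depth_per_bank=8):
        self.banks = [[0] * depth_per_bank, [0] * depth_per_bank]
        self.depth = depth_per_bank
        self.wr = 0          # total writes issued
        self.rd = 0          # total reads issued
        self.pending = None  # deferred write: (bank, addr, data)

    def cycle(self, write=None, read=False):
        """One clock cycle; each bank services at most one access."""
        out = None
        busy = [False, False]
        if read:
            bank = self.rd & 1                     # reads alternate banks
            out = self.banks[bank][(self.rd >> 1) % self.depth]
            busy[bank] = True
            self.rd += 1
        # drain a deferred write if its bank is free this cycle
        if self.pending is not None and not busy[self.pending[0]]:
            b, a, d = self.pending
            self.banks[b][a] = d
            busy[b] = True
            self.pending = None
        if write is not None:
            bank = self.wr & 1                     # writes alternate banks
            addr = (self.wr >> 1) % self.depth
            if busy[bank]:
                self.pending = (bank, addr, write) # defer one cycle
            else:
                self.banks[bank][addr] = write
            self.wr += 1
        return out
```

Because consecutive reads (and consecutive writes) always alternate
banks, a bank conflict can only occur between a read and a write in the
same cycle, and it clears on the following cycle.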
Finally, there are examples where it is not easily possible to avoid
the use of dual port memories. In such cases the cost of redesigning the
system to use single port memories has to be weighed against the area
savings such a redesign will produce.

ASIC Porting Hint: Avoid large dual port memories if it is possible to
do so without expensive redesigns.

5.9.2 Multiport Memories


There are some cases where more than two ports are required on a
memory. An example of this would be a register file in a typical RISC
processor. Such a register file usually has two read ports and one
write port. In for example a Virtex-4 this is implemented as shown in
Figure 5.18. This memory is implemented by using two dual port
distributed RAM memories where each memory has one read port and one
write port. When writing to this memory block, the same value is
written into both distributed RAMs. This gives the illusion of having a
memory with two read ports and one write port even though in reality
there are actually two separate memories.
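The duplication scheme of Figure 5.18 can be modeled in a few lines;
the class and method names below are chosen for illustration only.

```python
class TwoReadOneWrite:
    """Model of a 2-read/1-write register file built from two
    1-read/1-write memories holding identical copies: writes go to
    both copies, and each copy serves one read port."""
    def __init__(self, entries=16):
        self.copy_a = [0] * entries
        self.copy_b = [0] * entries

    def write(self, addr, data):
        self.copy_a[addr] = data   # same value written into both RAMs
        self.copy_b[addr] = data

    def read(self, addr_a, addr_b):
        # port A reads copy_a, port B reads copy_b, independently
        return self.copy_a[addr_a], self.copy_b[addr_b]

rf = TwoReadOneWrite()
rf.write(3, 42)
rf.write(7, 99)
assert rf.read(3, 7) == (42, 99)
```

The scheme generalizes to more read ports (one copy per port), but
note that it only works with a single write port, since every copy
must accept every write.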
The area cost of a few different register file memories with different
numbers of ports is shown in Table 5.1. While the area grows relatively
slowly in the ASIC, the area for the FPGA based register file grows
extremely fast when going from one to two write ports. The explanation
for this is that the synthesis tool is no longer able to use the
distributed memories. The relatively high area costs for the Cyclone
III based architectures

Figure 5.18: Small register file memory with multiple read ports implemented using duplication

Area cost values are relative to the values for the 32-bit adder in Figure 5.1.

Read ports  Write ports  Spartan 3A  Virtex 4  Virtex 5  Cyclone III  Stratix III  ASIC (Area)  ASIC (Speed)
1           1            0.50        0.50      0.25      9.8†         2.1          0.65         0.67
2           1            1.0         1.0       0.50      20†          2.1          0.78         0.83
2           2            9.5         9.5       13        11           4.5          0.95         1.1
4           2            14          14        15        16           6.1          1.2          1.4
† Area cost includes DSP blocks and/or memory blocks as described in Section 5.4.

Table 5.1: Relative area cost of an 8-bit 16 entry register file memory with different numbers of ports

with one write port are caused by the use of M9K block RAMs, as the
Cyclone III does not have distributed memory.
ASIC Porting Hint: If many small register file memories with only one
write port and few read ports are used in a design, the area cost for
an ASIC port will be relatively high compared to the area cost of the
FPGA version. On the other hand, if more than one write port is
required, the ASIC port will probably be much more area efficient.
Large multiport memories are probably not going to be used in many
FPGA designs due to their extreme area cost. While such a memory
should certainly be more area efficient in an ASIC than an FPGA, the
area cost will be extremely high anyway. As an example, in a die-photo

of an SPU in the Cell processor, the register file is roughly the same
size as one of the large SRAM blocks [36]. However, the RAM block is a
single port memory of 64 KiB whereas the register file has 6 read ports
and 2 write ports and contains only 128 × 128 bits. The single port
memory stores around 30 times as much data as the register file while
using the same amount of die area!

There are also ways to fake a multiport memory that can be used
efficiently in both ASICs and FPGAs. One way is to use a normal memory
with a clock signal that runs at a multiple of the regular system clock
frequency [37]. Another way is to use some sort of caching scheme,
although this implies that the memory is no longer a true multiport
memory.

5.9.3 Read-Only Memories

Another type of memory which hasn't been considered yet is a read-only
memory. Whether a memory is read-only or not is seldom an issue in
FPGAs (although some Altera FPGAs will implement small ROMs more
efficiently than small RAMs), but the impact of using a ROM instead of
a RAM in an ASIC will be substantial. In for example the ATC35 process,
a 1024 × 16 bit RAM is roughly 7 times larger than a ROM of the same
size [34].

The disadvantage of using read-only memory is of course the lost
flexibility. Changes to the contents of the memory will now require an
expensive mask change (although ROMs are typically designed so that
only one mask has to be changed). It is therefore probably not a good
idea to use a ROM-only solution for memories that are likely to contain
bugs, such as firmware or microcode memories. A compromise solution
could be to wrap the ROM in a block that allows a small number of rows
in the memory to be modified at runtime, such as the solution described
by AMD in [38].

ASIC Porting Hint: If some of the memories can be created with a
ROM compiler, the area savings in an ASIC port will be substantial.

5.9.4 Memory Initialization


An FPGA designer has the luxury of being able to initialize the RAM
blocks in the design at configuration time. This means that it is
common to use part of a RAM for read-only data, such as bootloading and
initialization code, whereas the rest of the RAM can be used for
runtime data.
In an ASIC, the contents of RAM memories are instead usually undefined
at power-up. In the example above, the initial firmware/bootloading
code would either have to be part of a ROM or loaded into the RAM
through other means at power-up.

ASIC Porting Hint: Avoid relying on initialization of RAM memories at
configuration time in the FPGA version of a design.

5.9.5 Other Memory Issues


There are also some specialized memories that can be useful in some
cases. One of these is the content addressable memory (CAM). Small
CAM memories can be implemented in an FPGA although it is fairly
expensive to do so as seen in Table 5.2. (It should be noted that significant
performance improvements can be made by utilizing specialized CAM
memories instead of the standard cell based approach used here.)
Even more expensive is the ternary CAM that is often used in routers.
While it is certainly feasible to use small CAM memories in an FPGA, it
is probably a good idea to avoid large CAM memories. In many cases, a
CAM is not strictly required and can be replaced with other algorithms
that are more area efficient in an FPGA such as algorithms based on
searching for the data in a tree-like data structure.
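The basic operation a CAM provides can be modeled behaviorally as
below. The parallel compare of real hardware is modeled here with a
sequential loop, and the match priority (lowest index wins) and entry
count are assumptions for the sketch.

```python
class CAM:
    """Behavioral model of a small content addressable memory: a
    lookup compares the key against every valid entry (in hardware,
    all comparisons happen in parallel) and returns a match index."""
    def __init__(self, entries=16):
        self.valid = [False] * entries
        self.data = [0] * entries

    def write(self, index, key):
        self.data[index] = key
        self.valid[index] = True

    def lookup(self, key):
        for i, (v, d) in enumerate(zip(self.valid, self.data)):
            if v and d == key:
                return i          # first (lowest-index) match
        return None

cam = CAM()
cam.write(5, 0xBEEF)
assert cam.lookup(0xBEEF) == 5
assert cam.lookup(0x1234) is None
```

The per-entry comparator is what makes CAMs expensive in both FPGAs and
standard cell ASICs, and is why tree-based search structures are often
a cheaper substitute.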

Both Fmax and area cost values are relative to the values for the 32-bit adder in Figure 5.1.

Device        Relative performance   Relative area cost
Spartan 3A    0.72                   8.2
Virtex 4      0.64                   8.2
Virtex 5      0.51                   8.2
Cyclone III   0.85                   11
Stratix III   1                      8.2
ASIC (Speed)  0.92                   2.4
ASIC (Area)   0.19                   1.5

Table 5.2: 16 entry CAM memory with 16 bit wide data

ASIC Porting Hint: When porting a design to an ASIC, consider if
specialized memories like CAM memories can give significant area
savings or performance boosts.

Another issue is designs that are heavy users of the SRL16 primitive
in Xilinx FPGAs. These small 16-bit shift registers are commonly used in
delay lines or small FIFOs. As can be seen in Section 10.8, this can lead
to a huge area increase in an ASIC when compared against a design that
doesn’t use SRL16 primitives.

There are a few other considerations to take into account when using
memories in an ASIC process that are not necessary to take into account
in an FPGA. For example it might be possible to generate memories that
are optimized for speed or power consumption. For large memories it
might be a good idea to use a memory with redundancy so that a tiny
fabrication error in the memory will not cause it to fail. In an ASIC mem-
ory it is often also possible to use a write mask, which means that it is
possible to write to only certain bits in a memory word without having
to do a read-modify-write operation.
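The merge performed by a write mask can be modeled as below. The mask
convention (set bits take the new data) and the per-bit granularity are
assumptions; real memories often mask per byte or per bit depending on
the compiler options.

```python
def masked_write(word, data, mask, width=16):
    """ASIC-style write mask: bits set in `mask` take the new data,
    the remaining bits keep the old word. The merge happens inside
    the RAM, so the surrounding logic needs no read-modify-write."""
    full = (1 << width) - 1
    return (word & ~mask & full) | (data & mask)

# update only the low byte of a 16-bit word
assert masked_write(0xABCD, 0x00EF, 0x00FF) == 0xABEF
```

Without the mask, the same update would cost a read cycle, a merge in
logic, and a write cycle.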

ASIC Porting Hint: Consider if special memory options that are
unavailable in FPGAs, like write masks, can improve the design in any
way.

5.10 Manually Instantiating FPGA Primitives


In some cases it might be necessary to manually instantiate adders and
subtracters in an FPGA design. One reason could be that floorplanning
requires deterministic instance names; another reason could be that the
synthesis tool is not able to infer a desired configuration
automatically (although modern synthesis tools are usually pretty good
at this). Regardless of the reason, such a design cannot be directly
synthesized to an ASIC due to the dependencies on FPGA elements.
By using a compatibility library it is still possible to synthesize the
design. The drawback of this approach is that the synthesis tool will
only see a bunch of combinational logic instead of a plus sign in the
source code. This means that optimized adder structures such as the
ones available in the Synopsys DesignWare library will not be used.
An example of the performance and area cost of a 32-bit adder/sub-
tracter implemented through instantiation of Xilinx primitives is shown
in Figure 5.19. The performance of the ASIC port is not quite as good
as the performance of the adder/subtracter shown in Figure 5.2, but the
area is also lower. The synthesis tool is obviously able to do a pretty good
job of optimizing this structure even though it doesn’t know beforehand
that it is an adder/subtracter.
A word of warning though: it is extremely important that the module
that contains FPGA primitives is instantiated without hierarchy.
Otherwise the synthesis tool will not be able to perform such
optimizations. Experiments indicate that if hierarchy is disabled, the
performance of an instantiated adder will be roughly the same as that
of an area optimized adder.
ASIC Porting Hint: It is possible to port a design with instantiated
FPGA primitives using a small compatibility library. For adders and
subtracters, the performance will be adequate unless they are a part of
the critical path in the ASIC. If this approach is used it is
imperative that the modules with instantiated FPGA primitives are
flattened during synthesis and before the optimization phase!

Both Fmax and area cost values are relative to the values for the 32-bit adder in Figure 5.1.

Device        Relative performance   Relative area cost
Spartan 3A    0.97                   1
Virtex 4      0.98                   1
Virtex 5      0.9                    1
ASIC (Speed)  0.76                   1.4
ASIC (Area)   0.15                   0.28

(All 32 LUTs/flip-flops not shown.)

Figure 5.19: 32-bit adder/subtracter using FPGA primitives

5.11 Manual Floorplanning and Routing

Manual floorplanning and manual routing are labor intensive processes
that allow a designer to precisely control where certain parts of a
design will end up in the FPGA. While this process can improve the
timing of an FPGA design it should in theory have no impact on an ASIC.
However, it is likely that other optimizations are necessary to
efficiently support manual floorplanning, such as manually
instantiating memories, LUTs and flip-flops to make sure that the names
of these primitives in the netlist are not changed. By making these
changes to the source code it is possible that the synthesis output
might also vary, and the result of these changes must be evaluated
separately.

ASIC Porting Hint: Manual floorplanning of an FPGA design will not have
any impact on an ASIC port unless the design is modified to simplify
floorplanning in the FPGA.

Area cost values are relative to the values for the 32-bit adder in Figure 5.1.

Pipeline stages  Spartan 3A  Virtex 4  Virtex 5  Cyclone III  Stratix III  ASIC (Area)  ASIC (Speed)
1                260†        380†      450†      57           54           5.5          20
2                260†        380†      450†      58           77           6.1          13
3                260†        380†      450†      61           79           6.8          14
4                260†        380†      450†      60           78           7.0          13
† Area cost includes DSP blocks and/or memory blocks as described in Section 5.4.
(The constant coefficient multipliers are implemented in the fabric in the Cyclone III and Stratix III devices and don't use DSP blocks.)

Table 5.3: Relative area of an eight point 1D DCT pipeline

Fmax values are relative to the values for the 32-bit adder in Figure 5.1.

Pipeline stages  Spartan 3A  Virtex 4  Virtex 5  Cyclone III  Stratix III  ASIC (Area)  ASIC (Speed)
1                0.25        0.17      0.16      0.39         0.39         0.073        0.30
2                0.33        0.21      0.21      0.47         0.52         0.079        0.35
3                0.38        0.28      0.27      0.59         0.58         0.10         0.41
4                0.37        0.28      0.27      0.73         0.67         0.10         0.40

Table 5.4: Relative performance of an eight point 1D DCT pipeline

5.12 Pipelining
Pipelining is an important technique to improve the performance of a
digital design. In FPGAs this can be done in an area efficient manner
since there is usually a large number of flip-flops available in an
FPGA. When porting a design to an ASIC it is seldom a drawback to have
a large number of pipeline stages as well if performance is the number
one priority. Unfortunately flip-flops can be fairly expensive in terms
of area in an ASIC, which means that it may be a good idea to rewrite
the design so that fewer pipeline stages are needed. However, it is not
true that an increased number of pipeline stages will always increase
the area. Consider for example the case outlined in Table 5.3 and
Table 5.4, where a pipeline for an eight point 1D DCT was implemented
using different numbers of pipeline stages. (This module was not
optimized for FPGAs in any way.) When going from one to two pipeline
stages the maximum

frequency increased by almost 15% even though the area of the new de-
sign is only 65% as large as the area of the design with a shorter pipeline.
This is probably caused by area inefficient optimizations used by the syn-
thesis tool as it is struggling to reach an unreachable performance goal.
When adding extra pipeline stages after that, the performance increases
until a plateau is reached at the third pipeline stage.
This example shows that pipelining a datapath is not guaranteed to
increase the area, although it is not uncommon that the area increases,
especially if the datapath is not part of a critical path, such as when
adding delay registers to synchronize the values in one datapath with
the values in another datapath.
ASIC Porting Hint: While pipelining an FPGA design will certainly not
hurt the maximum frequency of an ASIC, the area of the ASIC will often
be slightly larger than necessary, especially if the pipeline is not a
part of the critical path in the ASIC. Designs that contain huge
numbers of delay registers will be especially vulnerable to such area
inefficiency.

5.13 Summary
There are a wide variety of issues that need to be taken into account
when porting a design from an FPGA to an ASIC. This is especially true
if the design has been optimized for a certain FPGA from the beginning
without any thoughts of an ASIC port.
When porting a design the most tricky areas are likely to be the
architecture around the memories and multipliers. If these have been
optimized for a specific FPGA, an ASIC port is likely to be suboptimal.
The other FPGA optimizations are not going to harm an ASIC port and are
in some cases even beneficial.
There are also many other issues that have not been discussed here,
like I/O, design for test, and power dissipation, which the designer
needs to take into account as well.
Part II

Data Paths and Processors

Chapter 6

An FPGA Friendly
Processor for Audio
Decoding

Abstract: In this chapter a DSP processor specialized for audio decoding will be
described. While not specifically optimized for FPGAs, the processor is still able
to achieve a clock frequency of 201 MHz in a Virtex-4 and the performance of the
processor when decoding an MP3 bitstream is comparable to a highly optimized
commercial MP3 decoding library.

In early FPGAs soft processors were seldom used due to their large
area and the high cost of an FPGA compared to a microprocessor.
Nowadays the situation is decidedly different and soft processors are
regularly used in anything from the smallest to the largest FPGAs.
Although most people are probably using the soft processor cores that
are available from their FPGA vendor, there is also a huge number of
processors available at for example OpenCores [39] (95 at the time of
writing). There is certainly no lack of choice in the soft processor
market.


6.1 Why Develop Yet Another FPGA Based Processor?

The main reason to create yet another soft core processor is that the
majority of the processors that are available are not really optimized
for FPGAs. While some processors on OpenCores probably have a decent
performance in an FPGA, none are likely to match the performance of
the FPGA optimized processors that are available from the FPGA
vendors. Another issue is that there does not seem to exist a credible
DSP processor that has been optimized for FPGAs.

At first, the idea of using a DSP processor for signal processing in
an FPGA does not make sense. If the DSP algorithm is computationally
intensive it is probably a much better idea to create custom hardware
in the FPGA instead of using a processor for it. On the other hand,
many applications employ many different algorithms that are not
individually very computationally intensive.

Consider for example a video camera with a high definition image
sensor and a microphone for audio recording. Two tasks that have to be
carried out in this device are motion estimation for video and MDCT
processing for audio. These are both signal processing tasks that can
easily be accelerated in hardware. The difference is that over the
course of one second, the number of calculations needed for the MDCT
is negligible when compared to the number of calculations required for
motion estimation. Therefore it makes sense to create a hardware based
accelerator for the motion estimation whereas the MDCT for audio
decoding could be done in software.

A processor with good digital signal processing performance is
therefore a good idea for these kinds of tasks that don't require an
accelerator but are still computationally intensive for a general
purpose processor. Such a processor also needs to be optimized for
FPGAs, as an ASIC based DSP is unlikely to have a high performance in
an FPGA.

6.2 An Example of an FPGA Friendly Processor


The main idea behind the xi¹ processor described in this chapter is
that it should use floating point arithmetic for most of the
calculations that have to be performed when decoding an MP3 bitstream.
This allows a high dynamic range to be achieved using fewer bits than
fixed point arithmetic, which means that the amount of memory required
for intermediate data is significantly reduced. (Dynamic range is very
important for audio applications, and if a high dynamic range is
available the precision doesn't have to be extremely high; see [40]
for more details.)
The architecture of the processor was not specifically optimized for
FPGAs. However, one of the goals was that the architecture should be
suitable both for FPGA and ASIC implementation to simplify prototyp-
ing.
The processor has also been manufactured in a 180 nm ASIC process
(see photo on the back cover of the thesis), although the evaluation of
this MPW chip is not yet complete and will be published in a later publi-
cation.

6.2.1 Processor Architecture


The processor core is a fairly standard pipelined RISC processor. There
are 5 pipeline stages for integer operations and 8 pipeline stages for float-
ing point operations. To increase the performance of MP3 decoding the
processor has support for circular buffers and floating point MAC for fil-
ter acceleration and bitstream reading instructions for Huffman decod-
ing acceleration.
There are 16 general purpose registers. Each register can hold either
16 bit integer data or 23 bit floating point data (16 bit mantissa, 6 bit
exponent and one sign bit). The program memory is 24 bit wide and can
contain up to 8192 instructions. The constant memory is 23 bits wide and
¹ The designers selected this name due to heavy indoctrination by the
math department (no math lecture seems to be complete unless at least
one ξ has been written on the whiteboard. . . )

Figure 6.1: Pipeline of the xi audio decoding processor

1024 words large. It is primarily used for floating point constants.
The data memory is 16 bits wide and 8192 words can be stored in it. In
total the chip has 343 kilobits of memory.
It is not possible to read or store a 23 bit floating point value
directly in the data memory due to the difference in size. This is
handled by converting floating point values into a 16 bit format
before they are stored to memory. The different word lengths were
chosen by profiling an MP3 decoder, which was modified to support a
number of different floating point formats.

6.2.2 Pipeline
The architecture of the processor is shown in Figure 6.1. As can be seen,
the processor has a relatively long pipeline, especially the floating point
units. Every pipeline step also does relatively little work, which means
that it should be relatively simple for a synthesis tool to map this archi-
tecture to an FPGA. (Another reason for the long pipeline was to ensure

that the performance in an ASIC would still be adequate if the power
supply voltage was reduced in order to save power.)
There is no result forwarding used in this architecture, partly because
the muxes required to implement this would increase the area, especially
in FPGAs. Omitting result forwarding also simplified the verification.
Another reason was that it was not clear whether the clock frequency
of the processor would be high enough in an FPGA to decode MP3 bit-
streams in real time if forwarding was available. Another feature that
was not included due to the verification cost was hazard detection (it is
up to the programmer to avoid hazards by careful scheduling of the code
or by inserting NOP instructions).

6.2.3 Register File


In the processor there are a few special registers that control for example
the address generator for the constant memory. It is tempting to place
these registers in the regular register file. The advantage is that no special
instructions are required to access these registers and it is also possible
to manipulate these registers arbitrarily using most of the instructions
in the instruction set. However, the disadvantage is that it is no longer
possible to use the distributed memory available in for example Xilinx
FPGAs to implement the register file. Because of this the decision was
taken to implement special registers separately and use special instruc-
tions to access these registers.

6.2.4 Performance and Area


The performance of this processor is summarized in Table 6.1. This ver-
sion has a total of 351 kbit memory. The most critical path is in the ad-
dress generator to the data memory in the FPGA version and inside the
multiplier for the ASIC version.
Virtex-4            LUTs        2370
(speedgrade 12)     Flip-flops  1048
                    RAMB16      22
                    DSP48       1
                    Fmax        201 MHz
130nm ASIC          Area        2.3 mm2
(Speed optimized)   Fmax        396 MHz

Table 6.1: Performance and area of audio processor

Table 6.2 compares our MP3 decoder with the commercial Spirit MP3
Decoder [41] on a variety of platforms. The numbers for peak MIPS are
based on decoding a 48 kHz 320 kbit/s MP3 bitstream. In the case of the
LiU firmware the value of 20 MIPS is reached by constructing a synthetic
bitstream where all options and values in the bitstream have been chosen
to trigger the worst case behavior of the decoder.
As can be seen, the xi MP3 decoder running our own firmware compares
favorably with the Spirit MP3 Decoder [41] on single issue processors
such as the ARM9. As for the memory usage,
the use of floating point constants initially seems to drastically reduce
the amount of constant memory used. However, this is not necessar-
ily true as the Huffman decoding is implemented entirely as a program
in our decoder. (It is not clear how the Spirit MP3 decoder stores the
Huffman table but it is likely that it is stored in constant memory.) The
program memory size for our decoder is not so impressive although op-
timizing the size of the program memory was never a goal for this partic-
ular project. There is a lot that could be improved here, see Section 6.2.6
for more information. Also note that our decoder is only able to decode
MPEG-1 Layer III bitstreams. A fully compliant decoder should also be
able to decode Layer I and II, which would take some additional program
memory.
When comparing the performance of our processor with the VLIW
based AudioDE processor it is also clear that an impressive performance
improvement can be gained by increasing the parallelism. There is
however a rather steep increase in terms of memory area and probably
core area for this improvement.

Processor  Firmware                      Peak/Average MIPS  Memory cost
xi         Custom† (limited accuracy)    20/14    35.1 kB (20.5 kB program, 2.6 kB constants, 12 kB data)
xi         Custom† (full precision)      20/14‡   37.5 kB (20.5 kB program, 2.7 kB constants, 14.3 kB data)‡
                                                  (6.8 kB constants, 13.3 kB data)
ARM9       Spirit                        22/17.5  39.2 kB (19.7 kB program, 7.2 kB constants, 12.3 kB data)
AudioDE    Spirit                        5.5/5    54 kB (27 kB program, 27 kB data+constants)
† Does not include Layer I or II decoding   ‡ Estimated from simulation.

Table 6.2: MP3 decoder comparison (lower values are better)


Finally, we believe that it is a huge advantage to be able to use floating
point numbers since it is quite easy to convert a high level model in
ANSI C or Matlab to our processor, as most operations can be ported
immediately. If a fixed point DSP were used, more development time
would have to be spent on making sure that the dynamic range of the
fixed point calculations is sufficient. Our approach should lead to shorter
development time and faster time to market. This advantage is harder to
measure quantitatively, however.

6.2.5 What Went Right


Overall this project was a success. We have shown that it is possible to
implement an efficient audio processor with a floating point unit. The
ability to use floating point arithmetic significantly reduces the firmware
development time (and consequently time to market), since there is no
need to analyze the algorithms very deeply; the floating point numbers
handle scaling automatically.
The maximum frequency of the processor is high, and its efficiency is
also very good: any MP3 bitstream can be decoded while running at only
20 MHz.

The simple architecture without result forwarding and hazard detection
meant that verification was easy, as few bugs were present in the design.
The simple architecture and deep pipeline also mean that the architecture
is very suitable for an FPGA. It is important to note that the architecture
is suitable for an ASIC as well, since the ASIC performance is also high.

6.2.6 What Could Be Improved

As has already been mentioned, this project was mostly aimed at demon-
strating that low precision floating point arithmetic can efficiently be
used for audio applications. Therefore a number of details were not in-
vestigated fully and could certainly be improved.

Program Size

The most important thing to improve is the program memory size. One
way could be to improve the instruction encoding. If the instruction
word could be reduced by two bits this would mean an 8% saving in
program memory size. This would be possible at the expense of reducing
the size of immediate constants and would therefore mean a slight
decrease in performance due to the need to load certain constants into
registers before use.
Another way to decrease the program memory size is to use loop
instructions instead of unrolling some code. This could probably save
around a kilobyte of program code, especially in the windowing, which
consists of large unrolled convolution loops.
However, the largest savings would come from optimizing the Huffman
decoder. Right now all Huffman tables are implemented entirely in
software using the read-bit-and-branch-conditionally instruction. This is
very wasteful and it is likely that the size of the Huffman tables (6.7 kB)
could be reduced by about half by using a memory with a custom bit
width and a small Huffman decoding accelerator. This would also have
the advantage of speeding up the MP3 decoder itself.

Processor Architecture

While the processor architecture was certainly good enough to implement
an efficient decoder, it was not very programmer friendly. Since the
pipeline contained no hazard detection it was often necessary to
reschedule the code to avoid data hazards. This was especially true for
the floating point instructions, as four unrelated instructions have to be
issued before the result of such an instruction is available for use. For
integer operations it is necessary to issue two unrelated instructions
before the result is available.
The lack of hazard detection presents an interesting challenge for the
programmer who has to schedule the code so as to minimize the need
for NOP instructions in the code2 . In the current firmware about 10% of
the program consists of such NOPs. (Making it possible to remove these
NOPs would reduce the firmware size by about 2 kB.) When running the
MP3 decoder, around 6.6% of all executed instructions are NOPs.
The lack of register forwarding can actually be seen as a feature in some
cases, since intermediate values can be stored in the pipeline instead of
in the register file. This makes it possible to perform a 16 point floating
point DCT with only a 16 entry register file, without having to store any
intermediate results in memory. However, this makes it hard to
implement interrupts, since the entire state of the pipeline would have to
be saved and restored for the MP3 firmware to work correctly.

6.2.7 Conclusions
Although the xi processor successfully demonstrated that floating point
arithmetic is a good idea for audio processors, the processor and
firmware could be improved:

2 The assembler and simulator will help the programmer with this by warning him when he is not using the pipeline correctly.



• In a processor without result forwarding it is usually possible to im-


plement DSP tasks efficiently through careful instruction schedul-
ing to avoid data dependencies. (Although this is not trivial in
some cases.)

• When a data dependency could not be avoided the lack of auto-


matic stalling also meant that NOPs had to be inserted, at a cost of
increased program memory size.

• The utilization of the pipeline for intermediate storage of results
meant that it was cumbersome to implement interrupts.

• The integer pipeline and the floating point pipeline had different
lengths, making it possible to create a situation where both pipelines
are trying to write to the register file at the same time.

Of these, the total lack of result forwarding is probably the worst
problem. While it is an interesting and challenging job to write and
schedule code for the xi processor, it is not a good use of developer time.
Chapter 7
A Soft Microprocessor
Optimized for the Virtex-4

Abstract: In this chapter the tradeoffs that are required to design a micro-
processor with a very high clock frequency in an FPGA are described. The design
has been carefully optimized for the Virtex-4 FPGA architecture to ensure as
much flexibility as possible without making any compromises regarding a high
clock frequency. Even though the processor is more advanced than the FPGA
friendly processor described in the previous chapter, the FPGA optimizations
allow it to operate at a much higher clock frequency. With floorplanning, the
processor can achieve a clock frequency of 357 MHz in a Virtex-4, which is con-
siderably higher than other FPGA optimized processors.

Although the processor described in the previous chapter was successful,
it did have a number of flaws that made it less attractive to a
programmer; in particular, the lack of result forwarding is a big problem
for programmer productivity. The challenge that will be discussed in
this chapter is how to create a processor (called xi2) that is truly
optimized for an FPGA while at the same time addressing the total lack
of result forwarding in the processor described in the previous chapter.
The initial goal for this project was to be able to utilize the floating
point units described in Chapter 8 in a processor on an FPGA1. This
means that the design should be able to operate at around 360 MHz in
a Virtex-4 (speedgrade 12). To determine if this was remotely feasible,
a number of small designs were synthesized to the Virtex-4 to estimate
their maximum operating frequencies:

• 32-bit Arithmetic logic unit

• Result forwarding

• Address generator

• Pipeline Stall generation

• Shifter

These components were selected as they were suspected to be diffi-


cult to implement in hardware at high speed based on previous experi-
ence with Xilinx FPGAs. For every item in the list a couple of different
designs were tried until an acceptable solution was found.

7.1 Arithmetic Logic Unit


The arithmetic logic unit (ALU) can be considered to consist of two sep-
arate parts, a logic unit and an adder/subtracter. The logic unit was not
expected to be a part of the critical path. It was however unknown how
many bits an adder/subtracter could contain before limiting the
performance of the processor. The performance of an adder for various
numbers of bits is shown in Table 7.1.
The biggest problem is that the results in the table assume that a
flip-flop is present on all inputs and outputs of the adder. In a processor this
means that there is no way to utilize the result of an addition in the next
clock cycle. In order to support forwarding of results it is necessary to
modify this circuit. Ideally a mux would be present on each input to the
adder to select from a number of different results. Figure 7.1 shows the
maximum performance attainable with various mux configurations.

1 The floating point components have not been integrated yet as we believe that the current version of the processor is very competitive, even without floating point support.

Number of bits   Maximum frequency
16               529.6 MHz
24               467.5 MHz
32               412.2 MHz
40               377.8 MHz
48               338.2 MHz

Table 7.1: The performance of an adder/subtracter for various sizes

[Figure 7.1 shows eight adder/mux configurations and their maximum
frequencies: a) 412 MHz, b) 341 MHz, c) 330 MHz, d) 299 MHz,
e) 285 MHz, f) 269 MHz, g) 247 MHz, h) 403 MHz. Note that circuit h
has no subtract functionality.]

Figure 7.1: The maximum frequency of various adder/mux combina-
tions in a Virtex-4 (fastest speedgrade)
As can be seen in Figure 7.1, the performance drops drastically compared
to the 412.2 MHz shown in Table 7.1. This is especially true for an
adder/subtracter with muxes on both inputs (e-g). This can be mitigated
by removing the subtracter from the adder as seen in the last
configuration (h) in Figure 7.1. Only configurations a and h can be
implemented using one lookup-table per bit in the adder. The remaining
configurations suffer from additional routing and logic delays.
Unfortunately configuration h doesn’t directly support subtraction.
This can be fixed by moving the inverter used for subtraction to the pre-
vious pipeline stage as seen in Figure 7.2. Additionally, by utilizing the
Force0, Force1, InvA, and Swap signals, an arithmetic unit can be con-
structed that can handle most result forwarding operations without any
penalty as described in Table 7.2. The only operation this configuration
cannot handle is to forward the same result to both inputs since it is only
possible to feed back the result to one input of the adder. (Although this
is of little use when subtracting.)
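The reason moving the inverter works is the two's complement identity a − b = a + ~b + 1: the inversion and carry-in can be prepared one pipeline stage ahead, leaving a plain adder in the execute stage. The following C sketch illustrates the idea; the staging and signal handling are simplified compared to the actual xi2 datapath.

```c
#include <stdint.h>

/* Stage 1 (previous pipeline stage): optionally invert operand B and
 * prepare the carry-in, exploiting a - b = a + ~b + 1. */
typedef struct { uint32_t a, b; uint32_t cin; } au_regs;

static au_regs stage1(uint32_t a, uint32_t b, int subtract)
{
    au_regs r;
    r.a   = a;
    r.b   = subtract ? ~b : b;  /* inversion happens before the adder */
    r.cin = subtract ? 1 : 0;
    return r;
}

/* Stage 2 (execute): a plain adder, i.e. configuration h in Figure 7.1. */
static uint32_t stage2(au_regs r)
{
    return r.a + r.b + r.cin;
}
```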
Finally, the careful reader may have noticed that it is not possible
to forward data from any other execution unit directly to the arithmetic
unit. The lack of full forwarding is a weakness of this processor, but
the implementation of partial forwarding allows a significant increase in
clock frequency when compared to a processor with full forwarding. For
example, if full forwarding was implemented it is likely that the
configuration in Figure 7.1g would have to be used, which is
substantially slower than the current implementation. As a comparison,
without any forwarding at all, the runtime cost of all NOPs that were
inserted to avoid data dependency problems in the MP3 decoder in the
previous chapter was around 6.6%. Since the Fmax difference between
full forwarding and partial forwarding is considerably larger than 6.6%,
the tradeoff should be more than worth it in terms of performance. A
more in-depth discussion of partial forwarding can be found in [42].

©2008 IEEE. Reprinted from Field Programmable Logic and Applications, 2008 (FPL 2008), International Conference on, Ehliar, A., Karlström, P., Liu, D.

Figure 7.2: An arithmetic unit that can handle the most common result
forwarding operations (Maximum frequency, 403 MHz in a Virtex-4 of
the fastest speedgrade)

Instruction sequence   Forwarded operand   Control signals
add r2,r1,r0           -                   -
add r2,r1,r2           OpB                 Force0=1 Select=1
add r2,r2,r1           OpA                 Swap=1 Force0=1 Select=1
sub r2,r1,r2           OpB                 Force1=1 Select=1
sub r2,r2,r1           OpA                 Swap=1 InvA=1 Select=1
sub r2,r2,r2           Both                Replace with set r2,#0
add r2,r2,r2           Both                Cannot forward directly

Table 7.2: Forwarding operands to the arithmetic unit (The order of the
operands is destination register, OpA, OpB).

7.2 Result Forwarding


As has already been seen in the previous section, it can be tricky to im-
plement forwarding in a processor when it is optimized for FPGAs. It
simply isn’t possible to use as many muxes as one would like since they
won’t fit into a critical path in combination with for example an adder.
And as was seen in Section 5.6, the area cost for muxes is very large in
an FPGA as well. The number of muxes should therefore be reduced for
both area and performance reasons.
A common way to reduce the multiplexers in optimized soft core pro-
cessors is to utilize the reset input of a pipeline register to set unused reg-
isters in all pipeline stages to 0. When this has been done only an or-gate
is required to select the correct result from that particular pipeline stage.
This is illustrated in Figure 7.3. While a mux is still required to select
which pipeline stage to forward a result from, muxes are not required to
select the execution unit in that particular pipeline stage.
It is not enough that the pipeline is able to support result forwarding;
it is also necessary to detect when forwarding is needed. One way to do
this is to simply encode all forwarding information into the instruction
word. This is the most simple solution in terms of hardware.

[Figure 7.3 shows three execution units whose registered outputs are
reset to 0 when unused and or:ed together, with a mux selecting results
from other pipeline stages]

Figure 7.3: Result forwarding with reset/or structure

Original code:
      sub r1,r9,r3
      add r2,r1,r7
loop:
      add r2,r2,#1
      ...
      ...
      bne loop

With manual forwarding:
      sub r1,r9,r3
      add r2,FWAU,r7
loop:
      add r2,r2,#1 ; Ambiguous
      ...
      ...
      bne loop

Figure 7.4: Result forwarding controlled by software (The order of the
operands is destination register, OpA, OpB)

Unfortunately it is not an optimal solution. Consider the assembly code in


Figure 7.4. On line 2 the result of the previous operation is forwarded
by using FWAU as the source register. The first instruction of the loop
should also use FWAU instead of r2 as the source register the first time
the loop is run. However, the second time r2 should be read from the
register file. In conclusion, while it is not impossible to use manual for-
warding in a processor, to do so places severe restrictions on the pro-
grammer. Interrupts are also troublesome to implement as the control
flow can now be interrupted at almost any time, disrupting the manual
forwarding process.
To simplify things for the programmer, automatic result forwarding
was implemented instead of letting the programmer handle it. Figure 7.5
shows the first experiment as a. By putting the forwarding decision into
the decode stage the result is available in the next pipeline stage. Unfor-
tunately the performance of 337 MHz for this configuration when used
in the processor was not quite satisfactory. (Mostly due to the complexity
of quickly generating the control signals required for the forwarding in
the arithmetic unit described in the previous section.)
Instead, some matching logic was moved to the pipeline stage before
the decode stage. This means that the time available for reading from
the program memory is shortened somewhat. But the performance of
the program memory readout is still satisfactory as only a single LUT is
inserted before the flip-flops. This is shown as b in the figure.
It should also be noted that the forwarding logic in the xi2 processor is
slightly more complicated than described in this section as this pipeline
stage also has to generate the Force0/Force1/Swap/InvA signals shown
in Figure 7.2. It also has to handle data from the constant memory and
immediate data from the instruction word.
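The matching itself amounts to comparing the source register fields of an incoming instruction with the destination register of the instruction ahead of it. A minimal C sketch of this decision follows; the field names and encoding are illustrative, not the actual xi2 instruction format.

```c
/* Decide whether each operand of the incoming instruction must be
 * taken from the forwarding path instead of the register file. The
 * instruction representation here is a simplification for
 * illustration. */
typedef struct { int dest, srcA, srcB, writes_reg; } instr;

typedef struct { int fwdA, fwdB; } fwd_ctrl;

static fwd_ctrl match_forwarding(instr prev, instr cur)
{
    fwd_ctrl c = { 0, 0 };
    if (prev.writes_reg) {
        c.fwdA = (cur.srcA == prev.dest);  /* OpA comes from feedback */
        c.fwdB = (cur.srcB == prev.dest);  /* OpB comes from feedback */
    }
    return c;
}
```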

7.3 Address Generator


A very common structure in DSP applications is the circular buffer. To
efficiently handle for example FIR filters a DSP employs a number of
techniques:

• Combined Multiply-Accumulate (MAC) unit

• Hardware loop support

• Circular addressing to feed the MAC unit with data

The MAC unit is relatively easy to support using the DSP48 blocks
in the Virtex-4 FPGA. Hardware loop support is also relatively easy to
implement in the program counter module. However, the address
generator is not as easily implemented.

Figure 7.5: Performance for different forwarding configurations when
used in the processor

[Figure 7.6 shows two address generator circuits for modulo addressing:
a) a compare-and-correct implementation using Top and Size, 209 MHz;
b) an implementation using precomputed Top−Step, Top−Step*2, and
Step−Size values, 458 MHz]

Figure 7.6: Address generators for modulo addressing

When written in C, the following assignment describes one step in a
circular buffer2:

ADDR = (ADDR + STEP - BOT) % SIZE + BOT;
If SIZE is not a power of two, this addressing mode is clearly going
to be expensive to implement. However, if the step size is small enough,
it can be implemented inexpensively using the following C code:
ADDR += STEP;
if (ADDR > TOP) ADDR -= SIZE;
if (ADDR < BOT) ADDR += SIZE;
If it is certain that the buffer only has to be traversed in one direction,
one of the conditions can be removed. A hardware implementation of
this is shown in Figure 7.6 (a). The performance of this configuration is
clearly not good enough for our purposes. However, if it is acceptable to
use slightly different parameters configuration b can be used instead. The
major disadvantage of configuration b is that the end parameter
(TOP-STEP*2) cannot be used for the first iteration and TOP-STEP has
to be used instead. (This could be handled automatically by the CPU
when updating this register at a cost of some extra hardware.)

2 The addressing mode is sometimes referred to as modulo addressing due to the use of the modulo operator.
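For reference, the guarded update with the buffer traversed in one direction only can be sketched in C as follows; the parameter handling is simplified compared to the hardware.

```c
/* One step through a circular buffer that is only traversed forward.
 * TOP is the last valid address and SIZE the buffer length; STEP is
 * assumed to be at most SIZE so one conditional subtraction suffices. */
static unsigned circ_step(unsigned addr, unsigned step,
                          unsigned top, unsigned size)
{
    addr += step;
    if (addr > top)       /* wrapped past the end of the buffer */
        addr -= size;
    return addr;
}
```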

7.4 Pipeline Stall Generation


There are several ways to handle hazards in a program. The most flexible
way is that the processor automatically detects the hazard and stalls the
pipeline until the required data is available. The opposite solution is to
ignore all hazards in hardware and trust the programmer to never create
a situation the hardware cannot solve. The first version of the MIPS
processor, for example, did not have any hardware to detect this
situation (although later versions do) [43]. There are advantages and
disadvantages to both solutions. A software based solution increases the
code size because of forced NOP instructions. A hardware solution keeps
the code size low while increasing the complexity of the hardware,
possibly increasing the critical path, and complicating verification as
there are many more situations to test.
A software based solution can also be advantageous because the pipeline
registers themselves can be used to store temporary results. Consider the
following C code for example:

tmp = A - B; B = A + B; A = tmp;

In a regular processor this will need three registers. However, in a


processor without hazard detection and result forwarding it can be done
using only two registers:

add rA,rB,rA ; New value of rA available after 2 cycles
sub rB,rA,rB ; This will use the _old_ value of rA

Effectively, this allows the programmer to utilize the pipeline regis-


ters in the processor as extra storage. This allows for example a 16 point
DCT to be performed using only 16 registers without resorting to any
temporary storage, a technique that was heavily utilized in the MP3 de-
coder described in the previous chapter. The drawback with this
architecture is that it is difficult to implement interrupts as the current
state of the pipeline registers must be saved as well.
Ideally, when implementing hazard checking in hardware, it is easiest
if the pipeline can be stalled as early as possible. Initial experiments
indicated that it would not be possible to stall the pipeline in the decode
stage, which is the ideal place for it. While it would be possible to stall
an instruction in later pipeline stages this would complicate the pipeline
and it was decided that an increase in program size could be accepted for
now. This is a slight disappointment since the lack of automatic stalling
was a rather large inconvenience when programming the xi processor
described in the previous chapter. However, if the assembler is capable
of automatically inserting NOPs as needed, the programmer will never
notice that hazard checking is not implemented in the hardware except
as an increase in program memory usage.
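Such an assembler pass is straightforward: it only has to track which destination registers are still in flight. The following C sketch counts the NOPs needed under an assumed two cycle result latency; the instruction representation is illustrative.

```c
/* Toy model of an assembler NOP-insertion pass: with a result latency
 * of two cycles, an instruction must not read a register written by
 * either of the two preceding issue slots. Register number -1 means
 * "no register". */
#define LATENCY 2

typedef struct { int dest, srcA, srcB; } ins;

static int reads(ins i, int reg)
{
    return reg != -1 && (i.srcA == reg || i.srcB == reg);
}

static int count_nops(const ins *prog, int n)
{
    int inflight[LATENCY] = { -1, -1 };  /* recently written dest regs */
    int nops = 0;
    for (int i = 0; i < n; i++) {
        /* insert NOPs until no in-flight destination is still read */
        while (reads(prog[i], inflight[0]) || reads(prog[i], inflight[1])) {
            inflight[1] = inflight[0];
            inflight[0] = -1;  /* the NOP writes nothing */
            nops++;
        }
        inflight[1] = inflight[0];
        inflight[0] = prog[i].dest;
    }
    return nops;
}
```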

7.5 Shifter
The final component in the execute pipeline stage is the shifter. Figure 7.7
shows a number of different shift configurations. a and b show simple
32 bit shifters, and the shifter in c can shift both left and right. It is no
surprise that the version that can shift in both directions is slower. By
pipelining version c, we arrive at version d, which is barely fast enough
but lacks arithmetic right shift. Adding an arithmetic right shift to d re-
sults in the relatively slow e. As both arithmetic and logic shift is typically
required in a processor another approach is necessary. f implements e in
a different way. Instead of implicitly writing the >>> operation in Ver-
ilog, it is implemented by generating a mask that is always or:ed together
with the result of the logic right shifter in the second pipeline stage. An-
other optimization is that a second unit determines in parallel if the shift
will be longer than 32 bits. In that case the result register will be set to
zero, regardless of the result of the shift (except for arithmetic right shift
of a negative number, which is handled by the mask generation unit).
Version f is used in the processor.
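The behavior of version f can be modeled in C as follows; this is a behavioral sketch of the mask generation and big-shift detection, not the exact LUT mapping used in the processor.

```c
#include <stdint.h>

/* Arithmetic right shift built from a logical right shifter plus a
 * separately generated mask that is or:ed into the result. Shifts of
 * 32 or more ("big shifts") force the result to all zeros, or all ones
 * for arithmetic shifts of negative values. */
static uint32_t sra32(uint32_t a, unsigned b)
{
    int negative = (a & 0x80000000u) != 0;
    if (b >= 32)                        /* big-shift detection unit */
        return negative ? 0xFFFFFFFFu : 0;
    /* mask generation: the sign-fill bits for an arithmetic shift */
    uint32_t mask = (negative && b != 0) ? 0xFFFFFFFFu << (32 - b) : 0;
    return (a >> b) | mask;             /* logical shifter + or */
}
```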
[Figure 7.7 shows six shifter configurations and their maximum
frequencies: a) and b) simple 32 bit shifters, 326 MHz and 331 MHz;
c) combined left/right shifter, 293 MHz; d) pipelined version of c,
375 MHz; e) d extended with arithmetic right shift, 283 MHz; f) two-stage
shifter with separate mask generation for arithmetic shifts and big-shift
detection, 402 MHz]

Figure 7.7: Various 32-bit shifter architectures


[Figure 7.8 shows the pipeline: Fetch (PC, PM), Decode (CM, IR), Read
operands (RF), Register forwarding (FW), Execute 1 (Out Port, AU, LU,
Shift 1, MEM), Execute 2 (Flags, Shift 2, Align), Writeback (WB)]

Figure 7.8: The overall architecture of the processor

7.6 Other Issues

A somewhat simplified view of the final pipeline is shown in Figure 7.8.


There are seven pipeline stages (fetch, decode, read operands, register
forwarding, execute 1, execute 2, and writeback). (Although it could
arguably be said that the PC register is yet another pipeline register.)
Also, the MAC unit is not shown in the pipeline figure (see below for
more information).

7.6.1 Register File


The register file is 32-bit wide and contains 16 entries. It has two read
ports and one write port and it is implemented using distributed RAM.
The register file itself can probably be extended to 32 entries without any
timing problems, but the forwarding logic described in Section 7.2 may
have some performance problems in that case.

7.6.2 Input/Output
Input and output are handled through special instructions that can write
to and read from a number of I/O ports. Only one output port and one
input port are implemented, but it is also possible to update internal
registers like the address generator registers using these instructions.
The instruction word is 27 bits. In the performance numbers quoted in this thesis a
32-bit memory was used regardless of this, but if a larger program has to
be used, the reduced number of bits in the instruction word means that
3 block RAMs with 9-bit wide memory ports could be placed in parallel
instead of using 4 block RAMs.

7.6.3 Flag Generation


There are four status flags (zero, overflow, negative, and carry) and arith-
metic and logic instructions are able to influence these flags. Of these
flags the most tricky to generate is the zero flag since that flag depends
on the entire result of an arithmetic or logical operation. To optimize
this, the Z flag generation is partially done in the arithmetic unit and
logic unit on the combinational outputs from the adder/logic unit. In
the logic unit, all 32 bits are preprocessed down to 8 bits by using a logic
or operation on 4 bits at a time. In the arithmetic unit there is not enough
time available to do this on all bits after the addition, but the 20 lower
bits are preprocessed into 5 bits in this way, which means that a total of
25 bits have to be considered in the next stage when generating the Z flag
instead of 64 bits.
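The same two-level compression can be expressed in C as follows; this is a behavioral sketch of the logic unit case, where all 32 result bits are reduced to 8 in the first stage.

```c
#include <stdint.h>

/* First level (inside the logic unit): OR together each group of four
 * result bits, compressing 32 bits to 8 "any bit set in this nibble"
 * bits. */
static uint8_t compress_nibbles(uint32_t result)
{
    uint8_t out = 0;
    for (int i = 0; i < 8; i++)
        if ((result >> (4 * i)) & 0xFu)
            out |= (uint8_t)(1u << i);
    return out;
}

/* Second level (next pipeline stage): Z is set when nothing was set. */
static int z_flag(uint8_t compressed)
{
    return compressed == 0;
}
```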

7.6.4 Branches
Delay slots are used for all branches. Apart from that, there are no penal-
ties for a correctly predicted branch. If the branch is mispredicted there
is a penalty of either three or four cycles depending on whether it is pre-
dicted as taken or not taken. Register indirect branches always have a
penalty of four cycles.
There are only absolute branches available, but this is not a problem
as the address space for the memories is only 16 bits wide and the target
address will fit into the instruction word. Finally, there is a loop instruc-
tion that allows small loops to be implemented with a minimum of loop
overhead.

7.6.5 Immediate Data


There is an eight bit immediate field that can be used instead of one of
the operands from the register file. If the most significant of these bits
is 0, the remaining seven bits are sign extended to a 32-bit value. If this
bit is 1, the remaining seven bits are used to address a constant memory
that contains 32 bit constants. It is therefore possible to use up to 128
arbitrary 32-bit constants without any penalty.
Finally, for those situations where 128 constants are not enough it is
also possible to use a special SETHI instruction with a 24 bit immediate.
These 24 bits will be concatenated with the eight bit immediate field the
next time an instruction with immediate data is used. This will lead to
a one instruction penalty when loading arbitrary 32-bit constants, which
is similar to many other RISC processors where two instructions are re-
quired to load arbitrary 32-bit values into registers.
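The decoding rule for the eight bit immediate field can be sketched in C as follows; the constant memory contents below are made up for illustration.

```c
#include <stdint.h>

/* Hypothetical constant memory (128 entries of 32-bit constants);
 * the values are made up for this example. */
static uint32_t const_mem[128] = { [0] = 0xDEADBEEFu, [5] = 100000u };

/* Decode the 8-bit immediate field: if the most significant bit is 0
 * the low seven bits are sign extended, otherwise they index the
 * constant memory. */
static uint32_t decode_imm(uint8_t imm)
{
    if (imm & 0x80)
        return const_mem[imm & 0x7F];
    if (imm & 0x40)                     /* negative 7-bit value */
        return imm | 0xFFFFFF80u;       /* fill the upper bits with ones */
    return imm;
}
```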

7.6.6 Memories and the MAC Unit


There are three different memory spaces available in the processor. The
address space for all memories is 16 bit wide, although only 2 KiB large
block RAMs have been used so far. What is not shown in the pipeline
in Figure 7.8 is that the second port of the constant memory (CM) and
data memory (MEM) is connected to the address generators described in
Section 7.3. A special part of the instruction word instructs the pipeline
to replace the OpA or OpB value in Figure 7.8 with a value read from
either the constant memory or the data memory.
When this ability is used in conjunction with the MAC operation, a
convolution can be performed efficiently. The MAC unit itself is
also not shown in the pipeline diagram, but it is based on instantiated
DSP48 blocks. It allows for 32x32 bit multiplication and 64 bit accumu-
lation. The results are accessed by reading from special registers as de-
scribed in Section 7.6.2. The MAC unit itself contains 6 pipeline stages
and the first stage is located in the Execute 1 stage.
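Functionally, the MAC operation and its use in a convolution reduce to the following; this is a behavioral model of the arithmetic only, whereas the real unit is a 6-stage pipeline built from DSP48 blocks.

```c
#include <stdint.h>

/* Behavioral model of the MAC operation: 32x32-bit signed multiply
 * with 64-bit accumulation. */
static int64_t mac(int64_t acc, int32_t a, int32_t b)
{
    return acc + (int64_t)a * (int64_t)b;
}

/* A convolution tap loop (dot product) then reduces to repeated MAC
 * steps, one per operand pair fed by the address generators. */
static int64_t dot(const int32_t *x, const int32_t *h, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc = mac(acc, x[i], h[i]);
    return acc;
}
```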

7.7 Performance

In the beginning of this project the author thought it unlikely that this
processor would be able to compete with established commercial FPGA
microprocessors like MicroBlaze, hence the initial focus on DSP process-
ing. However, an initial design (lacking many features/instructions)
written directly in Verilog could be synthesized to almost 400 MHz in
a Virtex-4 (speedgrade 12). At this point we realized that it might be pos-
sible to create a more general purpose microprocessor that can operate at
a much higher clock frequency than MicroBlaze, and the focus shifted
from a DSP processor to a microprocessor with DSP extensions.
This processor has been optimized for a specific FPGA architecture.
While it is possible to synthesize the design without any changes for the
Virtex-5 instead of the Virtex-4, the performance when doing so is not
higher than in Virtex-4. Higher performance could probably be reached
if the processor were redesigned around the 6 input LUTs of the Virtex-5.
In fact, Table 7.3 shows that the performance in a Virtex-5 is lower than
in a Virtex-4.
Device                  Fmax (MHz)    Area
xc4vlx80-12             334 (357†)    1682 LUTs, 1419 FFs, 3 RAMB16, 4 DSP48
xc5vlx85-3              320           1433 LUTs, 1419 FFs, 3 RAMB16, 4 DSP48E
ASIC (Direct port)      325           1.4 mm²
ASIC (Rewritten MAC)    500           1.3 mm²
† With floorplanning

Table 7.3: Performance and area of the xi2 processor

7.7.1 Porting the Processor to an ASIC

It is interesting to look at this processor when porting it to an ASIC. When
using a compatibility library for every FPGA construct, including the
DSP48 blocks, the area for the processor is 1.4 mm2 and the maximum
frequency is 325 MHz. In this case, the critical path is in the MAC unit.
This is relatively discouraging as it is slower than the xi processor de-
scribed in the previous chapter.
As the critical path is in the MAC unit it makes sense to investi-
gate this further. The performance of the MAC unit itself when synthe-
sized separately is shown in Table 7.4. Three different kinds of configu-
rations have been tried here. The first is the version that is based on the
DSP48 block. This configuration is only available for the Virtex-4 and
Virtex-5. The second configuration is based on a limited wrapper of the
DSP48 block which allows the MAC to be ported to an ASIC without any
changes to the source code. The final configuration is based on a rewrit-
ten MAC unit where the multiplier is implemented using a pipelined De-
signWare multiplier. When using this configuration the maximum clock
frequency of the processor is increased to 500 MHz and the area reduced
to 1.3 mm2 . Rewriting the MAC unit was clearly a good idea. In this case
the critical path is mainly caused by the path from the program memory
to the constant memory. Changing the memory to a version that is opti-
mized for speed instead of area will probably improve this (at a cost of
increased area/power), but we have not been able to test this yet due to
a lack of appropriate memory models.
Configuration    Device                 Fmax (MHz)    Area
DSP48            xc4vlx80-12            500           4×DSP48
DSP48            xc5vlx85-3             550           4×DSP48
DSP48 based      130 nm ASIC (speed)    424           0.19 mm²
DSP48 based      130 nm ASIC (area)     83            0.10 mm²
Rewritten        130 nm ASIC (speed)    743           0.13 mm²
Rewritten        130 nm ASIC (area)     63            0.058 mm²

Table 7.4: The performance of the MAC unit in different architectures

Another memory related problem with this design in an ASIC is that
dual port memories have been used for both the constant memory and
the data memory. For the constant memory it is easy to separate it into
two memories, one for immediate constants used in the program and
one for constants that should be used for convolution operations. For the
data memory, one of the ports is used to read data during convolution
operations and the other port is used for normal read/write access to the
memory. This is a harder issue to fix, as it may require a redesign of the
pipeline. One possibility could be to use a dual port memory for a small
part of the address space and single port memories for the remaining
address space. The drawback would be that the programmer would then
have to place circular buffers in the correct memory region.

7.8 Comparison with Related Work


An early processor optimized for FPGAs is described in [44] which de-
scribes a RISC based CPU mapped to a Xilinx XC4005XL FPGA. While
the FPGA is old, this is still a very interesting publication that highlights
many issues that are still true today.
However, soft processor cores were not popular until larger FPGA
devices appeared. Currently, the major FPGA vendors have their own
solutions in the form of MicroBlaze [45], Nios II [46], and Mico32 [47].
Readers interested in soft CPUs for Altera FPGAs are encouraged to read
James Ball's chapter in [48]. Another interesting Altera related publica-
tion discusses how to optimize an ALU for Altera FPGAs [49].
Finally, while not explicitly designed for FPGAs, Leon [50] and Open-
Risc [51] can both easily be synthesized to FPGAs but the performance
will not be very good when compared to FPGA optimized processors.

7.8.1 MicroBlaze
The most natural processor to compare this work with is the MicroBlaze
from Xilinx, which has been optimized for Xilinx FPGAs. The maximum
clock frequency of MicroBlaze is around 200 MHz [52] in a Virtex-4 of
the fastest speedgrade. However, it should be noted that MicroBlaze is
a much more complete processor than the processor described
in this chapter. For example, MicroBlaze has better forwarding, cache
support, and support for stalling the processor when a hazard occurs.
At the same clock frequency, it is very likely that the performance of Mi-
croBlaze will be higher than the processor described here. On the other
hand, xi2 has a maximum clock frequency which is over 70% higher than
MicroBlaze. We believe that xi2 will still have a comfortable performance
advantage for many applications, especially those that involve DSP algo-
rithms.
Unfortunately the source code of MicroBlaze is not publicly available
so it is not possible to investigate how well it will perform in an ASIC.

7.8.2 OpenRisc
The OpenRisc or1200 processor is an open source 32-bit microprocessor
that is available at the OpenCores website. It has similar features to the
MicroBlaze. When a version of the or1200 processor with 8 KiB instruc-
tion and data cache + 4 KiB scratch pad memory was synthesized to a
130 nm process the maximum frequency was around 200 MHz. When
synthesized to an FPGA, the performance was around 94 MHz.
It is clear that the or1200 processor is not optimized for FPGAs, but
it is also interesting to note that the performance of our xi2 processor
is significantly higher in an ASIC as well. In fairness to the OpenRisc
processor, however, it should be noted that the or1200 is certainly more
general than our processor.

7.9 Future Work


While the processor architecture described in this publication seems very
promising, it does have some quirks that make it harder to use. Regard-
less of these quirks we believe that it would be worthwhile to continue
the development of this processor and that the general architecture is a
good one.
One thing that is currently missing is interrupt support. This would
be relatively easy to add, especially if the interrupt delay is permitted to
be slightly non-deterministic so that the CPU can never be interrupted in
a delay slot. Another thing that could be added is more instructions,
although this will probably require the instruction word to be extended
beyond 27 bits. Most notably a division instruction of some sort is miss-
ing.
The issues mentioned in the previous paragraph are relatively mi-
nor but there are a few major shortcomings in the current architecture.
The lack of caches is a serious problem if general purpose development
should be done on this architecture. Handling cache misses is a non-
trivial problem and it is not clear how this can be implemented in the
best way on this processor.
Another major issue is the lack of a compiler. This means that it is
very hard to benchmark the processor using realistic benchmarks. Right
now we believe the processor to be good based on the fact that it should
be better than the xi processor in many ways (more forwarding, 32-bit in-
teger operations instead of 16-bit, better AGUs, higher clock frequency,
and branch prediction). Since we consider the xi processor to be a suc-
cess, it follows that xi2 should be even better. However, without running
real benchmarks on the processor we cannot prove this.
Another interesting research direction would be to continue the work
started in the previous chapter and try to design a processor that has
very high performance in both FPGAs and ASICs with a minimum of
customizations for each architecture.

7.10 Conclusions
We believe that the processor described in this chapter is a very promis-
ing architecture. An improved version of this processor with a cache and
a compiler may be a serious alternative to other FPGA optimized proces-
sors, especially for DSP tasks where data dependencies can usually be
avoided by careful code scheduling.
The high Fmax of this design could be reached by carefully investigat-
ing the critical paths in all parts of the processor during the entire design
flow. By tailoring the architecture around these paths it was possible to
reach 357 MHz in a Virtex-4 (speedgrade 12). However, the final solution
represents a compromise between frequency and flexibility. One of the
tradeoffs is that the pipeline of the processor is visible to the program-
mer. For example, data hazards have to be managed in software since
the processor cannot detect them and stall. A good toolchain
should be able to compensate for this, but this will also mean that it will
be hard to retain binary compatibility if the processor is improved.
When the processor is ported to a 130 nm ASIC process the perfor-
mance of the processor is 500 MHz and is mostly limited by the perfor-
mance of the memory blocks. To reach this performance in the ASIC
port it was necessary to rewrite the MAC unit to avoid the limitations
enforced by the DSP48 blocks in the FPGA version.
Chapter 8
Floating point modules

Abstract: This chapter describes how floating point components can be opti-
mized for FPGAs. The focus in this chapter is to create a high speed floating
point adder and multiplier with relatively low latency. The final solution has a
maximum frequency which is higher than that of previous publications at a
comparable pipeline depth, although this comes at the price of a larger design
area. The floating
point components can be easily ported to an ASIC with only minor modifica-
tions.

All calculations in DSP systems can be performed using fixed point


numbers. However, in many cases the dynamic range of the data makes
it difficult to design a solution using only fixed point. Either the fixed
point numbers will be very wide and hence waste resources or scaling
has to be employed which will increase the design time.
To avoid these problems it is possible to use floating point numbers
instead. If floating point numbers are used, the result of an operation
will automatically be scaled and it is therefore possible to represent a
much larger dynamic range than a fixed point number of the same width.
The use of floating point numbers can also reduce the width of memo-
ries, which can decrease the amount of on-chip memories as described in
Chapter 6.
This chapter concentrates on floating point addition/subtraction and
multiplication as these are the most commonly used operators.

8.1 Related Work


Floating point arithmetic has of course been intensively studied, but there
are not so many recent publications for FPGAs. A paper which studies
a tradeoff between area, pipeline depth and performance is [53]. How-
ever, their design is probably too general to utilize the full potential of
the FPGA.
A very impressive floating point FFT core is presented in [54]. The
core can operate at 400 MHz in a Virtex-4 and the inputs and outputs are
single precision IEEE 754 compliant. However, numbers are not normal-
ized inside the core and the latency is high.
Xilinx has IP cores for both double and single precision floating point
numbers [55]. Nallatech also has some floating point modules avail-
able [56]. Neither has chosen to publish any details about their imple-
mentation though.
Another interesting project is FPLibrary at the Arenaire project [57].
Although the modules are not extremely fast, the source code is available
and the project has many function blocks available besides addition and
multiplication.

8.2 Designing Floating Point Modules


A floating point multiplier is quite easy to implement. The mantissas are
multiplied and the exponents are added. Depending on the result of the
multiplication, the result might have to be right shifted one step and the
exponent adjusted. A typical pipeline for floating point multiplication is
shown in Figure 8.1.
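The datapath just described can be modeled in a few lines of Python (a behavioral sketch of unsigned magnitudes only; signs, rounding and special values are ignored for clarity, and mantissas carry an explicit leading one):

```python
def fp_mul(man_a, exp_a, man_b, exp_b, frac_bits=23):
    """Behavioral model of the multiplier pipeline in Figure 8.1.
    Mantissas are integers with an explicit leading one, i.e. in
    [2**frac_bits, 2**(frac_bits + 1)); truncation instead of rounding."""
    product = man_a * man_b       # done by the DSP48 blocks in the FPGA
    exponent = exp_a + exp_b
    # The product of two normalized mantissas lies in [1.0, 4.0), so at
    # most one right shift is needed to renormalize.
    if product >> (2 * frac_bits + 1):
        exponent += 1                         # product was >= 2.0
        mantissa = product >> (frac_bits + 1)
    else:
        mantissa = product >> frac_bits
    return mantissa, exponent

ONE = 1 << 23                              # 1.0 in 1.23 format
assert fp_mul(ONE, 0, ONE, 0) == (ONE, 0)  # 1.0 * 1.0 = 1.0
```

Since the product of two normalized mantissas lies in [1.0, 4.0), the single conditional right shift is all the normalization a multiplier ever needs, which is why it is so much easier to build than the adder.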
Floating point addition is more complicated than multiplication. At
first the exponents are compared and the mantissas are aligned so that
the exponents of both mantissas are equal. After the addition the result

Figure 8.1: Typical floating point multiplier

Figure 8.2: Typical floating point adder



must be normalized and the exponent updated. If two positive numbers
or two negative numbers are added, the normalization is just as easy as
in the case of the multiplier. At worst, the mantissa has to be right shifted
one step. However, if a negative and a positive number are added, the
mantissa might have to be left shifted an arbitrary number of times (this
is often referred to as cancellation). Finally, the exponent must also be
updated depending on how many times the mantissa was shifted. A
typical pipeline for floating point addition is shown in Figure 8.2. By
comparing the magnitude as the first step we make sure that the smallest
number is always sent to the “Align Mantissa” module. This also assures
that the result of a subtraction in the third pipeline stage will always be
positive. Finally, in the exponent part of the pipeline, it is only necessary
to retain the exponent of the largest number.
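The pipeline just described corresponds to the following behavioral model (again unsigned magnitudes only, no rounding or special values; mantissas carry an explicit leading one):

```python
def fp_add(man_a, exp_a, man_b, exp_b, subtract=False, frac_bits=23):
    """Behavioral model of the adder pipeline in Figure 8.2."""
    # Compare/Swap and Compare/Select: route the smaller operand to the
    # alignment shifter and keep only the larger exponent.
    if (exp_a, man_a) < (exp_b, man_b):
        man_a, exp_a, man_b, exp_b = man_b, exp_b, man_a, exp_a
    man_b >>= exp_a - exp_b                # Align Mantissa
    # The swap guarantees that a subtraction result is never negative.
    result = man_a - man_b if subtract else man_a + man_b
    exponent = exp_a
    # Normalization: at most one right shift on carry-out, but an
    # arbitrary left shift after cancellation.
    if result >> (frac_bits + 1):
        result >>= 1
        exponent += 1
    else:
        while result and not result >> frac_bits:
            result <<= 1
            exponent -= 1
    return result, exponent

ONE = 1 << 23
assert fp_add(ONE, 0, ONE, 0) == (ONE, 1)  # 1.0 + 1.0 = 2.0
```

The while loop is the cancellation case: in hardware, this arbitrary left shift is the expensive normalization step addressed in Section 8.5.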

8.3 Unoptimized Floating Point Hardware

As a first performance test, the floating point adder and multiplier from [58]
were synthesized to a Virtex-4 (speedgrade 12). These modules are closely
based on the architectures in Figures 8.1 and 8.2. The major difference
is that the multiplier consists of four pipeline stages instead of two. The
source code of these modules was written in VHDL and was not opti-
mized for FPGA usage. The maximum frequencies of the multiplier and
adder are 207 MHz and 190 MHz respectively when synthesized with ISE
9.2. The mantissa is 16 bits and an implicit 1 is used instead of an ex-
plicit 1¹. The exponent is 6 bits and a sign bit is used to represent the
signedness of the number.

¹ If an implicit 1 is used, the first bit in a floating point number is assumed to be set to 1.
This is always the case with a normalized floating point number. If an explicit 1 is used, the
first 1 is stored in the mantissa, which is necessary if unnormalized floating point numbers
are to be used.

©2006 IEEE. Reprinted from Norchip Conference, 2006. 24th, High Performance, Low Latency FPGA based Floating
Point Adder and Multiplier Units in a Virtex 4, Karlström, P., Ehliar, A., Liu, D.

Figure 8.3: Parallelized normalizer

8.4 Optimizing the Multiplier


The critical path of the multiplier in the previous section is in the DSP48
multiplier block. The problem is that the synthesizer does not instantiate a
pipelined multiplier correctly. By manually instantiating the appropriate
DSP48 component, the floating point multiplier can operate at over 400
MHz without any further FPGA optimizations. This example demon-
strates that it is easy to get good performance out of a floating point mul-
tiplier in an FPGA.

8.5 Optimizing the Adder


The main bottleneck in the floating point adder mentioned above, when
synthesized for an FPGA, is the normalizer. The normalizer must quickly
both determine how many steps the mantissa should be shifted and
perform the shift. This is quite complicated
to do efficiently in an FPGA. One way to optimize this part is by using
parallelism. A normalize unit is constructed which can only normalize
a number with up to four leading zeros. Several of these units are then
placed in parallel. A priority decoder is used to determine the first unit
with less than four leading zeros. A final mux selects the correct man-
tissa and an adjustment factor for the exponent. This is illustrated in
Figure 8.3 for a mantissa width of 23 bits.
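The scheme can be modeled as follows (a Python sketch for a 24-bit mantissa, i.e. 23 bits plus the implicit one; in hardware all units and the priority decoder work in parallel, whereas the loop below scans them sequentially):

```python
def parallel_normalize(mantissa, width=24):
    """Model of the parallelized normalizer in Figure 8.3: each unit
    handles one 4-bit group (pre-shifted by 0, 4, 8, ... bits), a
    priority decoder picks the first group containing a one, and a
    final mux selects that unit's shifted mantissa and the exponent
    adjustment (the total shift amount)."""
    assert mantissa != 0
    for group in range(0, width, 4):
        # "ff1 in 4": find the leading one within this 4-bit window.
        window = (mantissa << group) >> (width - 4) & 0xF
        if window:
            small = 4 - window.bit_length()       # 0..3 leading zeros
            shift = group + small
            return (mantissa << shift) & ((1 << width) - 1), shift

# A mantissa with 5 leading zeros is handled by the second unit:
assert parallel_normalize(1 << 18) == (1 << 23, 5)
```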
By incorporating this three stage normalizer into our floating point
adder and extending the compare/select pipeline stage into two stages,
we have constructed a floating point adder capable of operating at 361 MHz.
In this case the width of the mantissa is 23 bits (plus the implicit 1) and
the exponent 8 bits. This can be seen in Figure 8.4.
Other optimizations include a special signal to zero out the mantissa
of the smallest number (marked with 1 in the figure) if the control signal
to the shifter (marked with 2) is too large. This means that the shifter
itself only has to consider the five least significant bits. The cost of the
special “set to zero” signal is small as an otherwise unused LUT input in
the adder is used for it as shown in Figure 8.5.
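A behavioral sketch of this trick (the threshold of 32 follows from the 5-bit shifter; the function name is illustrative):

```python
def align_smaller_mantissa(mantissa, exp_diff):
    """Model of the alignment optimization described above: the shifter
    only looks at the five least significant bits of the exponent
    difference, while a separate "set to zero" signal (a spare LUT
    input in the adder, Figure 8.5) kills the operand whenever the
    difference is too large for the 5-bit shifter to handle."""
    set_to_zero = exp_diff >= 32          # control signal 2 is too large
    if set_to_zero:
        return 0                          # signal 1: zero out the operand
    return mantissa >> (exp_diff & 0x1F)  # shifter sees only 5 bits
```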
Given these optimizations, the floating point adder with 7 pipeline
stages is able to support a throughput of 361 MHz in a Virtex-4 (speed-
grade 12).

8.6 Comparison with Related Work


A comparison with related work shows that our performance is simi-
lar to that of commercial IP cores. In Table 8.1, we compare our modules
to the floating point modules generated by CoreGen in ISE. As can be
seen, our modules have a higher frequency but also a higher area usage.
Our architecture is therefore interesting for designs where it is necessary
to keep the latency low and where a higher area is acceptable. However,
if the Xilinx components are configured for maximum throughput they
will be faster than our solution, but the latency

©2006 IEEE. Reprinted from Norchip Conference, 2006. 24th, High Performance, Low Latency FPGA based Floating
Point Adder and Multiplier Units in a Virtex 4, Karlström, P., Ehliar, A., Liu, D.

Figure 8.4: Optimized floating point adder



©2006 IEEE. Reprinted from Norchip Conference, 2006. 24th, High Performance, Low Latency FPGA based Floating
Point Adder and Multiplier Units in a Virtex 4, Karlström, P., Ehliar, A., Liu, D.

Figure 8.5: Adder/subtracter with set to zero functionality

Design             Fmax (MHz)    LUTs    Flip-flops    DSP48
Our FP Adder       370           894     643           0
Our FP Mul         372           140     281           4
Xilinx FP Adder    320           547     404           0
Xilinx FP Mul      341           122     290           4

Table 8.1: Performance comparison of two different floating point imple-
mentations with 8 pipeline stages in a Virtex-4 speedgrade -12

will also be much higher than our solution. (Up to 16 pipeline stages in
the adder and 11 pipeline stages in the multiplier.)

8.7 ASIC Considerations


In an ASIC, the performance of these floating point components is not
very good when ported using the compatibility library described in Sec-
tion 5.2.

Technology             Component    Fmax (MHz)    Area
130 nm ASIC (speed)    FP adder     796           0.050 mm²
130 nm ASIC (area)     FP adder     146           0.037 mm²
130 nm ASIC (speed)    FP mul       978           0.072 mm²
130 nm ASIC (area)     FP mul       83            0.042 mm²

Table 8.2: Performance of the floating point components when ported to an ASIC.

The critical path in the ASIC version of the adder is in the addition
stage, which may be caused by the fact that this part is implemented
using instantiated FPGA primitives as seen in Figure 8.5. Once this part
is replaced with code that infers the same functionality the performance
is increased from around 400 MHz to 796 MHz.
For the multiplier, the critical path is, not surprisingly, in the multi-
plier itself. When using the DSP48 based version ported to the ASIC, the
maximum performance is 421 MHz. When this version is replaced with a
DesignWare based multiplier the performance increases significantly.
The performance and area of the ASIC port is summarized in Table 8.2.

8.8 Conclusions
Optimizing floating point components for a specific FPGA is an inter-
esting problem with many opportunities to trade area for frequency and
vice versa. It is typically not a problem to create a high performance
floating point multiplier in an FPGA, but a floating point adder is a real
challenge, especially the normalization stage.
By liberal use of instantiated FPGA primitives it was possible to reach
a very high performance, even higher than Xilinx’ floating point adder
with the same pipeline length. The price we pay for the performance is
a higher area which means that our floating point adder is a good choice
when few but fast adders are required. Our adder would probably be a
good choice for a soft microprocessor whereas Xilinx’ adder would be a
good choice when a datapath with high throughput but a modest latency
requirement is needed.
The designs will not have very good performance in an ASIC if they
are ported directly without any modifications at all, but the performance
increases dramatically after minor modifications to the design even though
most of the design still consists of many instantiated FPGA primitives.
Part III

On-Chip Networks

Chapter 9
On-chip Interconnects

Abstract: This chapter is intended to serve as a brief introduction to on-chip
interconnections. Readers who are already familiar with this concept may wish
to skip this chapter.

9.1 Buses
A bus is a simple way to connect different parts of a computer at low cost.
This was recognized early on, as even the very first computers, such as
the electromechanical Z3 [59], utilized buses.
The advantages of a bus are clear when looking at Figure 9.1. Instead
of 20 dedicated connections between the five components there is only
one shared bus to which all components are connected. This will
significantly reduce the complexity and cost of the system under the
assumption that the traffic between the components does not overload
the bus.
One assumption here is that the total number of messages that will
be sent during a certain time period does not exceed the capacity of the
bus during the same time period. In many cases, this will also mean that
messages must be buffered for a while before they can be sent over the
bus.
Traditionally, buses were implemented using three-state drivers to


(a) Connecting five components with dedicated connections

(b) Connecting five components with a shared bus

Figure 9.1: Dedicated connections vs a shared bus

save area but this is very rarely used for on-chip buses any longer due to
the increased verification cost and slow performance of such buses [4].
Instead, on-chip buses can be implemented by using muxes as shown in
Figure 9.2. (Unless otherwise noted, a bus in this thesis refers to a bus
implemented using multiplexers.)

9.1.1 Bus Performance


A variety of parameters affect the performance of a bus. The most im-
portant parameters are the operating frequency, the width of the bus and
the choice of bus protocol.

(a) Bus based on three-state signals (b) Bus based on muxes

Figure 9.2: Buses implemented using three-state signaling and muxes

Another thing that will indirectly impact the performance is the num-
ber of components connected to the bus. An on-chip bus with many
components connected to it will not be able to operate as fast as a bus
with few components due to physical constraints (e.g. wire delays).

9.1.2 Bus Protocols


A simple bus protocol has only two types of transactions: single address
read and single address write. Typically it is possible for a bus slave to
delay a transaction if it cannot respond immediately, for example if a read
transaction to external memory would take more than one clock cycle to
complete.
In many cases a bus master may want to read more than one word at
a time. This is sometimes handled by signaling how many words to read
in the beginning of the transaction. This is how the PLB bus in IBM’s
CoreConnect architecture operates. Another way is to merely signal the
intention to read multiple words which will allow a slave to specula-
tively read ahead in memory. This is how the PCI bus works.
If a bus is used by many bus masters and there are some high latency
devices connected to the bus, a single bus master could lock the bus for
hundreds of cycles when reading from a slow device (for example,
trying to read an I/O register in a PCI device behind a PCI bridge).
One way to avoid this problem is to allow a slave to tell a bus master to
retry the transaction later, when the requested value may be available.
This allows the bus to be used by another bus master while the value
requested by the first bus master is fetched.
A drawback with bus retries is that the bus can be saturated with
retry requests. A common way to avoid this is by using split transac-
tions. The idea is similar to how retried transactions work, but instead of
the bus master retrying the transaction, the slave will automatically send
the requested value back to the bus master using a special kind of bus
transaction.
Besides the techniques outlined above, most buses also have some
sort of error control. This can be used both to signal a link layer error
(e.g. a parity or CRC error) and to signal that the current transaction is
invalid in some way (e.g. writing to a read only register).

9.1.3 Arbitration
If there is more than one potential bus master connected to a bus it is
important to make sure that only one bus master is granted access to the
bus at a time. A popular way to solve this is to use an arbiter to which
all bus masters are connected. If a bus master needs to access the bus, it
first requests access from the arbiter, which will grant access to only one
bus master at a time. If more than one bus master requests access at
the same time, a variety of algorithms can be used based on for example
priorities or fairness.
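As an example of a fairness-based policy, a round-robin arbiter can be sketched as follows (a Python model of the grant decision for one cycle; the names are illustrative):

```python
def round_robin_arbiter(requests, last_grant):
    """Round-robin grant decision: the search for a requesting master
    starts just after the most recently granted one, so every master
    eventually gets its turn. `requests` is one boolean per master."""
    n = len(requests)
    for offset in range(1, n + 1):
        candidate = (last_grant + offset) % n
        if requests[candidate]:
            return candidate        # grant exactly one master
    return None                     # no requests: bus idle this cycle

# Masters 0 and 2 both request; master 0 was granted last, so 2 wins:
assert round_robin_arbiter([True, False, True], last_grant=0) == 2
```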
It is also possible to statically schedule all bus transactions at design
time. This is useful in real time systems where failure to meet a deadline
(because of for example a busy bus) can be catastrophic. The drawback
with static scheduling is that it is hard to analyze and schedule a complex
system with many components and several buses.

9.1.4 Buses and Bridges


If a single bus does not provide enough performance for a certain system
the system may be partitioned so that it contains more than one bus.

Figure 9.3: Partitioning a design into two buses

Consider for example the hypothetical system shown in Figure 9.3. The
system is divided so that it contains two buses instead of one. One of
the buses connects the graphics unit to the graphics memory and the
other connects the CPU to the main memory and other peripherals. The
system also includes a bus bridge that allows the CPU to access the
graphics unit and its memory. The idea behind this division is that the
CPU rarely needs to access graphics memory and the graphics unit rarely
(or never) needs to access main memory. In this way, the memory accesses
performed by the graphics unit to refresh the screen are not noticed by
the CPU, except when the CPU accesses graphics memory, and vice versa.

Since it is easier to create a fast bus if it is connected to only a few
components, the system described in the previous paragraph will be easier
to design and integrate than a system with similar performance utilizing
only one bus.

Figure 9.4: Crossbar implemented with muxes

9.1.5 Crossbars

In some situations it is not possible to divide a system as easily as the
hypothetical system described above. In the worst case, all components
frequently communicate with all other components connected to the bus.
If the bandwidth requirements in this case are high enough, it will not be
possible to use either a single bus or several buses.
In this situation the designer can use a crossbar instead of a bus. A
crossbar is an interconnect component that will allow any input port to
communicate with any output port. A crossbar can be constructed by
using muxes, as seen in Figure 9.4.
The area cost for a crossbar is significantly higher than for a single
bus. While the cost of a bus is roughly linear with the number of ports,
the cost of a crossbar increases quadratically as more ports are added to
it. Similarly to a simple bus, the latency of a crossbar will increase as
more ports are added to it.
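Functionally, a mux-based crossbar is just one N-to-1 mux per output port, which is also where the quadratic cost comes from (a minimal Python model; arbitration for conflicting requests is not modeled):

```python
def crossbar(inputs, selects):
    """Model of the mux-based crossbar in Figure 9.4: each output port
    has its own N-to-1 mux selecting one of the inputs, so mux cost
    grows with inputs x outputs, i.e. quadratically for a square
    crossbar."""
    return [inputs[sel] for sel in selects]

# A 3x3 crossbar routing input 2 -> output 0, 0 -> 1, 1 -> 2:
assert crossbar(["a", "b", "c"], [2, 0, 1]) == ["c", "a", "b"]
```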

Figure 9.5: 12 modules connected using a crossbar and a distributed network

9.2 On Chip Networks


In an ideal world, a designer would be able to select a crossbar of a suit-
able size and connect all components to it. Unfortunately the cost of
a large crossbar is prohibitive in most designs and compromises have to
be made. One way to do this is basically an extension of Section 9.1.4.
Instead of dividing a single bus into two separate buses, the system now
uses many different crossbars and buses. The advantage is that it is eas-
ier to create a system with high throughput if individual crossbars are
kept small. An illustration of this is seen in Figure 9.5. This is usually
called Network-on-Chip (NoC) or sometimes On-Chip-Network (OCN).

9.2.1 Network Protocols


A network is substantially more complex than a bus and the protocols
used on a network reflect that. Perhaps the most important decision to
make when designing a network is to choose between a circuit switched
and a packet switched network.


A circuit switched network is based on exclusive connections that
are set up for a long period of time between different components. The
archetypal example of circuit switching is the telephone system (although
this is no longer completely true, especially as VoIP is becoming more pop-
ular). The main advantage of such a network is that it is easy to imple-
ment. Once a connection is set up, each node along the route knows that
the input from a certain port should always be sent directly to another
port. The disadvantage is that a circuit switched network is often inef-
ficient at using the available bandwidth.
Packet switching on the other hand is based on a system where a
connection is set up for the duration of an incoming packet and torn down
when the last part of that packet has been sent to the next switch. The
main advantage is that it is much easier to utilize a communication link
fully in a packet switched network because links are allocated to packets
instead of to a connection. That way most allocations are short lived
and the link can be reused for another connection immediately after a
packet has been sent. The disadvantage is that it is necessary to buffer
packets if a link is busy whereas a circuit switched NoC does not need
any buffering once a connection has been established.
There are two ways to specify the destination address for a NoC
packet. A NoC with source routing specifies the complete route that the
packet will take through the NoC from the beginning. In a NoC with dis-
tributed routing, only the destination address is sent with the packet.

9.2.2 Deadlocks

A deadlock can occur when there is a circular dependency on resources
in a system. Consider a system where module X will first try to allocate
resource a on the first clock cycle and then resource b on the second
clock cycle, while module Y will first try to allocate resource b and
then resource a. Neither module will release a resource before it has
allocated both resources. If X allocates a and Y allocates b, the system will

[Figure: a shared bus connecting a CPU (master port), a memory (slave
port), and an accelerator with both a slave and a master port.
Step 1: The CPU issues a read request to the accelerator.
Step 2: The accelerator must read a value from memory to answer the
read request.
Step 3: Deadlock, because the bus is already busy.]

Figure 9.6: Deadlock caused by a badly designed system

deadlock because neither module can release a resource before allocating


the other resource.
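The circular allocation described above can be made concrete with a small simulation. This is an illustrative sketch, not code from the thesis; the resource names a and b follow the text, and the "never release" policy matches the scenario described.

```python
# Sketch of the deadlock scenario: X wants a then b, Y wants b then a,
# and neither releases anything before it holds both resources.

def step(modules, owner):
    """One 'clock cycle': every module tries to grab its next resource.
    Nothing is ever released, as in the text."""
    progressed = False
    for name, wanted, held in modules:
        if wanted and owner.get(wanted[0]) is None:
            owner[wanted[0]] = name          # allocation succeeds
            held.append(wanted.pop(0))
            progressed = True
    return progressed

modules = [("X", ["a", "b"], []), ("Y", ["b", "a"], [])]
owner = {}

step(modules, owner)                         # cycle 1: X gets a, Y gets b
stuck = not step(modules, owner) and any(w for _, w, _ in modules)
print(stuck)  # True: circular wait, neither module can ever proceed
```

After the first cycle each module holds the resource the other one needs, so every subsequent cycle makes no progress.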
As an example of how a deadlock can appear in the context of a bus,
Figure 9.6 shows a hypothetical system with a shared bus, a CPU acting
as master, an accelerator acting as both slave and master, and a memory
acting as a slave. In the figure, the CPU tries to read from the accelerator,
which causes the accelerator to try to access the bus. Since the bus is
already used by the CPU, no further progress can be made. (The situation
above could be solved if the accelerator could issue a “retry” to the CPU.)
Because a NoC is a distributed system there are many more oppor-
tunities for a deadlock to occur. By restricting the number of possible
routes it is possible to create a NoC where it is impossible for a circular
dependency to form. A well known method for this in a 2D mesh is X-Y
routing. A complete discussion of routing algorithms is outside the scope
of this thesis; the reader is instead referred to, for example, [60] for an
in-depth discussion.
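The deadlock freedom of X-Y routing can be checked mechanically on a small mesh: build the channel dependency graph (an edge between two links whenever some packet may use them consecutively) and verify that it has no cycle. The sketch below is my own illustration, not taken from [60].

```python
# Check that X-Y routing on a small 2D mesh produces an acyclic channel
# dependency graph, which is the standard condition for deadlock freedom.

def xy_next(cur, dst):
    """Next switch under X-Y routing: exhaust X distance before Y."""
    (x, y), (dx, dy) = cur, dst
    if dx > x: return (x + 1, y)
    if dx < x: return (x - 1, y)
    if dy > y: return (x, y + 1)
    if dy < y: return (x, y - 1)
    return None

def channel_deps(w, h):
    """All (link, link) dependencies produced by every src-dst route."""
    nodes = [(x, y) for x in range(w) for y in range(h)]
    deps = set()
    for src in nodes:
        for dst in nodes:
            cur, prev_link = src, None
            while cur != dst:
                nxt = xy_next(cur, dst)
                link = (cur, nxt)
                if prev_link is not None:
                    deps.add((prev_link, link))
                prev_link, cur = link, nxt
    return deps

def has_cycle(deps):
    """DFS cycle detection on the dependency graph."""
    graph = {}
    for a, b in deps:
        graph.setdefault(a, []).append(b)
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {}
    def dfs(v):
        color[v] = GRAY
        for u in graph.get(v, []):
            c = color.get(u, WHITE)
            if c == GRAY or (c == WHITE and dfs(u)):
                return True
        color[v] = BLACK
        return False
    return any(color.get(v, WHITE) == WHITE and dfs(v) for v in graph)

print(has_cycle(channel_deps(3, 3)))  # False: X-Y routing is deadlock free
```

Because packets only ever turn from an X channel into a Y channel, never the other way around, no circular channel dependency can form.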

However, simply using a deadlock free routing algorithm is not going


to guarantee that the system is deadlock free. The example in Figure 9.6
shows a deadlock caused by the components connected to the bus and
not the bus itself. Similarly, a NoC with deadlock free routing can still
be part of a deadlock if the components connected to the bus are badly
designed.
To avoid a deadlock in a NoC, a device connected to the NoC must
be guaranteed to eventually accept an incoming packet for delivery. It
doesn’t have to accept it immediately, but the acceptance should not de-
pend on being able to send a packet to the NoC.

9.2.3 Livelocks
A livelock situation is similar to a deadlock. If we modify the system
described first in Section 9.2.2 so that a module automatically releases an
allocated resource if it cannot allocate a required resource the following
situation may occur:

• X allocates a, Y allocates b

• X fails to allocate b, Y fails to allocate a

• X releases a, Y releases b

• X allocates a, Y allocates b

• ...

In this case a livelock has occurred. Each module is continually doing


something, but the system is not able to perform any real work.
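The retry loop above can be sketched as follows. The back-off policy (release everything and retry on the next cycle) is an assumption made for illustration; it is exactly the modification to the deadlock example that the text describes.

```python
# Sketch of the livelock: each module releases its resource when the
# second allocation fails, then retries, forever making no real progress.

def run(max_cycles=8):
    owner = {}
    want = {"X": ["a", "b"], "Y": ["b", "a"]}
    held = {"X": [], "Y": []}
    done = set()
    for _ in range(max_cycles):
        # phase 1: everyone tries to allocate its next resource
        for m in ("X", "Y"):
            if m in done:
                continue
            nxt = want[m][len(held[m])]
            if owner.get(nxt) is None:
                owner[nxt] = m
                held[m].append(nxt)
        # phase 2: anyone still missing a resource backs off completely
        for m in ("X", "Y"):
            if m in done:
                continue
            if len(held[m]) == len(want[m]):
                done.add(m)              # got both resources: real progress
            elif held[m]:
                for r in held[m]:
                    owner[r] = None      # release and retry next cycle
                held[m] = []
    return done

print(run())  # set(): plenty of activity, but no module ever finishes
```

Every cycle both modules allocate, collide, and release in lockstep, so the set of finished modules stays empty no matter how long the simulation runs.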
Chapter 10
Network-on-Chip
Architectures for FPGAs

Abstract: In this chapter we will investigate the performance of Network on


Chip architectures optimized for FPGAs. This is a challenging research problem
because FPGAs are not very suitable for NoCs in the first place due to the high
area cost of multiplexers. In this chapter the focus has been to create a highly
optimized NoC architecture that is based on well understood principles. Both
a circuit switched and a packet switched NoC were investigated although the
packet switched NoC is probably a better fit for FPGAs than a circuit switched
NoC. The maximum frequency of the packet switched NoC is 320 MHz in a
Virtex-4 of the fastest speedgrade and the latency through an unloaded switch is
3 clock cycles. When compared with other publications this is a very good result.

10.1 Introduction
At the Division of Computer Engineering we have a relatively long his-
tory of NoC research targeted to ASICs. The work described in this chap-
ter is targeted at FPGAs instead while partially building on experiences
gained from the SoCBUS [61] research project.


There are many challenges and opportunities in FPGA based NoC


design. Many issues are identical or very similar to an ASIC based NoC,
such as high level protocol and design partitioning. On the other hand,
the architecture of an FPGA based NoC is limited by the FPGA whereas
an ASIC based NoC can be optimized down to the layout of individual
transistors in the most extreme case.

While early FPGAs were small enough to barely justify an on-chip


bus, the largest FPGA today can fit a significant number of complex
IP cores. One of the largest FPGAs available on the market today, the
Virtex-4 LX200 has almost 180000 available 4 input look-up tables (LUTs).
This can be compared to the resource usage of for example the Openrisc
1200 processor, which consumes around 5000 LUTs when synthesized to
a Xilinx FPGA. Over 30 such processors or other IP cores of comparable
complexity could fit into one such FPGA. It is only a matter of time
before FPGAs of similar complexity become available at cost-effective
prices. At that point, designs will need an efficient and scalable
interconnection structure, and many researchers believe that the
Network-on-Chip research area will provide this structure.

When this case study was initiated, few FPGA based NoCs seemed to
exist that really pushed an FPGA to its limits. It therefore made sense to
take a critical look at FPGAs to try to create an optimal match between
NoC architecture and FPGA architecture. The goal was to optimize fairly simple
focuses on FPGA optimization techniques for NoCs instead of new and
novel NoC architectures.

When the case study was initiated, statically scheduled NoCs were
deliberately excluded from the study as they are quite easy to implement
in an FPGA using for example the architecture in Figure 10.1. Except for
specialized applications, NoCs capable of handling dynamically changing
traffic are more interesting.

[Figure: a switch where the north, west, local, and south inputs feed
output multiplexers whose select signals come from a schedule memory
addressed by a counter.]

Figure 10.1: Minimalistic statically scheduled NoC switch

10.2 Buses and Crossbars in an FPGA

In Figure 10.2 a comparison is shown where the area and maximum fre-
quency of a simple crossbar and a simple bus are shown for various num-
ber of ports. A Virtex-4, speedgrade 12 was used in this comparison.
(Note that no modules are connected to this bus so Figure 10.2 shows the
ideal case where the entire FPGA can be dedicated solely to the bus.)
It is no surprise that the maximum operating frequency of the
components drops as more components are added, both for the bus and the
crossbar. And of course, while not shown in the graph, the area of the
crossbar grows extremely large as the number of ports is increased.
It should be noted that neither the bus, nor the crossbar was pipelined
in this comparison. Faster operation (at the expense of increased area)
could be had by pipelining the bus/crossbar. This example is still valid
for the majority of uses since many buses are not pipelined in practice.
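To see why crossbar area blows up, the following back-of-the-envelope sketch estimates LUT counts. The model (one 4-input LUT per 2:1 mux, an n-to-1 mux built as a tree of 2:1 muxes, a 36-bit datapath matching the NoC data width used later) is my own assumption, not a figure from the text, but it shows the roughly quadratic growth.

```python
# Rough LUT-count model for an n-port, w-bit full crossbar: each output
# bit needs an n-to-1 mux, built from (n - 1) two-to-one muxes, with one
# 4-input LUT per 2:1 mux. Total LUTs therefore grow as n * w * (n - 1).

def crossbar_luts(ports, width=36):
    muxes_per_output_bit = ports - 1   # n-to-1 mux as a tree of 2:1 muxes
    return ports * width * muxes_per_output_bit

for n in (4, 8, 16, 32):
    print(n, crossbar_luts(n))   # 4: 432, 8: 2016, 16: 8640, 32: 35712
```

Going from 8 to 32 ports multiplies the estimated LUT count by almost 18x, which is consistent with the observation that the crossbar area grows extremely large as ports are added.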

[Figure: plot of maximum frequency in MHz (0-450) against the number of
ports (4-26) for a bus and a crossbar; both curves fall as ports are
added.]

Figure 10.2: Maximum frequency for a bus and a crossbar with various
number of ports

10.3 Typical IP Core Frequencies

To determine what frequency a NoC should operate at in order to be us-


able, we need to determine typical frequencies for the IP cores that will
be connected to it. As an example of what to expect from typical IP cores,
the maximum frequencies of 40 cores synthesized for a Virtex-4 were
taken from the datasheets available at the Xilinx IP Center [62]. This in-
cludes cores with a wide range of functionality including floating point,
image coding, memory controller, and cryptographic cores. Extremely
simple cores that are just wrappers around for example Block RAMs,
DSP blocks, or distributed RAMs have been deliberately excluded in this
figure as these cores are intended to be instantiated by a module that
typically cannot achieve the same kind of operating frequencies as the

Fmax [MHz]  IP core                         Fmax [MHz]  IP core
 88         MD5                             187         Floating point comparator
100         H.264 Encoder, Baseline         200         3GPP Turbo Encoder
107         XPS Ethernet Lite MAC           200         DDR SDRAM Controller
133         SHA-384, SHA-512                200         GFP
138         Modular Exponentiation Engine   200         MPEG-2 HDTV I & P Encoder
141         LIN Controller                  200         MPEG-2 SDTV I & P Encoder
143         JPEG-C                          200         SHA-1, SHA-256, MD5
148         JPEG-D                          201         16550 UART w/ FIFO
148         JPEG-E                          204         IPsec ESP Engine
162         IEEE 802.16e CTC Decoder        215         Tiny AES
165         Interleaver/Deinterleaver       225         3GPP Turbo Decoder
166         Floating point adder            225         H.264 Deblocker
166         Floating point divider          228         AES Fast Encryptor/Decryptor
166         MPEG-2 HDTV/SDTV Decoder        238         DDR SDRAM Controller
167         Floating point square root      250         AES-CCM
167         SDRAM Controller                255         Standard AES Encrypt/Decrypt
173         Integer to Floating point       256         PRNG
175         LZRW3 Data Compression          267         DDR2 SDRAM Controller
183         Floating point multiplier       292         IEEE 802.16e CTC Encoder
184         Floating point to integer       322         3GPP2 Turbo Encoder

Table 10.1: Maximum frequency of various Virtex-4 based cores

primitives themselves. The numbers are summarized in Table 10.1. (Note
that this includes cores synthesized for speedgrade -10, -11, and -12
Virtex-4 devices.)

While this data is by no means a complete survey of typical IP core
frequencies, it should at least give an indication of what one can expect
from typical IP cores on a Virtex-4 based device. The data indicates that
frequencies above 250 MHz are currently uncommon. Based on these numbers,
a NoC is therefore usable by most cores if it can operate at between 200
and 250 MHz. When compared with Figure 10.2, this means that a crossbar
based solution will start to have problems when more than 10 typical IP
cores are connected to it.
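The 200-250 MHz target can be checked directly against the datasheet numbers in Table 10.1:

```python
# Tally the 40 core frequencies from Table 10.1: only a handful of the
# cores exceed 250 MHz, supporting the 200-250 MHz NoC target.

freqs = [88, 100, 107, 133, 138, 141, 143, 148, 148, 162,
         165, 166, 166, 166, 167, 167, 173, 175, 183, 184,
         187, 200, 200, 200, 200, 200, 200, 201, 204, 215,
         225, 225, 228, 238, 250, 255, 256, 267, 292, 322]

above = sum(f > 250 for f in freqs)
print(len(freqs), above)  # 40 5: just 5 of the 40 cores run above 250 MHz
```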

10.4 Choosing a NoC Configuration


As mentioned in Section 9.2.1, there are several protocols that can be used
over a NoC. To determine their relative merits in an FPGA three different
types of NoC switches for an FPGA were implemented:

• Circuit switched

• Packet switched

• Minimalistic (no congestion/flow control)

The minimalistic NoC is included to get an idea of the maximum
performance it is possible to get from a NoC optimized for an FPGA. It
does not contain any sort of congestion control; if two words destined
for the same output port arrive at a switch simultaneously, one word will be
ignored. While not very useful except for specialized applications, the
clock frequency and area of this switch should be hard to improve on if
distributed routing is used.

10.4.1 Hybrid Routing Mechanism


As mentioned earlier, there are two major kinds of routing mechanisms
that can be used in a NoC, source routing and distributed routing. Source
routing has the advantage that the routing decision in each switch is very
simple. Distributed routing has the advantage that only the destination
address is needed to determine the output port in a switch. Typically
the route lookup in distributed routing will be part of the critical path of
a switch. To retain most of the advantages of distributed routing while
minimizing the drawback, a hybrid between source and distributed rout-
ing is used. A switch determines the output port for the next switch
instead of the current switch. The output port of the current switch is
directly available as a one-hot coded input signal. Figure 10.3 illustrates
this compared to distributed routing.
In the proposed NoC, five signals are used to signal the final destina-
tion address. As a further optimization for the 2D mesh case, there is no

[Figure, two parts. (a) A switch performs route lookup for the current
output port: a packet from node 9 destined for node 14 carries only the
final destination, and every switch on the path must look up its own
output port before it can forward the packet. (b) A switch performs
route lookup for the next switch: the packet arrives with its output
port already set, so the switch forwards it immediately while looking
up the port for the hop after that; at the last hop the final
destination no longer matters.]

Figure 10.3: Route lookup as performed in the three NoCs

possibility for a message to be sent back to the same direction that it came
from. Given this limitation and a maximum of 32 destination nodes, the
route lookup tables can in theory handle any kind of topology with up to
32 destination nodes and any kind of deterministic routing algorithm. In
practice a routing algorithm and topology that is deadlock free, such as
the well known X-Y routing algorithm on a 2D mesh, should be used. (If
the limit of 32 destination nodes turns out to be a problem, it should be
easy to extend this NoC to support more destination nodes if some sort
of hierarchical addressing scheme is acceptable.)
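The hybrid scheme can be illustrated by generating the lookup tables themselves. The sketch below is my own illustration with an assumed mesh orientation (south = increasing y); the actual NoC stores the ports one-hot coded, while letters are used here for readability.

```python
# Each switch's table is indexed by the destination node and stores the
# output port that the NEXT switch along the X-Y route should use, so
# the current switch can forward without a lookup on its critical path.

def xy_port(cur, dst):
    """Output port a switch at cur would use for dst under X-Y routing."""
    (x, y), (dx, dy) = cur, dst
    if dx > x: return "E"
    if dx < x: return "W"
    if dy > y: return "S"
    if dy < y: return "N"
    return "LOCAL"

def next_hop(cur, dst):
    """Coordinates of the switch that cur forwards to for dst."""
    x, y = cur
    return {"E": (x + 1, y), "W": (x - 1, y),
            "S": (x, y + 1), "N": (x, y - 1)}.get(xy_port(cur, dst), cur)

def route_table(cur, mesh=(4, 4)):
    """destination -> port that the *next* switch should use."""
    return {(dx, dy): xy_port(next_hop(cur, (dx, dy)), (dx, dy))
            for dx in range(mesh[0]) for dy in range(mesh[1])}

t = route_table((1, 1))
print(t[(3, 1)])   # 'E': next switch (2,1) must also go east
print(t[(2, 2)])   # 'S': next switch (2,1) turns south
```

Note how the lookup for the current switch's own port is off the critical path: the packet arrives already carrying it, and the table only computes the port for the hop after that.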

In addition to the routing mechanism described above, similar signaling
is used for all NoCs as shown in Table 10.2. The following sections

Table 10.2: Data and control signals in a unidirectional NoC link

Signals used in all three NoCs
Signal name  Direction           Width  Description
Strobe       Sender to receiver  1      Qualifies a valid transaction
Data         Sender to receiver  36     Used as data signals
Last         Sender to receiver  1      Last data in transaction
Dest         Sender to receiver  5      Address of destination node
Route        Sender to receiver  3-4    Destination port on the switch
                                        (one hot coded)
Only in packet switched NoC
Signal name  Direction           Width  Description
Ready        Receiver to sender  1      The remote node is ready to
                                        receive data
Only in circuit switched NoC
Signal name  Direction           Width  Description
Nack         Receiver to sender  1      A connection setup was not
                                        successful
Ack          Receiver to sender  1      Acknowledges a successful
                                        connection

will describe each type of network in detail.

10.4.2 Packet Switched

The most complex part of this switch is the input part, which is shown
in Figure 10.4. The FIFO is based on SRL16 primitives that allows a very
compact 16 entry FIFO to be constructed. As the SRL16 has relatively
slow outputs, a register is placed immediately after the SRL16. This
means that the input part has a latency of two cycles in case the FIFO
is empty and the output port is available.
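The two-cycle behavior can be modeled abstractly. This is a behavioral sketch of the FIFO-plus-register structure, not the actual HDL, and it ignores flow control:

```python
# A 16-entry SRL16-style FIFO followed by an output register: a word
# written into an empty FIFO is visible at the registered output two
# clock edges later, matching the two-cycle latency described above.

from collections import deque

class Srl16Fifo:
    def __init__(self):
        self.srl = deque(maxlen=16)   # models the SRL16 shift register
        self.out_reg = None           # register masking the slow SRL output

    def clock(self, data_in=None):
        """One clock edge: register the FIFO head, then accept new data."""
        self.out_reg = self.srl.popleft() if self.srl else None
        if data_in is not None:
            self.srl.append(data_in)
        return self.out_reg

f = Srl16Fifo()
print(f.clock(0xAB) is None)   # cycle 1: word has only entered the SRL
print(f.clock() == 0xAB)       # cycle 2: word reaches the output register
```

The extra register costs one cycle of latency but hides the relatively slow SRL16 output from the rest of the switch logic.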

[Figure: the input port. DEST[4:0] feeds a route lookup producing
NEXTROUTE[2:0]; DATA[35:0] passes through an SRL16-based shift register
FIFO and flip-flops and is sent to all output ports; LAST is sent to all
arbiters; a CHECK EMPTY block gates the ROUTE_* signals going to the
north, east, and south output ports; a read-enable generator combines
the FIFO address, the signals from the north, east, and south arbiters,
and the READY signals from the output ports.]

©2007 IEEE. Reprinted from Field Programmable Logic and Applications, 2007.
FPL 2007. International Conference on, An FPGA Based Open Source
Network-on-Chip Architecture, Ehliar, A., Liu, D.

Figure 10.4: A detailed view of an input port of the packet switched NoC
switch

The block named “check empty” makes sure that no spurious ROUTE_*
signals are sent to the arbiter if the FIFO is empty. By doing this, the ar-
biter will be simplified as compared to having both the ROUTE_* signals
and separate signals for WEST_EMPTY, NORTH_EMPTY, etc. In partic-
ular, it is easier to identify the case where only one input port needs to
send a packet to the output port and send the packet immediately with-
out any arbitration delay.
The block that generates the read enable signal to the input FIFO has
to consider a large number of signals and it is therefore crucial to imple-
ment that block efficiently and place it so the routing delay is minimized.
Through RLOC directives most of that logic can be placed into one CLB
in order to minimize the routing delay. Finally, the READY signal is ad-
justed for pipeline latency so that the FIFO will not overflow if the sender
does not stop sending as soon as READY goes low.
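A sketch of this adjustment follows. CAPACITY matches the 16-entry SRL16 FIFO; the in-flight latency of 3 cycles is an assumed figure for illustration, not one given in the text.

```python
# READY must be deasserted while there is still room for the words the
# sender may emit during the cycles before it observes READY going low.

CAPACITY = 16          # depth of the SRL16-based input FIFO
PIPELINE_LATENCY = 3   # assumed worst-case cycles the sender keeps sending

def ready(occupancy):
    """Assert READY only if occupancy plus in-flight words still fit."""
    return occupancy + PIPELINE_LATENCY < CAPACITY

print(ready(12))  # True: 12 stored + 3 in-flight words still fit
print(ready(13))  # False: a 3-cycle overrun could overflow the FIFO
```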

[Figure: the output port. An arbiter takes the SEL_* and LASTFLIT_*
signals for inputs A, B, and C, drives STROBE and the A/B/C_CHOSEN
signals, and controls a tree of 2:1 muxes that selects one of
DAT_A[45:0], DAT_B[45:0], and DAT_C[45:0] as DAT_O[45:0].]

©2007 IEEE. Reprinted from Field Programmable Logic and Applications, 2007.
FPL 2007. International Conference on, An FPGA Based Open Source
Network-on-Chip Architecture, Ehliar, A., Liu, D.

Figure 10.5: A view of the output part of the packet switched NoC switch

Figure 10.5 shows a detailed view of an output port of a 4-port switch.


Each output port can only select from one of three input ports since there
should be no need to route a packet back to where it came from in most
topologies. If more than one input port needs to send a packet to the
same output port, an arbiter in the output port uses round robin to select
the port that may send. In the four port NoC switch, the output port is
essentially a 3-to-1 mux controlled by the arbiter (or a 4-to-1 mux in the
case of a five port switch). The DAT_* signals are formed by combining
the destination address with the NEXTROUTE_* signals and the payload
signal. It should also be noted that the part inside the dotted rectangle
can be implemented inside one LUT for each wire in the DAT_* signals
which will reduce the delay for this part somewhat.
The latency of the switch when the input FIFO is empty and the out-
put port is available is 3 clock cycles. If more than one input port has
data for a certain output port, the latency is increased to 4 clock cycles
due to an arbitration delay of one clock cycle for the packet that wins the
arbitration.
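The round-robin policy in the arbiter can be sketched as follows. This is an illustration of the policy only, not the actual implementation; the real arbiter also grants without delay when only a single input requests, a shortcut omitted here.

```python
# Round-robin grant among three requesting input ports: the grant
# rotates so the most recently served input gets the lowest priority.

class RoundRobinArbiter:
    def __init__(self, n=3):
        self.n = n
        self.last = n - 1   # start so input 0 has highest priority

    def grant(self, requests):
        """requests: list of n bools. Returns granted index or None."""
        for i in range(1, self.n + 1):
            idx = (self.last + i) % self.n
            if requests[idx]:
                self.last = idx
                return idx
        return None

arb = RoundRobinArbiter()
print(arb.grant([True, True, False]))  # 0
print(arb.grant([True, True, False]))  # 1 (0 was just served)
print(arb.grant([True, True, False]))  # 0 (2 never requests)
```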
There are two main critical paths in this switch. One path is caused by
the read enable signal that is sent to the input FIFO. The other is from the

FIFO to the route look-up due to the slow output of the SRL16 elements.

10.4.3 Circuit Switched NoC

The circuit switched NoC has a similar design to the SoCBUS [61] net-
work on chip architecture. The main difference between the circuit switched
and the packet switched NoC is that there are no FIFOs in the input
nodes. If the output port is occupied, a negative acknowledgment is
instead sent back to the transmitter. In this case the transmitter has to
reissue the connection request at a later time. Correspondingly, an ac-
knowledgment is sent once the packet has reached the destination.
The overall design is similar to packet switched version with the ex-
ception of the input module. In the circuit switched switch there is no
input FIFO as mentioned earlier. The arbiter is also different from the
arbiter in the packet switched version. In particular, it has to arbitrate
immediately if two or more connections arrive simultaneously to one
output port. It does this by using a fixed priority for each input port.
The critical path of the circuit switched switch is the arbiter, which
has to decide immediately if a circuit setup request should be accepted
or rejected.

10.4.4 Minimal NoC

The main reason for including this architecture is to provide an upper


bound on the achievable performance of an FPGA based NoC. Due to
its low complexity it should be hard to create a NoC with distributed
routing that can run at a higher frequency without making severe com-
promises on area and latency.
This NoC architecture does not use arbitration at all. If two words
arrive at one output port at the same time, one of them is discarded. This
means that such a network would have to be statically scheduled or use
some other means of guaranteeing that messages do not collide if a lost
or damaged message is unacceptable.

Table 10.3: Frequency/area of the different NoC architectures in different
FPGAs

FPGA          Packet    Packet    Circuit   Circuit   No congestion
              switched  switched  switched  switched  control
              (4 port)  (5 port)  (4 port)  (5 port)  (4 port)
xc4vlx80-12   321 MHz   284 MHz   341 MHz   315 MHz   506 MHz
xc4vlx80-10   232 MHz   206 MHz   230 MHz   225 MHz   374 MHz
xc2vp30-7     267 MHz   235 MHz   284 MHz   264 MHz   390 MHz
xc2v6000-4    176 MHz   160 MHz   190 MHz   183 MHz   241 MHz

Resource utilization (in Virtex-4)
LUTs          784       1070      633       828       396
Flip flops    448       572       452       595       368
Latency       3         3         2         2         2
(cycles)

10.4.5 Comparing the NoC Architectures

The performance and area of the three different NoCs are shown in
Table 10.3. The resource utilization of individual modules of the NoC
switches can be found in Table 10.4. LUTs that were only used for
route-thru are also included in these numbers. ISE 10.1 was used for
synthesis and place and route. Note that the performance numbers in
Table 10.3 are only for a single switch. The performance of the NoC will
also be affected by the distance between the switches, but this should
not be a huge problem since flip-flops are used on both the inputs and
outputs of the NoC switches. An experiment on a Virtex-4 SX35 has shown
that a NoC with 12 nodes and 4 switches is not limited by the distance
between the switches even though the switches were placed in different
corners of the FPGA.
The clock frequencies of the packet switched and circuit switched net-

Switch type        Module type   LUTs  Flip-flops
5 ports,           Arbiter       29    8
packet switched    Input FIFO    92    64
                   Output Mux    98    47
4 ports,           Arbiter       20    6
packet switched    Input FIFO    68    55
                   Output Mux    96    47
4 ports,           Arbiter       29    5
circuit switched   Input Module  20    59
                   Output Mux    98    47
5 ports,           Arbiter       41    6
circuit switched   Input Module  20    64
                   Output Mux    101   47
4 ports,           Input module  2     45
no congestion      Output mux    92    46
control

Table 10.4: The resource utilization of the individual parts of the switches

works are both high, although there is a relatively large gap to the upper
limit established by the NoC without congestion control. A more efficient
flow control mechanism would certainly be a welcome addition to these
NoCs although inventing such an architecture is probably non-trivial.
Due to the small difference in performance between the circuit switched
and packet switched networks, the packet switched network is probably
the best fit for Xilinx FPGAs.

10.5 Wishbone to NoC Bridge


In addition to the NoC switches described above, a bridge between the
Wishbone [63] bus and the packet switched NoC has been developed.
The general architecture of the bridge is shown in Figure 10.6. A write
request issued to the bridge from the Wishbone bus is handled using

[Figure: the bridge datapath. On the Wishbone side, an address generator
and a read request FIFO feed the NoC uplink via a route lookup; on the
NoC side, an input FIFO delivers incoming data and drives the Wishbone
ACK, address, and data signals.]

©2007 IEEE. Reprinted from Field Programmable Logic and Applications, 2007.
FPL 2007. International Conference on, An FPGA Based Open Source
Network-on-Chip Architecture, Ehliar, A., Liu, D.

Figure 10.6: Simplified view of the data flow of the Wishbone to NoC
bridge.

posted writes. That is, the write request is immediately acknowledged to


the Wishbone master even though there will be a delay of several cycles
before the write is guaranteed to reach its destination. A read request
is handled by issuing retries for all read requests on the Wishbone bus
until the requested value has been returned over the NoC. (This is very
similar to how a PCI bridge works.)

To avoid deadlocks, read requests have lower priority than write re-
quests and read replies. That is, the bridge will immediately service a
write request in the input FIFO. The bridge will also service a read reply
as soon as possible (by waiting for the originator of the read request to
retry the read). On the other hand, if a read request comes in from the
NoC downlink it cannot be serviced until the NoC uplink is available.
This means that read requests have to be queued in a separate queue if

the NoC uplink is not available (which will happen if the FIFO in the
NoC switch the bridge is connected to is full). At the moment, the de-
signer has to make sure that the read request queue is large enough to
hold all possible incoming read requests that could be issued to a certain
Wishbone bridge over the NoC.
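The service priority described above can be sketched as a selection function. The queue names and structure are my own illustration, not the bridge's actual implementation:

```python
# Deadlock-avoiding service order: posted writes and read replies are
# always drained first; a read request is only accepted when the NoC
# uplink can carry the eventual reply.

from collections import deque

def pick_next(queues, uplink_ready):
    """queues: dict of deques 'write', 'read_reply', 'read_request'."""
    if queues["write"]:
        return queues["write"].popleft()      # posted writes drain first
    if queues["read_reply"]:
        return queues["read_reply"].popleft()
    if queues["read_request"] and uplink_ready:
        return queues["read_request"].popleft()
    return None   # read requests stay queued while the uplink is busy

q = {"write": deque(["w1"]), "read_reply": deque(["r1"]),
     "read_request": deque(["q1"])}
print(pick_next(q, uplink_ready=False))  # 'w1'
print(pick_next(q, uplink_ready=False))  # 'r1'
print(pick_next(q, uplink_ready=False))  # None: request blocked on uplink
print(pick_next(q, uplink_ready=True))   # 'q1'
```

Because writes and replies never wait behind a blocked read request, the circular wait that caused the deadlock in Figure 9.6 cannot form here.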
A big problem with the Wishbone bus in the context of bus bridges is
that the bus has been designed with a combinatorial bus in mind. While
Wishbone does provide a couple of signals for burst handling, the only
length indication for a linear burst is the fact that at least one more word
is requested. To mitigate this, the bridge has an input signal that is used
for reads to indicate the number of words to read, but this is no longer
strictly Wishbone compliant.
Another area where the bridge is not fully Wishbone compatible is the
error handling. It would be relatively easy to add support for the ERR
signal to read requests/replies. Unfortunately, writes cannot be imple-
mented using posted write requests if the ERR signal in Wishbone should
be handled correctly. The easiest way to add error handling would be to
add a status register that can be read by a master processor so that such
errors can be detected by the operating system.
In our opinion, the complexity of the bridge is not a good sign. An
interesting future research topic would be how to design a simple bus
protocol which can both serve a bus at high performance while at the
same time being easy to connect to a NoC.

10.6 Related Work


The ASIC community has been very active in the NoC research area. An
early paper discussing the advantages of a NoC in an ASIC was written
by Dally and Towles [64]. Other well known NoC architectures for ASICs
include the Aethereal research project [65] and xpipes [66].
The FPGA community has not been quite as active but recently a
number of publications have appeared. There seems to be a lot of in-
terest in NoCs as a way to connect dynamically reconfigurable IP cores

or even using dynamic reconfiguration on the NoC itself [67].


Bartic et al. describe a packet switched network on a Virtex-II Pro
FPGA [68]. Another packet switched network is described by Nachiket
et al. and compared to a statically scheduled network on a Virtex-II
6000 [69]. There are also circuit switched FPGA based networks such as
PNoC [70] which has also been studied in the context of an application
and compared to a system with a shared bus.
Recently, a number of high speed NoCs have been presented such as
MoCReS [71]. This is a packet switched NoC with support for both vir-
tual channels and different clock domains with a reported performance
of up to 357 MHz in a Virtex-4 LX 100. MoCReS seems to primarily utilize
a BlockRAM per link, which makes it a much more expensive solution than
the solution presented here. The latency of a single switch is also
not reported in the paper. Another recent publication is [72], in which
the author describes a NoC intended for distributed and safety-critical
real time systems which is pseudo-statically scheduled and makes heavy
use of Cyclone II’s 4kbit embedded memories.

10.7 Availability
The source code for the packet switched NoC can be downloaded at
http://www.da.isy.liu.se/research/soc/fpganoc/. The Wish-
bone to NoC bridge is also available for download. Hopefully this will
allow NoC researchers interested in FPGAs to easily compare their NoC
against another NoC with good performance in an FPGA.

10.8 ASIC Ports


By using the compatibility library described in Section 5.2 it was possible
to port the NoC to a 130nm ASIC process. Table 10.5 shows the perfor-
mance and area of various NoC configurations and compares it to the
performance of the FPGA versions.

Technology                     Configuration           Fmax     Area
xc4vlx80-12                    4 port packet switched  320 MHz  784 LUT, 448 FF
130nm ASIC (speed optimized)   4 port packet switched  705 MHz  0.20 mm2
130nm ASIC (area optimized)    4 port packet switched  89 MHz   0.18 mm2
xc4vlx80-12                    5 port packet switched  284 MHz  1070 LUT, 572 FF
130nm ASIC (speed optimized)   5 port packet switched  599 MHz  0.26 mm2
130nm ASIC (area optimized)    5 port packet switched  84 MHz   0.24 mm2

Table 10.5: Performance of packet switched NoC in different technologies

It is interesting that the critical path is still in the read enable signal
to the FIFOs in the input ports even in the ASIC version of the packet
switched switch. However, as can be seen in the table there is little pos-
sibility to improve the ASIC timing by trading area for frequency. This is
not surprising as the switch consists mostly of muxes and flip-flops and
there is little the synthesizer can do about these.
The difference between the packet switched NoC switch and circuit
switched NoC switch is substantial in the ASIC port. This is because
the packet switched switch is using many SRL16 primitives. While this
primitive is very cost effective in an FPGA as it allows a LUT to be used
as a 16 bit shift register, it is likely to be expensive to port this to an ASIC.
In fact, since a circuit switched NoC is so much cheaper it is actually
possible to use a much more complex network with more nodes in it
if circuit switching is used instead of packet switching. Doubling the
number of switches in the network is not a problem area-wise. In fact, it
is possible to both double the number of switches and the width of the
links and still use less area than the packet switched network in an ASIC

Technology                     Configuration            Fmax     Area
xc4vlx80-12                    4 port circuit switched  341 MHz  633 LUT, 452 FF
130nm ASIC (speed optimized)   4 port circuit switched  948 MHz  0.025 mm2
130nm ASIC (area optimized)    4 port circuit switched  259 MHz  0.023 mm2
xc4vlx80-12                    5 port circuit switched  315 MHz  828 LUT, 595 FF
130nm ASIC (speed optimized)   5 port circuit switched  846 MHz  0.038 mm2
130nm ASIC (area optimized)    5 port circuit switched  232 MHz  0.032 mm2

Table 10.6: Performance of circuit switched NoC in different technologies

when using these components.

10.9 Conclusions
It is possible to create high speed NoC switches on a Xilinx FPGA that
are both fast and relatively small. By manually instantiating FPGA prim-
itives it is possible to achieve the level of control which is needed to reach
the highest performance. Floorplanning is not a requirement to reach this
performance, but investigating the output from the placer was necessary
to understand how the design could be further optimized at many times
during the development.
In our experience, circuit switched and packet switched NoCs will
have roughly the same operating frequency and area in Xilinx devices
and the developer is therefore free to choose which to use depending on
his or her needs. However, if the design might eventually be ported to an
ASIC, the packet switched NoC will be much more expensive in terms
of area than the circuit switched NoC. In fact, in terms of area, a packet

switched NoC is more than 5 times as expensive as a circuit switched


NoC. The maximum frequency of a circuit switched node is also slightly
higher than for a packet switched node which is another advantage of
the circuit switched network in an ASIC.
However, it is possible that once Network-on-Chip reaches main-
stream acceptance in the ASIC community it will be possible to buy hard
IP blocks with a NoC switch created using full custom methods. Under
this assumption it will probably become natural to replace FPGA based
NoC switches with optimized ASIC versions in the same way that block
RAMs are replaced with custom memory blocks when porting a design
to an ASIC.
However, it is not so easy to interface a NoC to a normal bus. An
interesting future research area would be to design a simple bus protocol
that has a high performance on a regular bus while still being easy to
interface to a high speed NoC.
Part IV

Custom FPGA Backend Tools
Chapter 11
FPGA Backend Tools

Abstract: Sometimes a designer encounters a situation where the FPGA ven-


dor’s tool is not quite good enough. Of course, in most cases the existing tools
are adequate, even if they are not optimal by any means. However, sometimes
there are situations where a designer would really like a little more control over
the backend part of the design flow. This chapter is intended to serve as an inspi-
ration for those who would like to write their own backend tools for the Xilinx
design flow.

11.1 Introduction
XDL is a file format which contains a text version of Xilinx’ proprietary
NCD file format. (The NCD file format is used for netlists created by
both the mapper and the place and route tool.) The xdl command can
be used to convert between XDL and NCD. The XDL file format is no
longer documented by Xilinx, but earlier versions of ISE contained some
information about it [73].
Due to the simplicity of the file format it is quite easy to parse in a
custom program or script. Unfortunately, the parts which deal with
routing are difficult to understand, as they require knowledge about the
FPGA which is difficult to obtain.
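To illustrate how approachable the format is, a short Python script can pull structural information out of an XDL netlist. The XDL fragment below is hand-written for illustration (a simplified form of the `inst` line syntax), not taken from a real design:

```python
import re

# Hand-written fragment in the spirit of the XDL syntax (simplified).
SAMPLE_XDL = '''
inst "alu/add_0" "SLICEL", placed CLB_X10Y20 SLICE_X16Y40, cfg " ";
inst "alu/add_1" "SLICEL", placed CLB_X10Y20 SLICE_X16Y41, cfg " ";
inst "ctrl/state" "SLICEM", placed CLB_X11Y20 SLICE_X18Y40, cfg " ";
'''

# An XDL instance line starts with: inst "<name>" "<site type>", ...
INST_RE = re.compile(r'^inst\s+"([^"]+)"\s+"([^"]+)"', re.MULTILINE)

def count_sites(xdl_text):
    """Return a dict mapping site type (e.g. SLICEL) to instance count."""
    counts = {}
    for _name, site_type in INST_RE.findall(xdl_text):
        counts[site_type] = counts.get(site_type, 0) + 1
    return counts
```

A resource usage viewer like the one distributed with PyXDL is essentially this idea applied to the full grammar.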
It is also possible to modify a netlist in XDL format to add or change
functionality. A common use case for this is debugging: ChipScope [74]
is a logic analyzer developed by Xilinx which can be inserted into an
FPGA without having to resynthesize or place and route the entire
design.
A typical design flow for a backend tool utilizing the XDL file format
is shown in Figure 11.1. It is important to note that merely modifying the
XDL file is not enough. It is also necessary to modify the PCF constraints
file to avoid unspecified timing paths.

11.2 Related Work


A number of different ways to modify a design after synthesis have been
implemented. As already mentioned, Xilinx has their own tool which
allows a logic analyzer to be inserted [74].
At one point, Xilinx also distributed jbits [75], which allows a user
to manipulate a design in Java. Sadly, jbits has been discontinued and
doesn’t support any design newer than a Virtex-II.
Of course, it is also possible to use the fpga_editor to modify or
inspect a design. Unfortunately, this method is limited as there is no
general purpose scripting language in the FPGA editor.
A final tool of interest is abits [76], which allows an Atmel bitstream
to be manipulated.

11.3 PyXDL
PyXDL is a library designed by us for reading and writing XDL files.
While somewhat limited at the moment, it has three demonstration pro-
grams:

• Design viewer

• Resource usage viewer

• Logic analyzer inserter


©2007 FPGAWorld.com. Reprinted from 4th annual FPGAworld Conference, Thinking outside the flow:
Creating customized backend tools for Xilinx based designs, Ehliar, A., Liu, D.

Figure 11.1: Typical design flow when utilizing the XDL file format.
[The figure shows the standard Xilinx flow — Source code + Constraints
→ Synthesizer (xst) → NGC → ngdbuild → NGD → map → NCD (Mapped) +
Constraints (PCF) → par → NCD (Routed) → xdl → XDL (Routed) — extended
with the PyXDL design merger, which combines the routed XDL file and
its PCF constraints with a design to merge (mapped XDL). The merger
produces a partially routed XDL file and an updated PCF file, which are
converted back to NCD with xdl and finally routed with par.]

Of these, the logic analyzer inserter is probably the most useful. It
inserts a logic analyzer core into an already placed and routed design
and allows most signals to be probed. (Some signals, such as carry
chains, cannot easily be probed.) After insertion, the logic analyzer core
can be controlled via a serial port.

For those who are interested, PyXDL is available under the GPL at
http://www.da.isy.liu.se/~ehliar/pyxdl/ together with the sample
applications listed above.

11.4 Future Work


While the current version of PyXDL and its demo applications are
limited, the concept is very interesting. An obvious improvement would
be to extend the logic analyzer. Right now it is hard coded for a certain
channel width and memory depth; it would be desirable for these values
to be configurable at runtime instead.

Another improvement could be to add other types of instrumentation
to the logic analyzer core. Statistics gathering (possibly with some
interpreter for typical buses) would be easy to implement.

Some more interesting uses relate to partial reconfiguration. With
XDL it is easy to replace part of a design with another part. A most
interesting use of XDL and partial reconfiguration would be a tool which
automatically divides a large design into several parts that are loaded
into the FPGA on an as-needed basis using partial reconfiguration.
Part V

Conclusions and Future Work
Chapter 12
Conclusions

Optimizing a design for a certain platform will always involve trade-
offs between parameters such as performance, area, flexibility, and
development time. The case studies in this thesis were developed with
the intention of aiming for very high performance without sacrificing
too much flexibility or area. In all of these case studies, careful high
and low level optimizations allowed the designs to reach very high clock
frequencies.

12.1 Successful Case Studies


The first case study shows that a very high clock frequency can be achieved
in an FPGA based soft processor without resorting to a huge number of
pipeline stages. By optimizing the execution units, in particular the
arithmetic unit, it is possible to forward a result calculated in the
arithmetic unit immediately to itself without sacrificing the maximum
clock frequency. The processor can operate at 357 MHz in a Virtex-4 of
the fastest speed grade, which is significantly higher than other soft
processors for FPGAs. To achieve this frequency, both high and low-level
optimizations had to be used, including manual instantiation of LUTs and
manual floorplanning of the critical parts of the processor.

The second case study examined floating point adders and multipliers and
showed how these can be optimized for the Virtex-4. By parallelizing the
normalizer we could achieve a clock frequency of 370 MHz in the floating
point adder, with a latency of 8 clock cycles for a complete addition.
This is faster than previously published designs with the same latency
in clock cycles, although the speed comes at the price of a larger area.
Our floating point units should be a good match for situations where low
latency is important.
The final case study discusses how NoC architectures can be opti-
mized for FPGAs. We find that a circuit switched network will be smaller
than a packet switched network, but the difference is relatively small on
Xilinx FPGAs when using the SRL16 primitive. This means that a packet
switched network is very attractive to use in an FPGA.

12.2 Porting FPGA Optimized Designs to ASICs


Since all of the designs in this thesis depend on being able to instantiate
FPGA primitives, the designs could not be directly ported to an ASIC.
However, writing a small compatibility library with synthesizable versions
of flip-flops and other slice primitives allows the designs to be easily
ported to an ASIC. When ported directly using such a library, the
performance of the designs is adequate in the ASIC. However, by modifying
the designs slightly it was possible to increase the clock frequency
significantly. In both the processor and the floating point modules it was
necessary to replace the DSP48 based multiplier with a version using a
multiplier from Synopsys' DesignWare library. In the adder it was also
necessary to replace the adder based on instantiated FPGA primitives
with a behavioral version.
These were small changes that could be performed quickly and most
of the FPGA specific optimizations are still present, including some adders
consisting of instantiated FPGA primitives. This shows that FPGA opti-
mizations need not be an obstacle to a high performance ASIC port.
Chapter 13
Future Work

As with most other research projects, the work described in this thesis
cannot be considered finished. There are many interesting possibilities
for future research, and the most important of these are described in
this chapter.

13.1 FPGA Optimized DSP


The soft processor described in this thesis has a promising architecture.
However, it does lack some features that are necessary for massive de-
ployment. Most importantly, it does not currently have a compiler. Cre-
ating a basic backend for GCC should be a fairly easy task although get-
ting GCC to automatically use the DSP features of the processor will be
harder.
Another challenging problem is how to implement caches with very
high clock frequencies yet low enough latency to be suitable for this
processor. This is not only a matter of creating a fast cache; integrating
it into the processor is also a non-trivial problem, as a good strategy for
handling cache misses, particularly in the data cache, has to be invented.
(Cache misses during instruction fetch are easier to handle as they are
visible much earlier in the pipeline.) Caches are also more or less a
necessity if the address space should be increased from 16 bits to 32 bits.

The instruction set should be benchmarked thoroughly using standardized
benchmarks to determine whether any important instructions are missing.

Finally, there are some more minor details that could be added to
the processor without too much difficulty. Interrupts could be added
fairly easily if it is acceptable that a few clock cycles are spent to make
sure that the pipeline is not executing a delayed jump. If some sort of
improved branch prediction is used, it may be possible to avoid the use
of delay slots. Automatic stalling of the processor due to hazards would
also be a nice addition and could probably be implemented using extra
bits in the instruction word (these bits may only have to be present in the
instruction cache and not in main memory).

13.2 Floating Point Arithmetic on FPGAs

While the floating point adder and multiplier are probably the most
polished projects described in this thesis, there is still much that could
be done in this research area. The most obvious improvement is to make
the units more flexible by allowing parameters to specify the mantissa
width and exponent width. This is not very interesting from a research
perspective but very important from a practical one.

An interesting problem is how to create a fast floating point MAC
unit. Right now, the units are not really optimal for this since the
result of an addition is available only after eight cycles, which leads
to problems when trying to perform a MAC operation. A possible
workaround is to perform eight or more convolution operations
simultaneously, but a better solution would be to create a MAC unit
capable of accumulating one floating point value each cycle. How to do
this in an FPGA is far from obvious if high performance is desired,
especially if full IEEE-754 compliance is required.
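The interleaving workaround can be sketched in software: with an adder whose result is available only after several cycles, a sum is split into one independent partial accumulator per pipeline slot, at the cost of a final reduction. This is a behavioral model of the scheduling idea only, not a hardware implementation; the latency value of eight matches the adder described above:

```python
def pipelined_dot(a, b, latency=8):
    """Model a MAC loop on an adder with a `latency`-cycle result latency:
    keep `latency` independent partial sums so the adder can accept one
    new operand every cycle, then reduce them at the end."""
    acc = [0.0] * latency
    for i, (x, y) in enumerate(zip(a, b)):
        acc[i % latency] += x * y   # slot i % latency is free again by now
    return sum(acc)                 # final reduction after the main loop
```

The final reduction is exactly the part that a true single-cycle accumulating MAC unit would eliminate.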

13.3 Network-on-Chip
The NoC research area is still far from mature, so there is obviously
much to do here. In the implementation described in this thesis, the most
important improvement is probably to improve the bus bridge, for example
by reducing its area and allowing different clocks to be used on the NoC
and on the Wishbone bus.

Another interesting area is NoC friendly bus protocols, as many of
today's bus protocols are not very suitable when a pipelined NoC (or bus
for that matter) is used.

Finally, adding some support for quality of service would be a nice
addition to this NoC, although it is not clear if this can be done without
huge area penalties.

13.4 Backend Tools


The PyXDL package is already useful for a few tasks, but it is mainly
intended to show other researchers that it is fairly simple to manipulate
the netlists generated by the Xilinx software. It would only be a matter
of some programming to add support for more Xilinx FPGAs to PyXDL,
to improve the logic analyzer inserter, and to create more tools in the
same spirit, such as statistics gathering.

However, a much more interesting research direction would be to create
a toolchain that allows for automatic creation of partially dynamically
reconfigurable designs. A stable tool which allows this to be done could
be a huge boon to the FPGA community.

13.5 ASIC Friendly FPGA Designs


Further work is needed on how to design systems so that they are efficient
in both ASICs and FPGAs. While the data in this thesis indicates that
porting an FPGA optimized architecture is overall fairly easy with only
small changes to the designs, more work is required in this area,
especially to determine how the power consumption depends on the FPGA
optimizations. If structured ASICs increase in popularity it would also be
interesting to determine the impact of FPGA optimizations when porting
an FPGA design to a structured ASIC.
Bibliography

[1] D. Selwood, “Ip for complex fpgas,” FPGA and Structured ASIC
Journal, 2008. [Online]. Available: http://www.fpgajournal.com/
articles_2008/20081209_ip.htm

[2] S. Singh, “Death of the rloc?” Field-Programmable Custom Computing


Machines, 2000 IEEE Symposium on, pp. 145–152, 2000.

[3] Ray Andraka, private communication, 2009.

[4] M. Keating and P. Bricaud, Reuse Methodology Manual for System-On-


A-Chip Designs. Kluwer Academic Publishers, 2002.

[5] D. Liu, Embedded DSP Processor Design: application specific instruction


set processors. Elsevier Inc., Morgan Kaufmann Publishers, 2008,
ch. 4 DSP ASIP Design Flow.

[6] J. Stephenson, “Design guidelines for optimal results in


high-density fpgas,” in Design & Verification Conference,
2003. [Online]. Available: http://www.altera.com/literature/cp/
fpgas-optimal-results-396.pdf

[7] Altera, Guidance for Accurately Benchmarking FPGAs v1.2, 12


2007. [Online]. Available: http://www.altera.com/literature/wp/
wp-01040.pdf

[8] K.-C. Wu and Y.-W. Tsai, “Structured asic, evolution or revolution?”


in ISPD ’04: Proceedings of the 2004 international symposium on Physi-
cal design. New York, NY, USA: ACM, 2004, pp. 103–106.

[9] I. Kuon and J. Rose, “Measuring the gap between fpgas and asics,”
in Computer-Aided Design of Integrated Circuits and Systems, IEEE
Transactions on, 2007.

[10] Actel, “Igloo low-power flash fpgas handbook,” 2008. [Online].


Available: http://www.actel.com/documents/IGLOO_HB.pdf

[11] T. Tuan, S. Kao, A. Rahman, S. Das, and S. Trimberger, “A 90nm low-


power fpga for battery-powered applications,” in Proceedings of the
2006 ACM/SIGDA 14th international symposium on Field programmable
gate arrays, 2006, pp. 3–11.

[12] D. G. Chinnery and K. Keutzer, “Closing the power gap between


asic and custom: an asic perspective,” in DAC ’05: Proceedings of the
42nd annual conference on Design automation. New York, NY, USA:
ACM, 2005, pp. 275–280.

[13] T. Savell, “The emu10k1 digital audio processor,” Micro, IEEE,


vol. 19, no. 2, pp. 49–57, Mar/Apr 1999.

[14] Xilinx, “Easypath fpgas.” [Online]. Available: http://www.xilinx.


com/products/easypath/index.htm

[15] Altera, “Asic, asics, hardcopy asics with transceivers.”


[Online]. Available: http://www.altera.com/products/devices/
hardcopy-asics/about/hrd-index.html

[16] Pat Mead, private communication, 2008.

[17] eASIC, “nextreme zero mask-charge new asics.” [Online]. Avail-


able: http://www.easic.com/pdf/asic/nextreme_asic_structured_
asic.pdf

[18] L. S. Corporation, “Maco: On-chip structured asic blocks,” 2009.


[Online]. Available: http://www.latticesemi.com/products/fpga/
sc/macoonchipstructuredasicb/

[19] L. Wirbel, “Fpga survey sees sunset for gate arrays,
continued dominance by xilinx, altera,” EE Times, 2008.
[Online]. Available: http://www.eetimes.com/miu/showArticle.
jhtml;jsessionid=QM4Y35UD5BX3AQSNDLPSKHSCJUNN2JVN?
articleID=211200184

[20] STMicroelectronics, “Methodology & design tools.” [Online].


Available: http://www.st.com/stonline/products/technologies/
asic/method.htm

[21] C. Baldwin. Converting fpga designs. [Online]. Available: http:


//www.chipdesignmag.com/display.php?articleId=2545

[22] Application Note 311: Standard Cell ASIC to FPGA Design Methodology
and Guidelines ver 3.0, Altera, 2008.

[23] K. Goldblatt, XAPP119: Adapting ASIC Designs for Use with Spartan
FPGAs, Xilinx, 1998.

[24] Xilinx, “Recorded lectures: Asic user.” [Online]. Available:


http://www.xilinx.com/support/training/rel/asic-user-rel.htm

[25] M. Hutton, R. Yuan, J. Schleicher, G. Baeckler, S. Cheung, K. K.


Chua, and H. K. Phoon, “A methodology for fpga to structured-asic
synthesis and verification,” Design, Automation and Test in Europe,
2006. DATE ’06. Proceedings, vol. 2, pp. 1–6, March 2006.

[26] T. Danzer. (2006) Low-cost asic conversion targets consumer suc-


cess. [Online]. Available: http://www.fpgajournal.com/articles_
2006/20061107_ami.htm

[27] J. Gallagher and D. Locke. (2004, 3) Build complex asics without


asic design expertise, expensive tools. [Online]. Available: http://
electronicdesign.com/Articles/Print.cfm?AD=1&ArticleID=7382

[28] Ian Kuon, private communication, 2009.

[29] P. Metzgen. (2004) Optimizing a high performance 32-bit processor
for programmable logic. [Online]. Available: http://www.cs.tut.fi/
soc/Metzgen04.pdf

[30] G. Bilski, “Re: [fpga-cpu] paul metzgen on multiplexers and the


nios ii pipeline,” 7 2007. [Online]. Available: http://tech.groups.
yahoo.com/group/fpga-cpu/message/2795

[31] Xilinx, XtremeDSP for Virtex-4 FPGAs User Guide UG073


(v2.7), 2008. [Online]. Available: http://www.xilinx.com/support/
documentation/user_guides/ug073.pdf

[32] O. Gustafsson, A. G. Dempster, K. Johansson, M. D. Macleod, and


L. Wanhammar, “Simplified design of constant coefficient multipli-
ers,” Circuits, Systems and Signal Processing, vol. 25, no. 2, pp. 225–
251, 2006.

[33] M. J. Wirthlin, “Constant coefficient multiplication using look-up


tables,” J. VLSI Signal Process. Syst., vol. 36, no. 1, pp. 7–15, 2004.

[34] Atmel, ATC35 Summary. [Online]. Available: http://www.atmel.


com/dyn/resources/prod_documents/1063s.pdf

[35] D. Drako and H.-T. A. Yu, “Apparatus for alternatively accessing


single port random access memories to implement dual port first-in
first-out memory,” U.S. Patent 5 371 877, 12 6, 1994.

[36] B. Flachs, S. Asano, S. Dhong, H. Hofstee, G. Gervais, R. Kim,


T. Le, P. Liu, J. Leenstra, J. Liberty, B. Michael, H.-J. Oh, S. Mueller,
O. Takahashi, A. Hatakeyama, Y. Watanabe, N. Yano, D. Broken-
shire, M. Peyravian, V. To, and E. Iwata, “The microarchitecture of
the synergistic processor for a cell processor,” Solid-State Circuits,
IEEE Journal of, vol. 41, no. 1, pp. 63–70, Jan. 2006.

[37] N. Sawyer and M. Defossez, “Xapp228 (v1.0) quad-port memories


in virtex devices,” 2002. [Online]. Available: http://www.xilinx.
com/support/documentation/application_notes/xapp228.pdf

[38] K. J. McGrath and J. K. Pickett, “Microcode patch device and
method for patching microcode using match registers and patch
routines,” U.S. Patent 6 438 664, 8 20, 2002.

[39] “Opencores.org.” [Online]. Available: http://www.opencores.org/

[40] M. Olausson, A. Ehliar, J. Eilert, and D. Liu, “Reduced floating point


for mpeg1/2 layer iii decoding,” Acoustics, Speech, and Signal Pro-
cessing, 2004. Proceedings. (ICASSP ’04). IEEE International Conference
on, vol. 5, pp. V–209–12 vol.5, May 2004.

[41] Spirit DSP, “Datasheet: Spirit mp3 decoder,” 2009. [Online].


Available: http://www.spiritdsp.com/products/audio_engine/
audio_codecs/mp3/

[42] P. Ahuja, D. Clark, and A. Rogers, “The performance impact of in-


complete bypassing in processor pipelines,” Microarchitecture, 1995.
Proceedings of the 28th Annual International Symposium on, pp. 36–45,
Nov-1 Dec 1995.

[43] J. Hennessy, N. Jouppi, S. Przybylski, C. Rowen, T. Gross, F. Bas-


kett, and J. Gill, “Mips: A microprocessor architecture,” SIGMICRO
Newsl., vol. 13, no. 4, pp. 17–22, 1982.

[44] J. Gray, “Building a risc system in an fpga part 2,” Circuit Cellar, vol.
117, 2000.

[45] Xilinx, MicroBlaze Processor Reference Guide UG081 (v9.0),


2008. [Online]. Available: http://www.xilinx.com/support/
documentation/sw_manuals/mb_ref_guide.pdf

[46] Altera, Nios II Processor Reference Handbook, Internet, 2007.

[47] Lattice, LatticeMico32 Processor Reference Manual, Internet, 2007.

[48] J. Nurmi, Processor Design - System-On-Chip Computing for ASICs and


FPGAs. Springer, 2007.

[49] P. Metzgen, “A high performance 32-bit alu for programmable
logic,” in FPGA ’04: Proceedings of the 2004 ACM/SIGDA 12th interna-
tional symposium on Field programmable gate arrays. New York, NY,
USA: ACM, 2004, pp. 61–70.

[50] G. Research, The LEON processor user’s manual, Internet, 2001.

[51] D. Lampret, OpenRISC 1200 IP Core Specification, 2001.

[52] Göran Bilski, private communication, 2008.

[53] G. Govindu, L. Zhuo, S. Choi, and V. Prasanna, “Analysis


of high-performance floating-point arithmetic on fpgas,” in
Parallel and Distributed Processing Symposium, 2004. Proceedings.
18th International, 2004, pp. 149+. [Online]. Available: http:
//ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=1303135

[54] R. Andraka, “Supercharge your dsp with ultra-fast floating-


point ffts,” DSP magazine, no. 3, pp. 42–44, 2007. [Online].
Available: http://www.xilinx.com/publications/magazines/dsp_
03/xc_pdf/p42-44-3dsp-andraka.pdf

[55] Xilinx, Floating-Point Operator v3.0, 3rd ed., Internet, Xilinx,


www.xilinx.com, September 2006.

[56] Nallatech, Nallatech Floating Point Cores, Internet, Nallatech,


www.nallatech.com, 2002.

[57] J. Detrey and F. de Dinechin, “A parameterized floating-point


exponential function for fpgas,” in Field-Programmable Technology,
2005. Proceedings. 2005 IEEE International Conference on, 2005, pp.
27–34. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_
all.jsp?arnumber=1568520

[58] J. Eilert, A. Ehliar, and D. Liu, “Using low precision floating point
numbers to reduce memory cost for mp3 decoding,” Multimedia Sig-
nal Processing, 2004 IEEE 6th Workshop on, pp. 119–122, 29 Sept.-1
Oct. 2004.

[59] R. Rojas, “Konrad Zuse’s legacy: The architecture of the z1 and z3,”
IEEE Annals of the history of computing, vol. 19, 1997.

[60] W. Dally and B. Towles, Principles and Practices of Interconnection Net-


works. Morgan Kaufmann, 2004.

[61] D. Wiklund and D. Liu, “Design of a system-on-chip switched net-


work and its design support,” Communications, Circuits and Sys-
tems and West Sino Expositions, IEEE 2002 International Conference on,
vol. 2, no. 29, 2002.

[62] Xilinx, “Xilinx ip center.” [Online]. Available: http://www.xilinx.


com/ipcenter/index.htm

[63] “Wishbone system-on-chip (soc) interconnection architecture for


portable ip cores,” 2002. [Online]. Available: http://www.
opencores.org/

[64] W. J. Dally and B. Towles, “Route packets, not wires: On-chip


interconnection networks,” in Design Automation Conference, 2001,
pp. 684–689. [Online]. Available: citeseer.ist.psu.edu/dally01route.
html

[65] K. Goossens, J. Dielissen, and A. Radulescu, “Aethereal network on


chip: concepts, architectures, and implementations,” Design & Test
of Computers, IEEE, vol. 22, 2005.

[66] D. Bertozzi and L. Benini, “Xpipes: a network-on-chip architecture


for gigascale systems-on-chip,” Circuits and Systems Magazine, IEEE,
vol. 4, 2004.

[67] L. Braun, M. Hubner, J. Becker, T. Perschke, V. Schatz, and S. Bach,


“Circuit switched run-time adaptive network-on-chip for image
processing applications,” Field Programmable Logic and Applications,
2007. FPL 2007. International Conference on, pp. 688–691, 27-29 Aug.
2007.

[68] T. Bartic, J.-Y. Mignolet, V. Nollet, T. Marescaux, D. Verkest, S. Ver-
nalde, and R. Lauwereins, "Highly scalable network on chip for re-
configurable systems," in System-on-Chip, 2003. Proceedings. Interna-
tional Symposium on, 2003.

[69] N. Kapre, N. Mehta, M. deLorimier, R. Rubin, H. Barnor, M. J.


Wilson, M. Wrighton, and A. DeHon, “Packet switched vs. time
multiplexed fpga overlay networks,” IEEE Symposium on Field-
programmable Custom Computing Machines, 2006.

[70] C. Hilton and B. Nelson, “Pnoc: a flexible circuit-switched noc


for fpga-based systems,” Computers and Digital Techniques, IEE
Proceedings-, vol. 153, 2006.

[71] A. Janarthanan, V. Swaminathan, and K. Tomko, “Mocres: an area-


efficient multi-clock on-chip network for reconfigurable systems,”
VLSI, 2007. ISVLSI ’07. IEEE Computer Society Annual Symposium on,
pp. 455–456, 9-11 March 2007.

[72] M. Schoeberl, “A time-triggered network-on-chip,” Field Pro-


grammable Logic and Applications, 2007. FPL 2007. International Con-
ference on, pp. 377–382, 27-29 Aug. 2007.

[73] Xilinx, “Xilinx design language,” help/data/xdl/xdl.html in ISE 6.3,


2000.

[74] ——, “Chipscope pro.” [Online]. Available: http://www.xilinx.


com/ise/optional_prod/cspro.htm

[75] ——, “Jbits sdk.” [Online]. Available: http://www.xilinx.com/


products/jbits/

[76] A. Megacz, “A library and platform for fpga bitstream manipula-


tion,” Proceedings of IEEE Symposium on Field-Programmable Custom
Computing Machines (FCCM’07), 2007.
Appendix A: ASIC Porting Guidelines

• Adders with more than two inputs are typically more area efficient
in ASICs than in FPGAs. (See Section 5.5)

• While multiplexers are very costly in an FPGA, they are quite cheap
in an ASIC. Optimizing the mux structure in the FPGA based de-
sign will have little impact on an ASIC port. (See Section 5.6)

• When porting an FPGA optimized design to an ASIC it may be


possible to increase the performance of the ASIC by adding muxes
to strategic locations such as for example by replacing a bus with a
crossbar. (See Section 5.6)

• While a lot of performance and area can be gained in an FPGA by


merging as much functionality into one LUT as possible, this will
typically not decrease the area cost or increase the performance of
an ASIC port. (See Section 5.7)

• If a design is specifically optimized for the DSP blocks in an FPGA,
the ASIC port is likely to have performance problems. The datapath
with multipliers may have to be completely rewritten to correct
this. (See Section 5.8)

• Create wrapper modules for memory modules so that only the
wrappers have to be changed when porting the design to a new
technology. (See Section 5.9)

• Avoid large dual port memories if it is possible to do so without


expensive redesigns. (See Section 5.9.1)

• If many small register file memories with only one write port and
few read ports are used in a design, the area cost of an ASIC port
will be relatively high compared to the area cost of the FPGA
version. On the other hand, if more than one write port is required,
the ASIC port will probably be much more area efficient. (See
Section 5.9.2)

• If some of the memories can be created with a ROM compiler, the
area savings in an ASIC port will be substantial. (See Section 5.9.3)

• Avoid relying on initialization of RAM memories at configuration


time in the FPGA version of a design. (See Section 5.9.4)

• When porting a design to an ASIC, consider if specialized memo-


ries like CAM memories can give significant area savings or per-
formance boosts. (See Section 5.9.5)

• Consider if special memory options that are unavailable in FPGAs,


like write masks, can improve the design in any way. (See Sec-
tion 5.9.5)

• It is possible to port a design with instantiated FPGA primitives


using a small compatibility library. For adders and subtracters, the
performance will be adequate unless they are a part of the critical
path in the ASIC. If this approach is used it is imperative that the
modules with instantiated FPGA primitives are flattened during
synthesis and before the optimization phase! (See Section 5.10)

• Manual floorplanning of an FPGA design will not have any impact


on an ASIC port unless the design is modified to simplify floor-
planning in the FPGA. (See Section 5.11)

• While pipelining an FPGA design will certainly not hurt the max-
imum frequency of an ASIC, the area of the ASIC will often be
slightly larger than necessary, especially if the pipeline is not a part
of the critical path in the ASIC. Designs that contain huge numbers
of delay registers will be especially vulnerable to such area ineffi-
ciency. (See Section 5.12)

Part VI

Papers
Paper I

Using low precision floating point numbers to reduce
memory cost for MP3 decoding

Johan Eilert, Andreas Ehliar, Dake Liu


Department of Electrical Engineering
Linköping University
Sweden
email: {perk,ehliar,dake}@isy.liu.se

Portions reprinted, with permission, from International Workshop on Multimedia Signal Processing,

2004. Using low precision floating point numbers to reduce memory cost for MP3 decoding, Eilert, J. Ehliar, A.

Liu, D. (©2004 IEEE)

This paper has been reformatted from double column to single column format for ease of readability.

Abstract
The purpose of our work has been to evaluate if it is practical to use a 16-
bit floating point representation to store the intermediate sample values
and other data in memory during the decoding of MP3 bit streams. A
floating point number representation offers a better trade-off between
dynamic range and precision than a fixed point representation for a given
word length. Using a floating point representation means that smaller
memories can be used which leads to smaller chip area and lower power
consumption without reducing sound quality. We have designed and
implemented a DSP processor based on 16-bit floating point intermediate
storage. The DSP processor is capable of decoding all MP3 bit streams at
20 MHz and this has been demonstrated on an FPGA prototype.

1 Introduction
MPEG-1 layer III [1], commonly referred to as MP3, is well understood,
both on desktop systems and in embedded systems. Decoders for desk-
top systems can be implemented using either fixed point or floating point
arithmetic, whereas embedded systems typically use fixed point arith-
metic.
Embedded MP3 decoders usually have to use two 16-bit memory
words for each intermediate value to achieve the required dynamic range
and precision with fixed point arithmetic. We have investigated the feasi-
bility of using a 16-bit floating point representation to reduce the memory
cost without sacrificing sound quality. This would halve the data mem-
ory usage which would have a significant impact on power consump-
tion and chip area. Another advantage with floating point arithmetic is
that the hardware eliminates all scaling operations associated with fixed
point arithmetic which leads to shorter firmware development time.
One drawback of floating point is the complexity of the arithmetic
units. However, for a given dynamic range, the multiplier in a floating
point data path is smaller than the corresponding multiplier in a fixed

170
point data path.
In order to evaluate our floating point approach, we have used the
MPEG audio compliance test [3]. In short, a decoder can be classified as
full precision, limited accuracy, or not compliant depending on the difference
between the provided reference output and the decoded output. We have
also conducted informal listening tests since there are no formal criteria
for evaluating the quality of an MP3 decoder for an arbitrary bit stream.

2 Floating Point Requirements


In order to design a system with floating point arithmetic, two important
design decisions of the system have to be made. One is the floating point
format which decides the range and precision of all values that can be
handled by the system. The other decision is the arithmetic operations
that should be supported in hardware for a given target application.

2.1 The Floating Point Format


Although it is possible to analytically determine the maximum values
encountered in an MP3 decoder, this information is not really useful. For
example, by setting the gain and scale factors to their maximum values,
it is possible to create a synthetic MP3 bit stream where the final output
samples are orders of magnitude larger than the allowed output range. Because
of this, we did not try to perform any formal analysis of the possible
number ranges occurring in MP3 decoding.
Instead, we instrumented the ISO MP3 decoder [2] to use our own
custom floating point arithmetic library with configurable mantissa and
exponent widths. The library also supported arithmetic with mixed pre-
cision in order to mimic a processor with high precision data path but
lower precision memory. By keeping track of the smallest and largest
values encountered in the decoder, the library was used for determining
the required dynamic range.
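The instrumentation can be sketched as follows (a hedged illustration: the class and method names are ours, and the real library also mimicked mixed-precision arithmetic, which is omitted here):

```python
import math

class RangeTracker:
    """Track the smallest and largest nonzero magnitudes seen during
    decoding, as the instrumented arithmetic library did."""

    def __init__(self):
        self.min_mag = float("inf")
        self.max_mag = 0.0

    def observe(self, x):
        if x != 0.0:
            m = abs(x)
            self.min_mag = min(self.min_mag, m)
            self.max_mag = max(self.max_mag, m)
        return x

    def required_exponent_bits(self):
        """Exponent bits needed to cover the observed range: one exponent
        code per binade, plus one code reserved for zero."""
        lo = math.floor(math.log2(self.min_mag))
        hi = math.floor(math.log2(self.max_mag))
        codes = hi - lo + 1 + 1
        return max(1, math.ceil(math.log2(codes)))
```

For the memory range of roughly 2^−26 to just under 2^5 discussed below, this counting argument yields the 5-bit exponent that was chosen, and the register range of 2^−42 to 2^21 yields the 6-bit exponent.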
Our goal was to find an exponent configuration where all MP3 bit
streams could be decoded without having to saturate any intermediate
value. We did not consider hand-crafted bit streams with extreme values
but we tested more than 200 different music and speech bit streams.

We concluded that all normal bit streams could be decoded success-
fully with an exponent size of 5 bits in data memory. The exponent bias
was selected to give a number range of approximately 2^−26 to 2^5 which
would correspond to the dynamic range of a 32-bit fixed point processor.

In order to simplify the hardware, we used the same bias for register
values, but we had to increase the exponent to 6 bits to accommodate
larger intermediate values. The register number range is 2^−42 to 2^21. The
larger exponent of the registers simplified software development.

While the choice of exponent influences the magnitude of the floating
point values, the size of the mantissa corresponds to the number of sig-
nificant digits in the calculations. A larger mantissa leads to higher pre-
cision, but also larger memories for intermediate storage. It is therefore
important to determine the minimal size that gives acceptable results.
This can be determined through listening tests, or numerical methods,
such as the one used for MP3 decoder compliance testing.

An MP3 decoder is tested by decoding a bit stream supplied in the
compliance test and comparing the output with a supplied reference out-
put. If the rms of the difference is less than 8.8·10^−6 and the absolute
difference is less than 2^−14 relative to full scale for all samples, the de-
coder is classified as a full precision decoder. Otherwise, if the rms of
the difference is less than 1.4·10^−4 regardless of the maximum absolute
difference, the decoder is classified as a limited accuracy decoder. If the
decoder fails to meet these criteria, the decoder is not compliant.
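The classification rule can be written out directly (a sketch: the function name is ours, and both sequences are assumed normalized so that full scale equals 1.0):

```python
import math

def classify_decoder(reference, decoded):
    """Apply the compliance criteria described above to two equally long
    sample sequences normalized to full scale = 1.0."""
    diffs = [r - d for r, d in zip(reference, decoded)]
    rms = math.sqrt(sum(e * e for e in diffs) / len(diffs))
    max_abs = max(abs(e) for e in diffs)
    if rms < 8.8e-6 and max_abs < 2 ** -14:
        return "full precision"
    if rms < 1.4e-4:
        return "limited accuracy"
    return "not compliant"
```

Note that a decoder whose rms error is small but whose worst-case sample error exceeds 2^−14 still only qualifies as limited accuracy.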

The compliance level for different sizes of the mantissa was investi-
gated and the result is given in Fig. 1. The exponent sizes used were 6
and 5 in registers and memory respectively.

[Figure 1 plot: compliance classification (not compliant, limited accuracy,
or full precision) for memory mantissa sizes 10-19 versus register man-
tissa sizes 6-19; the implicit leading "1." is not included in either man-
tissa size.]

Figure 1: A comparison of the compliance results for different mantissa
sizes in the ISO decoder.

2.2 Operations
An analysis of the ISO MP3 decoder shows that the following floating
point operations should be supported in hardware to implement an effi-
cient MP3 decoder.

• Add

• Subtract

• Multiply

• Round (Before saving to memory)

• Load floating point value

• Floating point to integer conversion

These operations can be mapped to a floating point adder and a float-
ing point multiplier. All remaining operations can be reduced to these
primitives or implemented as table look-ups. Because the memory and
registers have different word lengths it is necessary to convert between
different floating point formats. The round operation converts from the
register word length to the memory word length, and the floating point
load operation expands a memory word to a register word.

3 Hardware Implementation

As a proof of concept, we developed a simple pipelined DSP core to
demonstrate the feasibility of the approach outlined above. The DSP
core is a
load-store architecture with separate program, data, and constant memo-
ries. The general idea was to keep the hardware reasonably simple with-
out making the software unreasonably complex. In our experience, soft-
ware is generally easier to debug than hardware. The instruction set was
kept to a minimum and the hardware had no inter-instruction depen-
dency checking.

3.1 Data types

Each general purpose register can contain a 16-bit integer or a 23-bit float-
ing point value. In the former case, the upper 7 bits are unused. When
a floating point value is loaded from memory it is expanded from 16 to
23 bits. Before storing a floating point value it is rounded to 16 bits. The
data types are summarized in Fig. 2.
The most important reason for using these word lengths is to avoid a
configuration where the decoder barely meets the requirements for lim-
ited accuracy. Another reason is the convenience of having a 16-bit wide
memory.

Register integer data type:

bit:  22 ... 16    15 ... 0
      (unused)     value (signed or unsigned)

Register floating point data type:

bit:  22     21 ... 16    15 ... 0
      sign   exponent     mantissa
             (signed)     (unsigned)

exponent ∈ [−32, 31], mantissa ∈ [0, 65535].

The number x = (−1)^sign · 2^(exponent−11) · (1 + mantissa/65536),
except when exponent = −32, for which x = 0.

Memory floating point data type:

bit:  15     14 ... 10    9 ... 0
      sign   exponent     mantissa
             (signed)     (unsigned)

exponent ∈ [−16, 15], mantissa ∈ [0, 1023].

The number x = (−1)^sign · 2^(exponent−11) · (1 + mantissa/1024),
except when exponent = −16, for which x = 0.

Figure 2: The main data types in our DSP.
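The memory format in Fig. 2 can be sketched in executable form (a hedged illustration: the function name is ours, and the register format differs only in its 6-bit exponent and 16-bit mantissa):

```python
def decode_mem_float(word):
    """Decode the 16-bit memory floating point format of Fig. 2.

    Bit 15 is the sign, bits 14..10 the signed exponent in [-16, 15]
    (with -16 reserved for zero), and bits 9..0 the mantissa.
    """
    sign = (word >> 15) & 1
    exponent = (word >> 10) & 0x1F
    if exponent >= 16:                   # sign-extend the 5-bit field
        exponent -= 32
    mantissa = word & 0x3FF
    if exponent == -16:                  # reserved encoding for zero
        return 0.0
    return (-1.0) ** sign * 2.0 ** (exponent - 11) * (1 + mantissa / 1024)
```

With the bias of 11 from Fig. 2, the smallest and largest magnitudes come out as 2^−26 and just under 2^5, matching the range given in Sec. 2.1.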

3.2 Instruction Set

The instruction set basically consists of load and store operations for any
of the general purpose registers, register to register integer and floating
point operations, and I/O operations.
There are 16 general purpose registers. This number was decided
upon after studying the algorithms used in MP3 decoding. It allowed
us to keep all intermediate values in registers for the most important
algorithms.
There is a hardware stack for saving the program counter during sub-
routine calls. Conditional branches are limited to branch-if-zero and
branch-if-not-zero.
There are a few application specific instructions. The Huffman de-
coder part is accelerated by bit access instructions, and some signal pro-
cessing parts are accelerated with a MAC (multiply-and-accumulate) in-
struction. The address generation capabilities are in most cases limited to
absolute or register indirect, but the bit access instructions and the float-
ing point MAC instruction can use the single dedicated address register
with auto-increment and modulo addressing.
The integer pipeline has five pipeline stages, and the floating point
pipeline has eight stages. The pipelines share fetch, decode, and write-
back stages. In hindsight, the pipeline could have been shorter.
RTL code for the DSP was written in VHDL and tested on an FPGA
prototype board. The estimated gate count, excluding memories, is 32500 gates
when synthesized for Leonardo Spectrum’s sample SCL05u technology.
There is room for improvement in the RTL code, especially in the instruc-
tion decoder.

4 Software Implementation

We decided to implement a new MP3 decoder from scratch rather than
building upon the ISO MP3 decoder. This was done partly to learn as
much as possible about MP3 decoding and partly because we felt that
the ISO MP3 decoder was too complex and inefficient. This new decoder
was then used as our internal reference during the assembly code devel-
opment for the DSP.

4.1 Algorithms
In order to achieve high performance with a deep pipeline and a limited
instruction set, algorithms had to be carefully written. Since the integer
part of a register is used as the mantissa in a floating point value, some
operations can be accelerated by manipulating the mantissa directly. For
example, integer shift and integer to floating point conversion can be
implemented by using the floating point subtract instruction.
The Huffman decoder uses a simple, one bit at a time, tree traversal
technique. This approach is memory inefficient but reasonably fast since
each tree node is one instruction.
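The one bit at a time traversal can be sketched like this (an illustration: the nested-tuple tree encoding and the function name are ours, not the actual firmware's):

```python
def huffman_decode_symbol(tree, next_bit):
    """Walk the Huffman tree one bit at a time, as described above.
    Internal nodes are (zero_branch, one_branch) tuples; leaves are the
    decoded symbols. next_bit() returns the next bit of the bit stream."""
    node = tree
    while isinstance(node, tuple):
        node = node[next_bit()]
    return node
```

In the firmware each internal node costs one instruction, which is why the approach is fast but memory inefficient for large tables.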
The x^(4/3) calculation in the sample dequantization can be implemented
with a large look-up table with more than 8000 entries. We used a fifth
order polynomial approximation for the mantissa and a table look-up for
the exponent. Finally, a look-up table was used for small values in the
range [−15, 15] to accelerate this common case.
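The mantissa/exponent split can be illustrated as follows (a sketch only: math.pow stands in for the fifth order polynomial, the small-value table is omitted, and the helper name is ours):

```python
import math

def pow_4_3(x):
    """Compute x**(4/3) for x >= 0 by splitting x = m * 2**e and handling
    the mantissa and exponent parts separately, mirroring the split
    described above."""
    if x == 0:
        return 0.0
    m, e = math.frexp(x)                 # x = m * 2**e with 0.5 <= m < 1
    q, r = divmod(4 * e, 3)              # 2**(4e/3) = 2**q * 2**(r/3)
    frac_table = (1.0, 2 ** (1 / 3), 2 ** (2 / 3))   # table look-up part
    return math.pow(m, 4 / 3) * math.ldexp(frac_table[r], q)
```

The point of the split is that only the bounded mantissa part needs an approximation, while the exponent part reduces to a three-entry table and a shift.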
The 36-point inverse modified DCT, IMDCT, was implemented using
a fast IMDCT algorithm [4] and the 12-point IMDCT was implemented
using 36 floating point multiply and accumulate instructions.
The 32-point DCT used in the subband synthesis part was imple-
mented using Lee’s fast DCT algorithm [5]. With careful scheduling, the
16-point kernel could be implemented in registers only, without loading
or storing temporary values to memory.

4.2 Quality
According to the MP3 compliance test, our decoder is classified as a lim-
ited accuracy MPEG-1 Layer III decoder. The rms of the difference be-
tween our decoded output and the reference provided with the compli-
ance test is 3.2·10^−5 which is well below the limit for limited accuracy,
1.4·10^−4.
Even though our decoder is not a full precision layer III decoder, in
informal listening tests listeners could not distinguish files decoded with
our decoder from files decoded with the full precision ISO MP3 decoder.

[Figure 3 chart: MIPS per decoder stage (bitstream parsing, read samples,
restore samples, calculate stereo, reorder samples, aliasing reduction,
IMDCT, frequency inversion, DCT, windowing, output PCM, misc), broken
down into floating point instructions, floating point MAC, Huffman
instructions, and integer/control flow/I/O.]

Figure 3: Profiling of the decoder while decoding a typical MP3 bit
stream. (14.6 MIPS in total.)

[Figure 4 chart: the same per-stage MIPS breakdown as Figure 3, for the
worst case bit stream.]

Figure 4: Profiling of the decoder while decoding the worst case MP3 bit
stream. (19.6 MIPS in total.)

4.3 Memory Use

The final version of the decoder used approximately 6800 24-bit words
for program memory, 900 23-bit words for the constant memory, and
6100 16-bit words for data memory. We have not spent any time try-
ing to reduce the program memory size. More than 40% of the program
memory is used for the Huffman tables.

4.4 Performance
In order to measure the performance of the decoder on a typical MP3 bit
stream we used a 44.1 kHz music bit stream, with an average bit rate of
202 kbps. A profile of the decoder is shown in Fig. 3.
The time spent in the Huffman decoding and sample dequantization
is data dependent. A bit stream was constructed to trigger worst case
execution time in the data dependent parts. In our case, this consisted of
a 48 kHz bit stream using only short blocks and joint-stereo. By selecting
the right Huffman table, a maximum number of big values could be fitted
into a frame to stress the sample dequantization. The resulting worst
case execution path requires 19.6 MIPS to sustain a real time decoding
process. The worst case profile is shown in Fig. 4.

5 Future Work
The focus of this work has so far been on the effects of using floating
point arithmetic. Therefore, we have not put very much effort into opti-
mizing the instruction set beyond what is needed to support the required
floating point operations. Future improvements could include hardware
assisted loops, and better address generation such as general support for
pointer auto-increment. It would be relatively easy to implement a sim-
ple Huffman accelerator unit that would both significantly reduce the
size of the Huffman tables and speed up the Huffman decoder.
We investigated the word lengths required for full precision, but only
in the ISO MP3 decoder, as shown in Fig. 1. It would be interesting to
verify that full precision can be achieved also in our MP3 decoder by
increasing the width of the floating point data types.
Finally, it would be very interesting to know if anything could be
gained by implementing an MP3 encoder or other audio coding stan-
dards such as Ogg Vorbis and AAC using a similar floating point scheme.

Program memory          6800 words (24-bit)
Data memory             6100 words (16-bit)
Constant memory         900 words (23-bit)
Clock frequency         20 MHz
Gate count              32500
MIPS cost (worst case)  19.6 MIPS
MIPS cost (typical)     14.6 MIPS
Compliance              Limited accuracy
                        (rms is 3.2·10^−5)

Figure 5: Performance of our MP3 decoder.

6 Conclusions
Our MP3 decoder stores intermediate data in a 16-bit floating point for-
mat to limit memory usage. It is classified as a limited accuracy ISO/IEC
11172-3 MPEG-1 layer III decoder.
The hardware has been implemented in VHDL and it has been tested
on an FPGA prototype board. The gate count, excluding memories, is
32500 gates when synthesized for Leonardo Spectrum’s sample SCL05u
technology. A clock frequency of 20 MHz is enough to decode all bit
streams.
The performance of the decoder is summarized in Fig. 5. We see some
possible improvements that could reduce the program memory size and
increase the performance.

References

[1] ISO/IEC, “Information Technology — Coding of Moving Pictures
and Associated Audio for Digital Storage Media at up to About
1.5 Mbit/s, Part 3: Audio,” 1993

[2] “ISO MP3 sources (dist10),”
ftp://ftp.tnt.uni-hannover.de/pub/MPEG/audio/mpeg2/software/
technical_report/dist10.tar.gz

[3] ISO/IEC, “Information Technology — Coding of Moving Pictures
and Associated Audio for Digital Storage Media at up to About
1.5 Mbit/s, Part 4: Compliance Testing,” 1995

[4] Lee, S.-W., “Improved algorithm for efficient computation of the
forward and backward MDCT in MPEG audio coder,” IEEE Transactions
on Circuits and Systems II: Analog and Digital Signal Processing,
Vol. 48, Iss. 10, Oct 2001

[5] Lee, B., “A new algorithm to compute the discrete cosine transform,”
IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 32,
Iss. 6, Dec 1984

Paper II

An FPGA based Open Source
Network-on-chip Architecture

Andreas Ehliar and Dake Liu


Department of Electrical Engineering
Linköping University
Sweden
email: {ehliar,dake}@isy.liu.se

Portions reprinted, with permission, from Field Programmable Logic and Applications, 2007. FPL 2007.
International Conference on, An FPGA Based Open Source Network-on-Chip Architecture, Ehliar, A., Liu,
D. (© 2007 IEEE)

This paper has been reformatted from double column to single column format for ease of readability.

Abstract
Networks on Chip (NoC) have long been seen as a potential solution to the
problems encountered when implementing large digital hardware de-
signs. In this paper we describe an open source FPGA based NoC archi-
tecture with low area overhead, high throughput and low latency com-
pared to other published works. The architecture has been optimized
for Xilinx FPGAs and the NoC is capable of operating at a frequency
of 260 MHz in a Virtex-4 FPGA. We have also developed a bridge so that
generic Wishbone bus compatible IP blocks can be connected to the NoC.

1 Introduction
As chip manufacturing techniques continue to improve, more complex
systems are being designed. Designing such a large system is not
easily done, however, and much research from both academia and indus-
try is focused on this problem. One of the problems encountered is how
to handle the on-chip interconnections between different modules.
One promising solution to the on-chip interconnection problem is the
Networks on Chip (NoC) paradigm which has seen a lot of research
lately. A thorough review of the concepts involved in NoCs is outside
the scope of this article and we refer readers unfamiliar with the topic
to [1].
Most publications in this research area are targeting ASICs, with only
a few publications considering the problems and opportunities of an
FPGA based NoC. However, as entry level FPGAs are increasing in
size, interest in NoCs for FPGAs will also increase in both academia and
industry.
In this paper we present an open source NoC architecture. The ar-
chitecture, which is optimized for the Virtex-4 FPGA family, is based
on packet switching with wormhole routing. In addition, we have also
developed a bridge which allows Wishbone compatible components to
communicate over the NoC.

2 Background
Networks-on-chip has been a popular research area for some time now.
An early paper that discusses the advantages of an ASIC based NoC
compared to more traditional approaches is [2]. Other well known
ASIC based NoC research projects include the Æthereal project [3] and
the xpipes project [4].

2.1 FPGA based NoCs


While the majority of NoC publications are discussing ASIC based NoC,
there are some publications that explicitly deal with FPGA based NoCs,
an early one is [5] where a packet switched NoC is studied on the Virtex-
II Pro FPGA. One of the main goals of this NoC is that it should be
usable in a dynamically reconfigurable system.
A recent example of an FPGA based NoC is described in [6] in which
the authors describe a packet switched NoC running on a Virtex-II and
compares it to a statically scheduled NoC on the same FPGA. A circuit
switched NoC for FPGAs named PNoC is described in [7].
Another recent example of an FPGA based NoC is NoCem [8] which
is a NoC aimed at multicore processors in an FPGA. The source code for
NoCem is also available on the Internet [9].
The interested reader can find a survey of some additional FPGA
based NoCs in [10]. It also includes comparison with other interconnect
architectures such as buses.

3 Our NoC architecture


The main goals of the NoC architecture described in this paper are high
throughput, low latency (especially for small messages), and low area
overhead. Another goal is to make it possible to easily interface a stan-
dard bus protocol such as Wishbone to it. A third goal is that there should
be a certain amount of flexibility with regard to the choice of topology.

Table 1: A list of the data and control signals that exist in a link between
two NoC switches.
Name and     Width  Description
direction
Strobe →     1      Valid data is present
Data →       36     Used as data signals
Last →       1      Last data in a transaction
Dest →       5      Address of destination node
Route →      3-4    Destination port on the switch (one hot coded)
Ready ←      1      Signals that the remote node is ready to receive data

The authors’ experience from SoCBUS [11] also indicates that the large
latency involved in transmitting small messages can be a huge problem
in a real system. Since it is critical to be able to handle small messages in
a system where a standard bus is connected to a NoC, the architecture
presented in this paper is based upon packet switching. Wormhole rout-
ing is used to avoid the need for large packet buffers and to reduce the
latency.
We have mostly used 2D meshes during simulation and hardware de-
velopment although almost any topology is possible, as long as a dead-
lock free routing algorithm is used. (A discussion on deadlock free rout-
ing algorithms is outside the scope of this paper, the interested reader is
referred to for example [1].)
The signaling used on the NoC is shown in Table 1.

3.1 Input part


An incoming packet is first buffered in an input FIFO. As long as an out-
put port is available, the input FIFO will be emptied as fast as it can be
filled. However, if no output port is available, the input FIFO
will quickly fill up. To avoid overruns, the input module will signal the
sender that no further data should be sent as soon as only a few entries
are left. This is required because the pipeline latency will cause addi-
tional entries to be written before the sender can react.
This FIFO is efficiently implemented by using the SRL16 components
of the Virtex-4. Due to the high delay in the SRL16 outputs, it is necessary
to minimize the logic between the output of the SRL16 and the following
flip-flop. Therefore, only a simple routing decision is performed in this
stage. Since our current architecture supports 32 destination nodes, one
five-input look-up table per output port is enough to make a routing
decision. Unfortunately this does not take into account that the FIFO
might be empty and contain stale destination data. In order to handle
this situation, the route look-up also has to know whether the input FIFO
is empty or not. Adding this logic increased the critical path beyond
what was deemed acceptable. Therefore, in order to shorten the critical
path of the route look-up, the NoC architecture was modified so that a
route look-up is instead performed in the previous switch. The result of
a route look-up is then sent using one hot coding to the next switch.
The other critical path of the input part is the read enable signal
of the input FIFO. In order to keep the latency down, the read enable
signal is generated by looking at the destination port of all other input
ports. If no other input port is trying to communicate with the selected
output port and the output port is ready to send, the packet will be sent
immediately.
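The route look-up that the previous switch performs before forwarding a packet amounts to the following (a sketch: the function name and the routing table contents are invented):

```python
def route_lookup(dest, routing_table):
    """Map a 5-bit destination node address (32 nodes supported) to a
    one hot coded output port, as computed in the previous switch and
    sent alongside the packet."""
    port = routing_table[dest & 0x1F]    # 32 destination nodes
    return 1 << port
```

Precomputing the one hot coded port in the previous switch is what keeps the per-output-port decision down to a single five-input look-up table.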

3.2 Output part

Once the first part of a packet is available in the input FIFO, the arbiter of
the selected output port will be notified. If the port is already busy or if
several input ports are trying to send at once, the arbiter uses round robin
arbitration to choose the next packet to be sent once the current sender is
finished. The arbitration is therefore distributed between the input port
where the read enable signal has to be generated without waiting a clock
cycle on the arbiter. If the output port is available and no other input
port is trying to send to this port, the arbiter will allocate the output
port for the duration of the incoming packet.

Besides the arbiters, only one mux for each output port is needed. A
small logic depth optimization that has been done is to move a small
portion of the arbiter into the output mux. This can be done because a
4-to-1 mux only uses three of the available inputs on the two LUTs that
are required to implement such a mux. It should be noted that the output
mux is not connected to all input ports since messages are not supposed
to be routed back to the same port they arrived on.
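The round robin selection rule can be sketched as follows (an illustration: in the switch the arbitration is distributed across the input ports rather than a single function call):

```python
def round_robin_pick(requests, last_granted):
    """Return the index of the next requesting input port after
    last_granted, wrapping around; None if nobody is requesting."""
    n = len(requests)
    for offset in range(1, n + 1):
        idx = (last_granted + offset) % n
        if requests[idx]:
            return idx
    return None
```

Starting the search just past the last granted port is what guarantees that no requesting port is starved.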

4 Wishbone bridge

In addition to the NoC architecture described above we have also devel-
oped an interface that allows Wishbone [12] compatible components to
be easily connected to our NoC architecture. The protocol that is used to
communicate over the NoC is summarized in Table 2. The data-flow in
the bridge is shown in Fig. 1.

In order to be able to operate a bus connected to the Wishbone side
of the bridge at a high clock frequency it is important that as many sig-
nals as possible are registered before being allowed to enter or leave the
Wishbone bus. This causes problems during a write burst, because the
bridge does not know beforehand whether a slave will acknowledge a
transaction or not. This means that it is necessary to use the unregis-
tered acknowledgment signal. The usage of this signal is shown in Fig. 1
where it is used to trigger the CE input of the data and address flip-flops.
For all other uses of the acknowledgment signal, the registered version
is used. In the current version of the bridge a few other control signals
are also sparingly used in their unregistered version but the acknowl-
edgment signal is the most critical of these.

Table 2: The protocol used by the Wishbone bridge. A write request
packet can contain up to N words, a read request packet will always
contain 2 words, and a read reply can contain up to M words (M ≤ 31).

Request  Word  Bits   Value
type
Write    0     35:34  “00” (Write request)
         0     29:0   Address (in 32 bit words)
         1..N  35:32  Byte select signals
         1..N  31:0   Data
Read     0     35:34  “01” (Read request)
         0     29:0   Address (in 32 bit words)
         1     34:30  Number of requested words
         1     29:26  Byte selects for non burst read
         1     25:21  Source node address
         1     20:18  Request ID
Read     1..M  35     “1” (Read reply)
reply    1..M  34:32  The request ID of this read request
         1..M  31:0   Data
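Packing a write request into 36-bit link words per Table 2 can be sketched as follows (the function name and defaults are ours; the field positions follow the table):

```python
def build_write_request(address, data_words, byte_select=0xF):
    """Build the 36-bit NoC words of a write request: word 0 carries the
    request type "00" and the word address, and words 1..N each carry
    the byte select signals and 32 bits of data."""
    words = [(0b00 << 34) | (address & 0x3FFFFFFF)]
    for d in data_words:
        words.append(((byte_select & 0xF) << 32) | (d & 0xFFFFFFFF))
    return words
```

Because the request type of a write is “00”, the first word of a write request is simply the 30-bit word address.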

4.1 Deadlock avoidance


In order to avoid deadlocks we must make sure that all messages will be
accepted. As a counterexample, consider a system with two nodes that
have sent a large number of read requests to each other. If too many read
requests are present in the network, there would be no space available in
the network for the replies to these read requests and no further progress
could be made.
We have solved this problem by having a short queue for read re-
quests in the Wishbone bridge. As soon as a read request is received from
the NoC, no new incoming Wishbone transactions will be accepted. This
queue can be sized so that it is guaranteed that it cannot ever be filled in
a given system.

[Figure 1 diagram: the input FIFO, read request FIFO, address generator,
and route lookup feeding the Wishbone address and data outputs; the
Wishbone acknowledgment gates the CE inputs of the address and data
flip-flops.]

Figure 1: Simplified view of the data flow of the Wishbone to NoC
bridge. The dotted lines are the registered acknowledgment signal. The
dashed line is an internal control signal that forces a load of the first value
in a transaction.
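One possible sizing argument, not spelled out in the paper: if every other node is limited to a known number of outstanding read requests, a sufficient queue depth follows directly.

```python
def read_request_queue_depth(num_nodes, max_outstanding_per_node):
    """A sufficient read request queue depth under the assumption that
    each of the other nodes has at most max_outstanding_per_node read
    requests in flight: even if all of them target this bridge at once,
    the queue cannot overflow at this depth."""
    return (num_nodes - 1) * max_outstanding_per_node
```

The actual bound for a given system depends on how many reads each master may have outstanding; the bridge described in Sec. 5 uses a 32-entry read request FIFO.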

4.2 Limitations
One problem in the Wishbone standard is that it is designed with a com-
binatorial bus in mind. If the bus is pipelined, it is no longer possible
to utilize Wishbone to its full potential. Wishbone provides signals for
handling burst reads but the only length indication which is provided
for a linear burst is the fact that at least one more word is requested. This
causes problems if many pipeline stages separate the slave and the mas-
ter. We have augmented the Wishbone interface with a transaction length
signal so that a read reply will contain exactly the number of words that
have been requested.
The current version of the Wishbone bridge also assumes that a slave
will not answer a Wishbone request with a retry or an error. Handling
these signals in a fully Wishbone compliant way would severely reduce
the performance of the NoC. As a future extension some sort of error
reporting register should be introduced to the bridge.
Also, while the bridge does not handle retries itself, it will issue a
retry to the Wishbone master if a Wishbone read request is received when
the answer to a previous read request has not yet arrived. It will also issue
a retry if a Wishbone write request is received when the NoC is unable
to receive further messages due to a full FIFO. The Wishbone master must
honor this request and release the bus for at least one clock cycle if any
other device is connected to the same Wishbone bus in order to avoid
deadlocks.

4.3 Testing
Both the Wishbone wrappers and the NoC architecture have been tested
in RTL simulations in different NoC configurations (different number of
nodes and switches). The largest design we have tested contains 16 NoC
switches, 32 wishbone/NoC bridges, 96 memories, and 96 transaction
generators. The NoC has also been tested on a Virtex-4 SX35 based
FPGA where we tested a four node NoC with 12 Wishbone bridges con-
nected to memories and transaction generators.

5 Results
The resource utilization of our design is shown in Table 3 and compared
with three other publications.¹
¹ Errata: A mistake was made when preparing this table. For the 4 port
switch, the number of LUTs also includes the number of LUTs used as
SRL16, but we forgot to take the number of SRL16 into account for the
five port switch. The 230 LUTs used as SRL16 components are missing
in the figure given for the five port switch.

                    Data     Virtex-II  Virtex-II  Virtex-4  Latency   Slices  LUTs   Flip
                    width    6000-4     Pro 30-7   LX80-12   (cycles)                 Flops
Our 4 port switch   36 bits  166 MHz    257 MHz    272 MHz   3         431     780    452
Our 5 port switch   36 bits  151 MHz    244 MHz    260 MHz   3         659     826    615
[6] (4 ports)       32 bits  166 MHz    -          -         6         1464    -      -
PNoC [7] (4 ports)  32 bits  -          138 MHz    -         -         364     -      -
NoCem [8] †         32 bits  -          150 MHz    -         -         -       1455‡  -

† The number of ports for this value is not stated in the paper.
‡ Not explicitly mentioned in the paper; calculated from the size of a 2×2 NoC.

Table 3: The performance of our NoC compared to other FPGA based
NoCs.

When compared to the packet switched architecture in [6], our archi-
tecture can operate at the same frequency in the same FPGA technology
whereas our switch only uses 30% of the slices (in fairness, the authors
hint that their NoC could be faster but they do not give a maximum num-
ber). When compared to [7], the system is capable of operating at a sig-
nificantly higher frequency while being only slightly larger (in addition
to serving slightly wider links). The authors also do not mention how
deadlocks are avoided or handled in their design. The latency of their
NoC is also unknown.
Our NoC can also operate at a higher clock frequency than NoCem [8]
with less resource usage. However the resource usage comparison is
not completely fair since NoCem is capable of handling virtual channels
(although [8] does not mention if the reported LUT resource usage is with
or without virtual channels).
Finally, the size of the Wishbone bridge depends on the routing table
and the size of the read request FIFO, but a typical bridge with a simple
routing table and a 32-entry read request FIFO will use 450 LUTs and 429
Flip Flops.

6 Future work
Since we will release this work as open source, it is our hope that this
research project can serve as a platform upon which further FPGA based
NoC research can take place.
The NoC architecture is available for use under the MIT license at
http://www.da.isy.liu.se/research/soc/fpganoc/

7 Conclusion
In this paper we have presented an open source Network-on-Chip ar-
chitecture optimized for the Virtex-4 FPGA. The network can operate at
over 260 MHz and the area for a NoC switch is significantly smaller than
for previous results at the same operating frequency. We have also pre-
sented a bridge which allows Wishbone compatible components to be
connected to this NoC.

References
[1] W.J. Dally and B. Towles. Principles and Practices of Interconnection
Networks. Morgan Kaufmann, 2004.

[2] William J. Dally and Brian Towles. Route packets, not wires: On-
chip interconnection networks. In Design Automation Conference,
pages 684–689, 2001.

[3] K. Goossens, J. Dielissen, and A. Radulescu. Aethereal network on
chip: concepts, architectures, and implementations. Design & Test of
Computers, IEEE, 22, 2005.

[4] D. Bertozzi and L. Benini. Xpipes: a network-on-chip architecture
for gigascale systems-on-chip. Circuits and Systems Magazine, IEEE,
4, 2004.

[5] T.A. Bartic, J.-Y. Mignolet, V. Nollet, T. Marescaux, D. Verkest, S. Ver-
nalde, and R. Lauwereins. Highly scalable network on chip for re-
configurable systems. System-on-Chip, 2003. Proceedings. Interna-
tional Symposium on, 2003.

[6] Nachiket Kapre, Nikil Mehta, Michael deLorimier, Raphael Rubin,
Henry Barnor, Michael J. Wilson, Michael Wrighton, and André De-
Hon. Packet switched vs. time multiplexed FPGA overlay networks.
IEEE Symposium on Field-Programmable Custom Computing Machines,
2006.

[7] C. Hilton and B. Nelson. PNoC: a flexible circuit-switched NoC
for FPGA-based systems. Computers and Digital Techniques, IEE
Proceedings-, 153, 2006.

[8] Graham Schelle and Dirk Grunwald. Onchip interconnect explo-
ration for multicore processors utilizing FPGAs. 2nd Workshop on Ar-
chitecture Research using FPGA Platforms, 2006.

[9] NoCem – network on chip emulator.

[10] T. Mak, P. Sedcole, P. Y.K. Cheung, and W. Luk. On-FPGA communi-
cation architectures and design factors. 16th International Conference
on Field Programmable Logic and Applications, 2006.

[11] D. Wiklund and D. Liu. SoCBUS: switched network on chip for hard
real time embedded systems. Parallel and Distributed Processing Sym-
posium. Proceedings. International, 2003.

[12] Wishbone system-on-chip (SoC) interconnection architecture for
portable IP cores, 2002.

Paper III
Thinking outside the flow:
Creating customized backend
tools for Xilinx based designs

Andreas Ehliar and Dake Liu


Department of Electrical Engineering
Linköping University
Sweden
email: {ehliar,dake}@isy.liu.se

© 2007 FPGAWorld.com. Reprinted from the 4th annual FPGAworld Conference, Thinking outside the flow:
Creating customized backend tools for Xilinx based designs, Ehliar, A., Liu, D.

This paper has been reformatted from double column to single column format for ease of readability.

Abstract
This paper is intended to serve as an introduction to how to build a cus-
tomized backend tool for a Xilinx based design flow. A Python based li-
brary called PyXDL is presented which allows a user to manipulate XDL
files which contain a placed and routed design. Three different tools are
presented which uses this library, ranging from a simple resource uti-
lization viewer to a tool which will insert a logic analyzer into an already
routed design, thus avoiding a costly complete rerun of the place and
route tool.

1 Introduction
Traditionally, users are not very interested in the inner workings of the
FPGA tool chain they are using. As long as everything is working cor-
rectly there is no perceived need to invest time and effort on learning
about obscure implementation details. Although most users have prob-
ably looked at a routed design in for example Xilinx’ FPGA editor rela-
tively few users have modified such a design.
There are however large opportunities for those who are interested
in inspecting and modifying placed and routed designs. For example, a
design viewer could be constructed that not only shows the slices of the
design, like the floorplanner does, but also figures out the functionality
of a slice and shows a symbol for a mux, adder, inverter, and so on. This
will allow a user to quickly see if the synthesizer has created reasonable
logic without having to load the FPGA editor which usually shows much
more detail than necessary.
In terms of modifying a placed and routed design, most users are
probably interested in tools that are helpful for debugging a design such
as instrumenting a design to improve the visibility of internal signals.
The FPGA editor has included functionality to insert probes into a de-
sign and route those signals to external pins for a long time and the
ChipScope [1] product has improved on this functionality by allowing

the user to insert a full logic analyzer into the FPGA.
Finally, when the usage of partial reconfiguration of FPGAs is more
widespread it is likely that already placed and routed designs will have
to be modified before deployment.
This paper presents a simple way to write useful programs capable
of inspecting and modifying placed and routed Xilinx designs. The used
method is to use the xdl tool to translate Xilinx proprietary NCD (Native
Circuit Description) files into XDL (Xilinx Design Language) text files
which can easily be processed by an application. A Python library called
PyXDL has been developed to analyze and modify XDL files, and three
different backend tools have been written in Python to demonstrate
the capabilities of this library. The first tool can take a design and
report the resource utilization of individual modules in the design. The
second tool is a design viewer capable of showing the type of logic in
each LUT as described above. The final tool allows a logic analyzer core
to be inserted into an already routed design and present a user interface
over RS232.
While it might seem esoteric and cumbersome to write your own
backend tool the main parts of the Python library and tools described
in this paper were actually written over a period of less than two weeks
(except for the logic analyzer core which was already written for another
project where it had to be manually instantiated in the RTL source code).
It is therefore feasible for even smaller developers to write their own cus-
tomized tools and we hope that this paper might serve as an inspiration
for like-minded developers.

2 Related work
As previously mentioned, the FPGA editor included in ISE can show a
design in more detail than most users care for. It is also possible to change
the design although this is probably impractical for larger changes. There
is also a command line version of the FPGA editor available called fpga_edline
which is capable of executing scripts created by the FPGA editor.

Unfortunately there is no documented way to control the FPGA edi-
tor from a user written program. The included scripting support is just a
way to repeat previously defined commands, the script language is not
a complete programming language. This makes it unsuitable for an ap-
plication that needs to read data from a design as opposed to making
changes to a design at fixed locations.
A much more interesting alternative is the JBits SDK [2] from Xilinx.
This allows Xilinx designs to be manipulated from Java. In fact, it proba-
bly contains all the functionality that a user could want in terms of design
manipulation. It isn’t publicly available and users have to ask for access
to it. The main drawback is that JBits has been discontinued and there is
no support at all for newer FPGAs in it (newer than Virtex-II) and there
seems to be little interest from Xilinx to add such support. In fact, if JBits
was publicly available with support for all new FPGAs from Xilinx, there
wouldn’t have been any need to write this paper.
Finally, abits [3] is a tool similar in spirit to JBits which allows Atmel
bit streams to be manipulated.

3 The XDL format


The XDL file format is an ASCII based translation of Xilinx’ proprietary
NCD file format. It will typically contain two types of statements, in-
stances and nets. An instance can be any logic element in the FPGA such
as for example a slice, ram block, or DSP block. It may or may not be
placed at a certain location. A net statement will describe the name of
a certain net and the instances it is connected to. It may also contain
routing information. An example of a very simple XDL file is shown in
Figure 1.
A drawback of the XDL file format is the scarcity of documentation.
Earlier releases of ISE such as 6.3 contained written documentation about
the file format [4]. Unfortunately this documentation has been removed
in later versions of ISE. Even so, some details of the XDL format weren't
documented in 6.3 either. Luckily, some basic information about the format
is included in every XDL output file created by the xdl tool unless the
-noformat switch is given.

net "simple_net" ,
  outpin "slice1" XQ ,
  inpin "slice2" BX ,
  ;

inst "slice1" "SLICEL",unplaced ,
  cfg "BXINV::BX CEINV::CE CLKINV::CLK
       DXMUX::BX FFX:slice1_r:#FF
       FFX_INIT_ATTR::INIT0" ;

inst "slice2" "SLICEL",unplaced ,
  cfg "BXINV::BX CEINV::CE CLKINV::CLK
       DXMUX::BX FFX:slice2_r:#FF
       FFX_INIT_ATTR::INIT0" ;

Figure 1: An example of a simple XDL file which shows two slices each
containing one flip flop connected by a wire.

4 PyXDL - Python based XDL manipulation library

A Python based library called PyXDL has been developed to simplify
development of backend applications. The basic idea behind the library
is to convert a placed and routed design into XDL by using the xdl tool
included in ISE. This file can be modified as required and converted back
into Xilinx native NCD format. This allows small changes to be made to
a design without requiring a complete and often time consuming synthesis,
placement, and routing iteration. This is accomplished by telling
par (the place and routing tool) to only route un-routed nets and only
place unplaced instances. (The guide-file feature of par is used for this
purpose.) This flow is illustrated in Figure 2.

4.1 Constraints

One problem which occurs when merging two designs, and which isn't
immediately obvious when looking at the XDL files, is the handling of the
constraints files. The timing constraints in these must also be merged if
reliable timing estimates are expected.

4.2 Resource analyzer script

The design resource analyzer is a small tool written for a designer who
wants to know the resource utilization of a certain module or modules in
a larger design. One way to figure this out is to synthesize that particular
module separately. This method may or may not work depending on the
properties of the larger design. For example, if the synthesizer can deter-
mine that only relatively few values can appear on a certain input port of
a module included in a larger design, the synthesizer could potentially
remove large parts of the module.
As hinted at in the previous section it would be better to be able to
analyze a large design directly to find the resource usage of individual
components. This is exactly what the resource analyzer script does as
shown in Figure 3. The script itself is very simple and the most complex
part is actually printing the design usage in a hierarchical and cumula-
tive fashion. This kind of XDL parsing, although easy, can still lead to
useful results. A regression test incorporating this script could for exam-
ple warn about a submodule which has grown (or shrunk) by a large
factor when compared to the previous run.
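The cumulative accounting itself can be sketched in a few lines: since the synthesizer encodes the module hierarchy in instance names with '/' separators, charging every instance to each prefix of its name yields per-module totals. A minimal illustration of the idea (hypothetical instance names, not the actual script):

```python
from collections import Counter

def cumulative_usage(instance_names):
    """Charge each instance to every level of its hierarchical name."""
    usage = Counter()
    for name in instance_names:
        parts = name.split("/")
        # Each prefix of the name is a module that contains this instance.
        for depth in range(1, len(parts)):
            usage["/".join(parts[:depth])] += 1
    return usage

insts = ["cpu/alu/add_r", "cpu/alu/sub_r", "cpu/decode/ir_r", "uart/tx_r"]
for module, count in sorted(cumulative_usage(insts).items()):
    print(module, count)
# cpu 3
# cpu/alu 2
# cpu/decode 1
# uart 1
```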

[Figure 2 shows the flow: source code and constraints pass through the
synthesizer (xst), ngdbuild, map, and par, producing a routed NCD file
which the xdl tool converts to XDL. The PyXDL design merger combines this
XDL file and its constraints (PCF) with the design to merge, producing a
partially routed XDL file and merged constraints, which are converted back
to NCD and finally routed by par.]

Figure 2: The typical Xilinx flow augmented with the PyXDL tool to
merge a design such as a logic analyzer into a placed and routed design.
The new part of the flow is shown in gray.
Figure 3: Using the resource analyzer script to view the resource utiliza-
tion of various parts of a design.

4.3 Design viewer

The design viewer is capable of viewing a design and showing the con-
figuration of the slices. It is similar in functionality to the floorplanner.
In Figure 4 a part of an OpenRISC based design is analyzed by the design
viewer.

4.4 Logic analyzer

Putting a logic analyzer into a chip is not a new idea. Both Xilinx and
Altera already offer such products (ChipScope and SignalTap). There
are also some logic analyzers written by hobbyists available on the net
such as Fpgadbg [5].
The main idea behind this section is to show that it is easy for any
user to duplicate the main selling point of ChipScope, i.e. the capability
to insert a core into an already synthesized and routed design. While it
would be easy to create a logic analyzer core which fully mimics Chip-
Scope by connecting to the internal boundary scan primitive we did not
intend this tool to be a ChipScope clone. Instead, the intention was that

Figure 4: An example of the output from the design viewer when run on
an OpenRISC 1200 based design.

this tool should be useful in systems that might not easily be connected
to a PC with a ChipScope client such as remote systems. Therefore the
logic analyzer core is operated via a simple serial port interface.
An example of the output of the logic analyzer is shown in Figure 6
and an example of a simple GUI which allows the core to be easily
inserted into a design is shown in Figure 7.

Implementation details

The design of the logic analyzer is shown in Figure 5. It consists of
a simple 8 bit microcontroller which is responsible for presenting a text
based user interface to a serial port. The MCU is connected to a logic
analyzer core via a Wishbone bus. This bus also creates an easy way to
extend the functionality of this core with additional modules. The logic
analyzer is currently hardcoded for a maximum of 64 signals which are
stored in a 2 kiloword buffer.
The Python GUI allows the user to load an XDL design and select
which nets to monitor. After the user is satisfied with the selection the
program will load the synthesized version of the logic analyzer and re-
move any elements which will make it hard to merge the logic analyzer
into the design (e.g. IOBs and BUFGs). The appropriate flip-flops in the
logic analyzer are added as an extra destination of the selected nets. The
program memory of the MCU is also modified so that net information
such as name and width is available to it. Finally, a user selected clock
net is connected to all flip-flops in the logic analyzer core.
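The trigger mechanism in Fig. 5 (a trigger mask, a trigger value, and a comparator) boils down to one masked comparison per captured sample. A behavioral sketch, with the monitored signals packed into a single integer (the STB/ACK bit positions are invented for illustration):

```python
def make_trigger(mask, value):
    """Return a predicate which fires when all bits selected by mask match value."""
    def triggered(sample):
        return (sample & mask) == (value & mask)
    return triggered

# Hypothetical bit assignment: bit 0 = STB, bit 1 = ACK.
STB, ACK = 1 << 0, 1 << 1

# Trigger when STB and ACK are both asserted, as in the Figure 6 example.
trig = make_trigger(mask=STB | ACK, value=STB | ACK)

samples = [0b00, 0b01, 0b11, 0b10]
print([trig(s) for s in samples])  # [False, False, True, False]
```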
The curious reader is also referred to the appendix, which contains an
example of how PyXDL can be used to merge a small design into a large design.

4.5 Availability of PyXDL

The PyXDL library will be published under the GPL at
http://www.da.isy.liu.se/~ehliar/pyxdl/
together with the sample applications described in the previous sections.
The RTL code of the logic analyzer core will also be made available un-
der the MIT license so that users can use and distribute merged designs
without worrying about the stricter terms of the GPL license.

[Figure 5 shows the logic analyzer module: an 8 bit MCU with program
memory and an RS232 UART is connected over a Wishbone bus to the logic
analyzer core, which contains the sample memory, the trigger mask and
trigger value registers, a comparator, and control logic for the
monitored signals.]

Figure 5: An overview of the logic analyzer module.

5 Discussion

The applications presented in this paper show only a few of the many
possibilities that could be tapped by a creative designer. The applications
described earlier could of course be extended further. The
design viewer could be improved to show more points of interest to a
designer such as clock domain crossings, pipeline depths, and perhaps
even show some sort of design complexity metrics for different parts of
the design (a long pipeline without feedback is far less complicated and
probably easier to test and verify than a state machine with many feedback
paths).
The logic analyzer could be improved by adding additional modules
to it such as counter modules for statistic gathering. Another interesting
addition would be to replace the RS232 interface with another interface
such as for example Ethernet or USB.

5.1 Other possible applications

There are many other interesting applications which would be possible
to develop. One example would be for those interested in very large
FPGA designs that must be mapped onto several FPGAs. A tool could
be created that automatically partitioned the XDL file into more than one
FPGA.
A similar tool could be made that partitions a design for a large
FPGA into different regions of that FPGA. The advantage of such a
tool would be that the time consuming placement and routing of the
partitioned design could easily be parallelized on a cluster of computers.

5.2 Remaining issues

There are unfortunately some issues that are hard to solve in a satis-
factory fashion. The main problem is that there is very little informa-
tion available about routing. Whereas placement is relatively straight-
forward, reliably routing a design requires detailed timing information
about the internals of the FPGA, something which Xilinx hasn’t released
for modern FPGAs and most likely will not release for the foreseeable
future.
Another problem that any tool of this kind will face is that the syn-
thesized design isn’t exactly the same as the RTL source code. The vari-
ous optimizations employed by the synthesizer will remove and rename
many nets, making it harder to find the correct signal/bus to inspect.
This could be mitigated if more back-annotation information was avail-
able to the tools.

Figure 6: The logic analyzer user interface showing instruction fetches
on a Wishbone bus. The analyzer has been set to trigger when STB and
ACK are both asserted.

Finally, the PyXDL library has only been tested on Virtex-4 based designs.

6 Conclusion

We have shown that it is easy to create powerful backend tools for a Xil-
inx based design flow such as a logic analyzer inserter. By manipulating
the design file directly, a time consuming full synthesis/placement/routing
iteration is avoided, thereby increasing productivity. It is our
intention that this paper will inspire other designers to explore these
possibilities as well.

Figure 7: The GUI used to insert the logic analyzer core into a design.

References
[1] Xilinx. ChipScope Pro.

[2] Xilinx. JBits SDK.

[3] Adam Megacz. A library and platform for FPGA bitstream manipulation.
Proceedings of IEEE Symposium on Field-Programmable Custom Computing
Machines (FCCM'07), 2007.

[4] Xilinx. Xilinx Design Language. help/data/xdl/xdl.html in ISE 6.3, 2000.

[5] Wojtek Zabolotny. Fpgadbg – a tool for FPGA debugging. 2006.

PyXDL example

This appendix contains an example of how to use PyXDL to merge a synthesized
design into a larger design. The example consists of a design which will monitor
a signal and assert an external signal forever if an internal signal has ever been
asserted (e.g. an error signal of some sort). In order to shorten the example,
the constraints file is not updated with the timing group from the small design.
Some values are also hardcoded instead of dynamically getting the values from
the XDL files, such as the name of the clock networks.

PyXDL source code to merge a synthesized design (test.xdl) into a large
design (system.xdl):
from xdl import xdl, xdlnet
from pcf import pcf
from xdlutil import par_with_guide

largedes = xdl("system.xdl")
largedespcf = pcf("system.pcf")

# Clock network for the large design
clocknet = largedes.netsbyname["clk_i_BUFGP"]

tinydes = xdl("test.xdl")

# Unplace stuff we don't need
tinydes.unplace_design()
tinydes.remove_unused_dcminsts()
tinydes.remove_inst("clk")
tinydes.remove_net("clk")

# Create a unique prefix for the other design so
# that we don't have to worry about name clashes
tinydes.add_prefix("TEST/")

# Convert flip flop in the IOB to an internal signal
myiob = tinydes.insts["TEST/testin"]
testinpin = tinydes.convert_input_to_internal(myiob)

oldclknet = tinydes.netsbyname["TEST/clk_BUFGP"]

# Remove old clock network
tinydes.remove_net("TEST/clk_BUFGP")
tinydes.remove_inst("TEST/clk_BUFGP/BUFG")

# Merge designs
largedes.mergedesign(tinydes)

# Merge old clock network into new design
for pin in oldclknet.inpins:
    largedes.add_inpin_to_net(clocknet, pin[0], pin[1])

# Select signal to monitor
thenet = largedes.netsbyname["traceit/state_r_FFd1"]
largedes.add_inpin_to_net(thenet, testinpin[0], testinpin[1])

# Add the IOB to the PCF constraint file and
# select where to place it (at pin AC6)
largedespcf.addiob("TEST/testout", "AC6")

# Place and route the design
par_with_guide(largedes, largedespcf, "new.ncd", "tmp")

Verilog source code for a simple monitor application. testout will be asserted
if testin has ever been asserted:

module test(
    input clk,
    input wire testin,
    input wire rst,
    output reg testout);

  reg tmp, sample;
  wire fbloop;

  always @(posedge clk) begin
    sample <= testin;
    tmp <= fbloop;
    testout <= tmp;
  end

  FD monitorfd(.C(clk), .D(fbloop | sample), .Q(fbloop));

endmodule // test

Paper IV

A High Performance
Microprocessor with DSP
Extensions Optimized for the
Virtex-4 FPGA

Andreas Ehliar, Per Karlström, Dake Liu


Department of Electrical Engineering
Linköping University
Sweden
email: {ehliar,perk,dake}@isy.liu.se

Portions reprinted, with permission, from Field Programmable Logic and Applications, 2008 (FPL 2008),
International Conference on, Ehliar, A., Karlström, P., Liu, D. (© 2008 IEEE)

This paper has been reformatted from double column to single column format for ease of readability.

Abstract
As the use of FPGAs increases, the importance of highly optimized pro-
cessors for FPGAs will increase. In this paper we present the microarchi-
tecture of a soft microprocessor core optimized for the Virtex-4 architec-
ture. The core can operate at 357 MHz, which is significantly faster than
Xilinx’ Microblaze architecture on the same FPGA. At this frequency it
is necessary to keep the logic complexity down and this paper shows
how this can be done while retaining sufficient functionality for a high
performance processor.

1 Introduction
The use of FPGAs has increased steadily since their introduction. The
first FPGAs were limited devices, usable mainly for glue logic whereas
the capabilities of modern FPGAs allow for extremely varied use cases
in everything from high end communication and networking equipment
to consumer devices like flat screen televisions. In many cases, a soft
processor core is an important part of the design.
The main players in this market are Altera’s Nios, Xilinx’ Microblaze
and Lattice’ Mico32. All are capable microcontrollers based on a tradi-
tional RISC pipeline. However, there is little choice available if a soft DSP
processor core is needed. Some might argue that a DSP processor core
is unnecessary in an FPGA as DSP computations can instead be handled
by custom designed IP blocks. For example, a radar processing core can
easily fill an entire high end FPGA with high utilisation rate of all func-
tional units. On the other hand, it is harder to design a system which
will use a wide variety of different DSP algorithms if custom IP blocks
are used for each algorithm. As an example, a video conference system
might use a hardware accelerated video encoder and a software based
video decoder and audio codec. There are many reasons for partitioning
the design like this, including better hardware utilization and shorter
development time due to software reuse and simplified debugging.

In this paper we will present a high speed soft microprocessor core
with DSP extensions optimized for the Virtex-4 FPGA family. The mi-
croarchitecture of the processor is carefully designed to allow for high
speed operation.

2 Related Work
There are many soft processor cores available for FPGA usage although
Nios II, Mico32, and Microblaze are common choices thanks to the sup-
port from their vendors.
Altera’s Nios II [1] is a 32-bit RISC processor that comes in three
flavors: e, s, and f, with one, five, or six pipeline stages respectively.
Xilinx’ Microblaze is a 32-bit RISC processor [2] optimized for Xilinx
FPGAs.
Lattice’ Mico32 is a 32-bit RISC processor [3] with a six stage pipeline.
The source code of Mico32 is also available under an open source license.
Besides the vendor supported processors there are a wide variety of
processor cores available, both commercial and open source. Notable
cores include OR1200 [4], Leon [5], and OpenSparc [6]. These processors are
targeted at ASICs but have found a use on FPGAs as well.

3 Overview
Our main design goal was to create a high speed soft processor core
with support for common DSP operations. In addition, the processor
should be reasonably easy to program without intimate knowledge of the
pipeline. It should also be possible to write a decent compiler backend
for the processor. Finally, the processor footprint should not be excessive.

3.1 Tradeoffs
It is hard to create a processor which is both fast and easy to program.
A fast processor will have a deep pipeline, forcing the programmer (or

compiler) to think hard about instruction scheduling and branch penalties.
On the other hand, a programmer friendly processor presents an ar-
chitecture with few surprises trading either speed or hardware complex-
ity for ease of use.
Our goal was to create a high speed processor which is still relatively
easy to program. For example, in order to increase the maximum clock
frequency our processor only has partial support for register forwarding.
The result of some instructions cannot be forwarded directly to other
execution units. Typically one or two other instructions have to be issued
before the result of an operation can be reused on another execution unit.
We feel that this is an acceptable tradeoff, based on our experience with
other processors without any forwarding at all [7].

4 Architecture
The architecture is RISC based with six pipeline stages: fetch, decode /
read operands, register forwarding, execute1, execute2, and writeback.
The processor is a 32-bit microprocessor with 16 general purpose regis-
ters. The address space is limited to 16 bits. The instruction set contains
a fairly standard set of RISC instructions.
The instruction words are 27 bits wide and up to 7 bits can be used
for immediates. Longer immediates can be handled either by a 128 entry
lookup-table or by using an extra SETHI instruction.
Special purpose registers are used for I/O and processor configura-
tion.

4.1 Register Forwarding


Register forwarding is implemented as a separate pipeline stage. This
means that in general, the result of one operation cannot be forwarded
directly to the next instruction. To mitigate this, the arithmetic unit can
forward a result of an arithmetic operation directly to itself. Similarly,

[Figure 1 shows the forwarding architecture: the outputs of three execution
units each pass through a register (R), the registered outputs are combined,
select logic chooses the active unit, and the result is distributed to the
other pipeline stages.]

Figure 1: Forwarding architecture

the result of a logic unit operation can be forwarded directly to the logic
unit.
The principle of the forwarding unit is shown in Fig. 1. To reduce
the size of the mux, the pipeline is constructed so that signals from one
pipeline stage can be or:ed together. This is accomplished by utilizing the
reset input of the flip-flops after each execution unit to set all non-active
execution unit outputs in a certain pipeline stage to zero.

4.2 Arithmetic Unit

The arithmetic unit (AU), shown in Fig. 2 is one of the most critical parts
of the entire processor. As mentioned earlier we could not afford to have
full register forwarding in this processor. The main reason for this is the
32 bit adder in the AU. If a large mux is inserted before the inputs to the
adder the critical path would be too long (e.g. a 32 bit adder with 4-to-1
muxes in front of each operand can be synthesized to only 290 MHz in a
Virtex-4 speedgrade 12).

Figure 2: The architecture of the arithmetic unit and the principles of the
final part of the register forwarding unit

However, the processor is able to forward results from the adder back
to the adder without any penalty to support a sequence of AU instruc-
tions. As can be seen in Fig. 2, the result of the addition can only be for-
warded to one of the inputs of the adder. Due to the design of a slice in a
Virtex-4 it is not possible to put a mux in front of the other operand when
only one LUT is used per bit in the adder. This complicates forwarding
since either operand has to be able to be forwarded to any input of the
adder. To solve this problem the previous pipeline stage is responsible
for ensuring that the correct operand appears on the inputs. An example
is shown in Table 1. It should also be noted that only the principles for
the register forwarding pipeline stage are shown in the figure. In our
implementation this has been merged into the same LUTs that are used to
implement the forwarding shown in Fig. 1 to reduce the number of logic levels.

Instruction sequence   Forwarded operand   Control signals
add r2,r1,r0           -                   -
add r2,r1,r2           OpB                 Force0=1 Select=1
add r2,r2,r1           OpA                 Swap=1 Force0=1 Select=1
sub r2,r1,r2           OpB                 Force1=1 Select=1
sub r2,r2,r1           OpA                 Swap=1 InvA=1 Select=1
sub r2,r2,r2           Both                Replace with set r2,#0
add r2,r2,r2           Both                Cannot forward directly

Table 1: Forwarding operands to the arithmetic unit
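The decisions in Table 1 can be captured as a small decision function. The sketch below is a behavioral model of the table only (register names as strings, a single previous destination register), not of the actual LUT implementation:

```python
def au_forward_controls(op, src_a, src_b, prev_dest):
    """Model of Table 1: control signals for forwarding into the AU adder.

    Returns a dict of asserted control signals, an empty dict when no
    forwarding is needed, or None for the cases the table cannot forward
    directly (both operands depend on the previous result).
    """
    fwd_a, fwd_b = src_a == prev_dest, src_b == prev_dest
    if fwd_a and fwd_b:
        return None  # sub r2,r2,r2 becomes set r2,#0; add r2,r2,r2 must wait
    if not (fwd_a or fwd_b):
        return {}
    if op == "add":
        # OpB is forwarded as-is; OpA is forwarded by swapping operands first.
        if fwd_b:
            return {"Force0": 1, "Select": 1}
        return {"Swap": 1, "Force0": 1, "Select": 1}
    if op == "sub":
        if fwd_b:
            return {"Force1": 1, "Select": 1}
        return {"Swap": 1, "InvA": 1, "Select": 1}
    raise ValueError("not an AU instruction: " + op)

# add r2,r1,r2 after a write to r2 forwards operand B:
print(au_forward_controls("add", "r1", "r2", prev_dest="r2"))
# {'Force0': 1, 'Select': 1}
```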

4.3 Branching
Branches always have one delay slot. If absolute addressing is used for the
jump address, the processor can immediately start executing the target
instruction after the delay slot.
The processor has 4 status flags: Z (zero), V (overflow), N (negative),
C (carry). An arithmetic or logic instruction will change these flags. The
critical path of this unit is the Z flag generation. This is performed partly
in the AU and LU units. In the AU unit, the 20 lower bits are preprocessed
in groups of four bits using five 4-input or-gates. In the LU unit, the
entire 32 bit result is preprocessed in the same way using eight 4-input
or-gates. Thanks to this preprocessing of the Z flag it is possible to start
branch condition computation one pipeline stage earlier.
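The effect of the group-wise preprocessing is that the final Z decision only has to look at a handful of group bits instead of the whole 32-bit result. A behavioral sketch of the scheme (modeled here on the full 32-bit result, as in the LU):

```python
def z_flag(result32):
    """Z flag computed via 4-bit group OR preprocessing."""
    # Execute stage: collapse each 4-bit group to one bit (eight or-gates).
    group_bits = [1 if (result32 >> (4 * i)) & 0xF else 0 for i in range(8)]
    # Branch logic: only eight group bits left to NOR together.
    return 0 if any(group_bits) else 1

print(z_flag(0x00000000))  # 1: the result is zero
print(z_flag(0x00010000))  # 0: one group is non-zero
```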
Conditional jumps are statically predicted using a bit in the instruc-
tion word. A correctly predicted conditional jump has no penalty cycles.
A mispredicted jump has a penalty of either three or four cycles.
A register indirect jump always has a penalty of four cycles.
If the branch prediction was wrong, the speculatively fetched instruc-
tions are invalidated before entering the execute1 stage.

4.4 Memory Architecture
There are three memories in the system: program, data, and constant
memory. The program memory is 27 bits wide, the data and constant
memories are 32 bits wide. Both the constant and data memory can be ad-
dressed using address generator units described in the next section. The
constant memory is also used as a lookup table for the 128 constants de-
scribed in Section 4.
The data memory can be addressed using a value from the register
file plus an 8 bit offset in the instruction word. The adder is located in
the same pipeline stage as register forwarding. This is done to minimize
the complexity before the memory. This also means that the register used
must be written to the register file before being used for addressing mem-
ory. We believe that this is an acceptable tradeoff as one very common
usage for this addressing mode is accessing variables on the stack and
the stack pointer is unlikely to change very often.
The data memory is byte addressable which is important if high level
languages like C and C++ are used to write programs for the processor.

4.5 DSP Extensions


A few architectural features can greatly improve the performance of DSP
applications. These are the multiply-and-accumulate (MAC) unit, the
circular buffer and zero overhead loop support.
The MAC operation is ubiquitous in DSP applications and fairly easy
to implement in hardware. Four DSP48 blocks were used to implement
a 32 × 32 bit multiplication followed by a 64 bit accumulation unit. In
total, the MAC unit has six pipeline stages. Due to the long latency of
this unit the results are written to a special accumulation register instead
of the normal register file. The MAC unit is also used for multiplication
without accumulation.
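The MAC datapath can be modeled bit-exactly in software, which is useful when verifying the hardware. The sketch below assumes signed 32 x 32-bit operands and a 64-bit accumulator that simply wraps on overflow (a modeling assumption; the paper does not state the overflow behavior):

```python
MASK64 = (1 << 64) - 1

def to_signed(val, bits):
    """Interpret a bit pattern as a two's complement signed number."""
    return val - (1 << bits) if val & (1 << (bits - 1)) else val

def mac(acc, a, b):
    """One multiply-and-accumulate step: acc += a * b, 64-bit wraparound."""
    prod = to_signed(a & 0xFFFFFFFF, 32) * to_signed(b & 0xFFFFFFFF, 32)
    return (acc + prod) & MASK64

acc = 0
for a, b in [(3, 4), (-2, 5), (7, -1)]:
    acc = mac(acc, a, b)
print(to_signed(acc, 64))  # 3*4 + (-2)*5 + 7*(-1) = -5
```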
The operands of the MAC instruction can either be fetched from the
register file or from the data and constant memory. Special address gen-
erator units (AGU) are connected to the data and constant memory to

[Figure 3 shows the pipelined address generation unit: the step size
register feeds an adder that advances the current address, the sum is
compared against the adjusted end address, and the comparison result
selects between the wrapped and unwrapped next ADDR.]

Figure 3: Address generation unit

allow for a steady stream of data from the memories to the MAC unit.
The AGUs support linear and circular addressing.
For each memory access, the AGU increases the current address with
a configurable stepsize. In circular addressing, a start and end address
constrains the range of valid addresses. If the next address is located be-
yond the end address, the next address is set to CURRENT_ADDRESS +
STEPSIZE - BUFFER_SIZE, where BUFFER_SIZE is the size of the circu-
lar buffer.
A straightforward hardware implementation of this calculation
could be synthesized to 209 MHz. The next address is compared to
END_ADDRESS and, if it is too large, adjusted as described above.
Pipelining is used to improve the performance of the address gen-
erator. Due to the pipelining, the address must be compared to
END_ADDRESS-2*STEPSIZE (END_ADDRESS-STEPSIZE for the first
iteration) instead of END_ADDRESS. The pipelined address generator
is shown in Fig. 3.
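A behavioral model of the circular addressing scheme is shown below. It ignores the pipelining; as described above, the pipelined hardware instead compares against END_ADDRESS - 2*STEPSIZE (END_ADDRESS - STEPSIZE on the first iteration). The buffer bounds used in the example are arbitrary.

```python
def next_address(current, step, end, buffer_size):
    # Step linearly; wrap back by the buffer size when the next
    # address would pass the end of the circular buffer.
    nxt = current + step
    if nxt > end:
        nxt -= buffer_size
    return nxt

# Walking an 8-entry circular buffer (addresses 0..7) with step 3:
addr, trace = 0, []
for _ in range(6):
    trace.append(addr)
    addr = next_address(addr, 3, end=7, buffer_size=8)
```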

To improve the performance of the small loops typical of DSP kernels,
there is also a loop instruction available which allows for up to 65535
loop iterations.

5 Results
A floorplanned version of the processor can operate at 357 MHz in a
Virtex-4 LX80 (speed grade -12) according to static timing analysis. With-
out floorplanning, the maximum frequency is 334 MHz. The processor uses
1197 slices, 1716 LUTs, and 1301 flip-flops. The largest parts of the pro-
cessor are the shifter (405 LUTs, 131 flip-flops) and the register forward-
ing pipeline stage (264 LUTs, 64 flip-flops).

6 Discussion
In order to reach a clock frequency of 357 MHz in a Virtex-4 FPGA, a
number of compromises had to be made. This means that the proces-
sor will have a few quirks not found in more general processors. The
most important impact of this is that the pipeline is partly visible to the
programmer.
According to [8] on Xilinx’ homepage, the Microblaze processor can
run at 160 MHz in a Virtex-4. We have, however, seen figures of up to
200 MHz reported for the Microblaze on Virtex-4 [9]. Even so, our pro-
cessor has a maximum clock frequency which is almost 80% faster than
Microblaze. In addition, it is also operating at a significantly higher fre-
quency than the Microblaze on a Virtex-5. This does not mean that all
applications will be 80% faster when running on our processor. Some
programs will require more clock cycles to run on our processor, due
to the incomplete register forwarding. However, DSP applications can
typically be rewritten to compensate for the lack of register forwarding
by proper instruction scheduling and algorithm selection. For example,
in [7], only 10% of the cycles were wasted on NOP instructions and that

processor has no support for register forwarding at all. A more thorough
examination of the results of incomplete forwarding can be found in [10].
We also acknowledge that standardized benchmarks are required to
fully evaluate our processor.

6.1 Future Work


Our final goal is a soft processor core optimized for DSP computations
on FPGAs. To reach this goal it is necessary to benchmark the processor
using a number of realistic DSP applications. Unfortunately, such bench-
marks are not easily performed as there is not yet a compiler available
for this processor.
As already explained in Section 4.2 there is not enough time avail-
able to have full forwarding in front of the arithmetic unit, but it might
be possible to forward operands from the adder directly to the logic
unit, shifter, and memory unit. This should be evaluated with the help
of benchmarks.
Other possible improvements are caches, interrupts, floating point
instructions, and a memory management unit.

7 Conclusion
It is not possible to design a really fast processor in an FPGA without
some quirks. It is, however, possible to design a processor where the im-
pact of these quirks is reduced.
Like all high speed designs, a high speed microprocessor has to keep
the logic complexity between flip-flops at a minimum. Unlike many
other high speed designs, the pipeline also has to be short.
This paper has demonstrated a number of ways to deal with these
issues, resulting in a processor which can operate at a much higher clock
frequency than Xilinx’ Microblaze. The architectural details and trade-
offs presented here should be of interest to anyone who is interested in
processor design for FPGAs.

Acknowledgments
Thanks to Prof. Lars Svensson for an interesting discussion regarding the
processor described in this chapter.

References
[1] Altera. Nios II Processor Reference Handbook, 2007.

[2] Xilinx. MicroBlaze Processor Reference Guide UG081 (v9.0), 2008.

[3] Lattice. LatticeMico32 Processor Reference Manual, 2007.

[4] Damjan Lampret. OpenRISC 1200 IP Core Specification, 2001.

[5] Gaisler Research. The LEON processor user’s manual, 2001.

[6] Sun. OpenSPARC T2 Core Microarchitecture Specification, A edition,
December 2007.

[7] J. Eilert, A. Ehliar, and Dake Liu. Using low precision floating point
numbers to reduce memory cost for mp3 decoding. Multimedia Sig-
nal Processing, 2004 IEEE 6th Workshop on, pages 119–122, 2004.

[8] Xilinx Inc. Microblaze - the industry’s most flexible embedded pro-
cessing solution, 2006.

[9] Peter Clarke. Xilinx raises soft processor clock frequency 25%, 2005.

[10] P.S. Ahuja, D.W. Clark, and A. Rogers. The performance impact
of incomplete bypassing in processor pipelines. Microarchitecture,
1995. Proceedings of the 28th Annual International Symposium on, pages
36–45, Nov-1 Dec 1995.

Paper V

High performance, low-latency


field-programmable gate
array-based floating-point adder
and multiplier units in a Virtex 4

Per Karlström, Andreas Ehliar, Dake Liu

Department of Electrical Engineering
Linköping University
Sweden
email: {perk,ehliar,dake}@isy.liu.se

© 2008 IET. Reprinted from IET Computers & Digital Techniques, Vol. 2, No. 4, pp. 305-313, July 2008.
High-performance, low-latency field-programmable gate array-based floating-point adder and multiplier units in
a Virtex 4, Karlström, P., Ehliar, A., Liu, D.

This paper has been reformatted from double column to single column format for ease of readability.

Abstract
There is increasing interest in floating point arithmetic on FPGAs,
thanks to the increase in their size and performance. While FPGAs are
generally good at bit manipulation and fixed point arithmetic, they
have a harder time coping with floating point arithmetic. In this paper
we describe, in detail, an architecture used to construct high performance
floating point components in a Virtex-4 FPGA. We have constructed a
floating point adder/subtracter and a multiplier. Our adder/subtracter
can operate at a frequency of 377 MHz in a Virtex-4SX35 (speed grade
-12).

1 Introduction
Modern FPGAs are great assets as hardware components in small vol-
ume projects or as hardware prototyping tools. The increasing cost of
ASIC production is also a contributing factor to the increased use of
FPGAs [1].
More features are added to the FPGAs with every generation, making
it possible to perform computations at higher clock frequencies. Dedi-
cated carry chains, memories, multipliers and, in the most recent FPGAs,
larger blocks aimed at DSP computations and even processors have been
incorporated into the otherwise homogeneous FPGA fabric. All of these
improvements accelerate fixed point computations but no improvements
are directly aimed at improving floating point performance. Lacking any
direct support for floating point computations, it is important for design-
ers to know how to utilize the available resources as efficiently as possi-
ble.
FPGAs are not limited to just small volume production and proto-
typing. There is active research in the field of reconfigurable computing,
where processors reconfigure FPGAs (or similar devices) during run-
time to speed up critical inner loops. These systems range from multichip
systems with dedicated processors and FPGAs to solutions where the en-
tire system has been integrated into a single chip; [2] describes this field
in more detail. Many of these solutions aim to automatically transform
C code (or code at a similar level of abstraction) into FPGA configurations
to speed up critical parts of programs. In many cases the applications
might require floating point computations and it is therefore important
to have good floating point units in the FPGA.
Floating point arithmetic is useful in applications where a large dy-
namic range is required or in rapid prototyping for applications where
the required number range has not been thoroughly investigated. Float-
ing point numbers are used extensively in modern applications, e.g. 3D
graphics, audio codecs, radar, and scientific computing. Many of these
applications are limited by the available computation power. This short-
age of computation power has started a trend of using FPGAs to boost
performance in a cost effective manner. In particular, scientific comput-
ing relies on floating point arithmetic [3].
This paper outlines one solution for integrating single precision float-
ing point computations into an FPGA. Previous solutions are either slow,
have high latency, or fail to disclose the architecture used to reach the
published performance. A solution, for single precision floating point
computing, comparable to the performance of commercial IP cores is pre-
sented, as well as the details of such an implementation. To the knowl-
edge of the authors, this has not been done before. For example, this
paper will present the details of a fast normalizer architecture for FPGAs
and the often overlooked aspect of the sticky bit generation, which needs
some special care to achieve timing closure.
In general, it is possible to trade higher throughput for longer latency
(in terms of clock cycles) by increasing the number of pipeline stages.
However, in many systems the point of diminishing returns is quickly
reached as the number of pipeline stages is increased. This is especially
true for algorithms with many data dependencies that cannot easily be
parallelized due to an increasing number of cycles used solely to wait for
values to be computed. Therefore, the overall goal of our design was to
balance throughput, latency, and area.

In summary, FPGAs are becoming more and more important as com-
puting devices and will be able to replace more ASICs, thus avoiding
the expensive ASIC development process. But a good FPGA fabric is
not enough; there must be good designs to configure the fabric with if
FPGAs are to be used successfully. This article intends to show good
design techniques for floating point units in FPGAs.

2 Related Work
A number of attempts at constructing floating point arithmetic in FPGAs
have been made and presented in academia. However, many of the papers
are somewhat old and few target modern FPGAs such as the Virtex-
4. This work is based on a study [4] that did not include the round to
nearest even mode, which is important for IEEE 754 compliance.
High-performance floating point arithmetics on FPGA is discussed
in [5]. Although the paper has some interesting figures about the area
versus pipeline depth tradeoff, their design seems to be a bit too general
to utilize the full potential of the FPGA. As an example, to reach a clock
frequency of 250 MHz for the adder they have to use 19 pipeline stages
on a Virtex2Pro speed grade -7.
To be fully IEEE 754 compliant the floating point unit needs to sup-
port denormalized numbers, either by raising exceptions and letting a pro-
cessor deal with these uncommon numbers or by having direct support for
denormalized numbers in hardware. For a good discussion on different
strategies to handle denormalized numbers see [6]. Although it is a good
general discussion the paper does not cover any FPGA specific details.
An interesting approach to tailoring floating point computations to FPGAs
is to use a radix higher than 2, since this maps better to the
FPGA fabric; this is described in more detail in [7]. This approach makes
it harder to achieve full IEEE 754 compliance, though.
Full IEEE 754 compliance requires the FPU to support round toward
nearest even, round toward −∞, round toward +∞, and round toward
zero. A more detailed discussion about rounding is presented in [8]. That

228
paper however, does not deal with any FPGA specific implementations.
A system for configuring and building floating point accelerators is
presented in [9], where the target device used is a Stratix (speed grade
5) FPGA, which is roughly comparable to the Virtex-2 Pro (speed grade
6) FPGA. Given this, it is clear that their work does not
come close to the performance presented in this article. For example,
their floating point adder has a latency of 5 cycles and runs at a clock
frequency of 77 MHz.
The Arénaire project [10] has published a configurable floating point
library (including elementary functions as well as addition and multipli-
cation); the latency can also be parameterized. Their modules can only
perform round to nearest and they report a clock frequency of 100 MHz
in a Virtex-II (XC2V1000-4) using the fully pipelined modules. While not
reaching the same performance as our solution, the project is still inter-
esting as it is not focused solely on basic operators such as addition and
multiplication.
The only work presented so far with performance comparable to our
results is the set of commercial IP cores from e.g. Nallatech [11] and Xilinx [12].
But neither of these companies publish the low level techniques used in
their IP cores.
It is possible to design IEEE 754 single precision floating point arith-
metic that can run at a clock frequency of 400 MHz in a XC4VSX55-10
according to [13]. But that work targets FFT and can as such be more
aggressively optimized; it can therefore not be directly compared to our
work. The authors do not report any latency figures for their computa-
tion units either.
A quantitative performance comparison between our work and oth-
ers will be given in section 6.

3 Floating Point Numbers and Computing


A floating point number consists of a mantissa (M) and an exponent (e)
as shown in Equation (1). One way to represent the sign of the mantissa

is to use a two’s-complement representation. Another common approach
is to use a sign magnitude representation where a sign bit (S) decides the
sign and the mantissa holds the magnitude of the number. The sign of the
exponent must also be represented; a common approach to this is to store
the exponent in an excess representation. In the excess representation
the exponent is stored as a positive number from which a constant is
subtracted to form the final exponent.
Since the mantissa in a normalized binary floating point number, us-
ing the sign bit representation, will always have a single one in the MSB
position, this bit is normally not stored together with the floating point
number. This bit is referred to as an implicit one. IEEE 754, the stan-
dard for floating point numbers [14], dictates the format presented in
Equation (2). The IEEE 754 single precision format is 32 bit wide and
uses a 23 bit fraction, an eight bit exponent represented using excess 127,
and one bit is used as a sign bit. The value zero is represented by setting
all bits to zero.

x = M · 2^e                          (1)
x = (−1)^S · 1.M · 2^(e−excess)      (2)
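The IEEE 754 single precision layout described above can be illustrated by unpacking a float into its fields. This Python sketch uses the standard struct module; it is an illustration of the bit layout, not part of the hardware design:

```python
import struct

def decode_ieee754_single(x):
    # Unpack a float into the IEEE 754 single precision fields:
    # sign (1 bit), biased exponent (8 bits, excess-127), and
    # fraction (23 bits, with the implicit one not stored).
    bits = struct.unpack('>I', struct.pack('>f', x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF
    fraction = bits & 0x7FFFFF
    return sign, exponent, fraction
```

For example, -2.5 = (−1)^1 · 1.01₂ · 2^(128−127), so its fraction field holds 0.25 in the two most significant fraction bits.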

Truncation has to be performed after a floating point operation to en-
sure that the end result has the correct number of bits. In order to im-
prove the accuracy of the result, rounding is performed before trunca-
tion. IEEE 754 specifies that the result after computation and rounding
shall be equal to the result reached if the computation had been done
with infinite precision. To conform to the IEEE 754 standard it must be
possible to choose between four rounding modes; round toward zero,
round toward +∞, round toward −∞, and round toward nearest even.
The round to nearest even is the hardest to implement and our work has
thus focused on that rounding mode. The requirement that the result
after rounding shall be equal to the result of an infinite precision oper-
ation, can seem hard to meet for the addition/subtraction operation. It
can however be solved by adding at most three additional bits after the

original LSB of the mantissa. Two of the bits, often called the guard (g)
and round (r) bit, are a buffer for bits to be shifted into and the third bit
is called the sticky bit (s) and is an or-operation of all bits shifted into
and to the right of the s bit. The new fractional number to be used in the
computations will take the form of Equation (3), where g, r, and s are the
bits described above, and the m's are the original mantissa bits.

f_new = 1.m · · · m g r s            (3)

To achieve the correct result after multiplication and rounding, no
extra bits need to be added to the mantissas before the multiplication.
After the multiplication, however, two extra bits are needed for the round
operation: the round (r) and the sticky (s) bit. The r bit is the bit to the
right of the LSB in the mantissa and the s bit is the or:ed result of all bits
to the right of the r bit. This is exemplified in Equations (4)–(13), where
two numbers with a mantissa of three bits are multiplied. m_ξ and i_ξ
represent individual bits, Equation (7) represents the multiplied result,
and Equations (8) and (11) represent two ways to create a new number
from the multiplied result. After the multiplication the final mantissa (f),
r, and s are selected according to Equation (14).

a = 1.m_a1 m_a2 m_a3                      (4)
b = 1.m_b1 m_b2 m_b3                      (5)
c = a · b                                 (6)
c = i_1 i_0 . m_1 m_2 m_3 m_4 m_5 m_6     (7)
1.f_0 = 1.m_1 m_2 m_3 r_0 s_0             (8)
r_0 = m_4                                 (9)
s_0 = m_5 ∨ m_6                           (10)
1.f_1 = 1.i_0 m_1 m_2 r_1 s_1             (11)
r_1 = m_3                                 (12)
s_1 = m_4 ∨ m_5 ∨ m_6                     (13)

{f, r, s} = {f_0, r_0, s_0}  if i_1 = 0
            {f_1, r_1, s_1}  if i_1 = 1    (14)
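The selection of the final mantissa and round/sticky bits, followed by round to nearest even, can be sketched for the three-bit example above. Mantissas are represented as integers with the implicit one included; round overflow (a carry out of the rounded mantissa) is not handled here, as it is treated separately in the implementation sections:

```python
def mul_mantissas_rne(ma, mb, frac_bits=3):
    # ma, mb: mantissas including the implicit one, as integers with
    # frac_bits fractional bits (1.m1 m2 m3 -> 0b1m1m2m3).
    prod = ma * mb                     # 2*frac_bits fractional bits, i1 i0 integer bits
    i1 = prod >> (2 * frac_bits + 1)   # second integer bit set -> renormalize
    shift = frac_bits + i1
    r = (prod >> (shift - 1)) & 1                    # round bit
    s = 1 if prod & ((1 << (shift - 1)) - 1) else 0  # sticky: OR of all lower bits
    f = prod >> shift                  # truncated mantissa (implicit one kept)
    if r and (s or (f & 1)):           # round to nearest even
        f += 1
    return f, i1                       # i1 is the exponent adjustment
```

For instance, 1.010₂ × 1.010₂ (1.25 × 1.25 = 1.5625) is exactly halfway between two representable mantissas, and the tie is broken toward the even LSB.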

Our floating point format is similar to IEEE 754 [14]. An implicit one
is used and the exponent is excess-represented. However, we do not
handle denormalized numbers, nor do we honor NaN or Inf. The rea-
son for excluding denormalized numbers is the large overhead in tak-
ing care of these numbers, especially for the multiplier. These are
commonly excluded from high performance systems, e.g. the CELL pro-
cessor does not use denormalized numbers for the single precision for-
mat in its SPUs [15].

4 Methodology

As a reference for the RTL code we implemented a C++ library for float-
ing point numbers. The number of bits in the mantissa and exponents
could be configured from 1 to 30 bits. The C++ model was later used to
generate the test vectors for the RTL test benches. Using a mantissa width
of 23 and an exponent width of 8 the C++ model was tested against the
floating point implementation used in the development PC. The only dif-
ferences occurred due to the lack of support for denormalized numbers,
Inf, and NaN.
Initial RTL code was written in Verilog, adhering to the C++ model.
The performance of the initial RTL model was evaluated and the most
critical parts of the design were optimized to better fit the FPGA. This
was repeated until the performance was satisfactory and no bugs were
discovered by the test benches.

5 Implementation
We chose to implement the most commonly used operations: addition,
subtraction, and multiplication. In order to test these components in a
realistic environment we constructed a complex radix-2 butterfly kernel
using our components.
Our implementation always uses the round to nearest even mode.
Since this is the mode requiring the most extra hardware to implement,
implementing the other modes should not significantly affect the per-
formance or resource utilization of the circuits. See section 7 for further
information about the other rounding modes.
We have tested the floating point units on an FPGA from the Virtex-4
family (Virtex-4 SX35-10). For further details about the Virtex-4 FPGA,
see the Virtex-4 User Guide [16]. The Virtex-4 contains a number of
blocks targeted at DSP computations; these blocks are called DSP48 blocks
and are thoroughly described in the XtremeDSP user guide [17]. Xilinx'
ISE 9.1i was used to synthesize, place, and route the design.

5.1 Multiplier
A floating point multiplier is conceptually easy to construct. The new
mantissa is formed as a multiplication of the old mantissas. In order
to construct a good multiplier some FPGA specific optimizations were
needed. The 24×24 bit multiplier for the mantissa is constructed using
four of the Virtex-4’s DSP48 blocks to form a 35×35 bit multiplier with a
latency of five clock cycles. For a thorough explanation of how to con-
struct such a multiplier the reader is referred to [17]. The new exponent
is calculated with a simple addition and the new sign is computed as an
exclusive-or of the two original signs. The result of the multiplication
has to be normalized, which is a simple operation since the most signifi-
cant bit of the mantissa can only be located at one out of two bit positions
given normalized inputs to the multiplier. The exponent is adjusted ac-
cordingly in an additional adder. The final stage is the round operation,
requiring an additional adder. The rounding operation can result in a

Figure 1: The floating point multiplier architecture

number that has to be renormalized; this is henceforth called a round
overflow. A round overflow occurs if the mantissa consists of all ones
and the rounding results in an addition of one to the least significant bit
of the mantissa.
It is, however, easy to avoid yet another normalization stage since the
mantissa will always be zero if a round overflow has occurred. The ex-
ponent still needs to be adjusted in this case; this can be handled by com-
puting both the exponent sum and the exponent sum plus one in parallel,
choosing the correct exponent in the end. The resulting multiplier archi-
tecture is shown in Figure 1.
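The parallel exponent computation can be sketched as follows; where the excess-127 bias subtraction happens in the pipeline is an assumption, not something the text specifies:

```python
def mult_exponents(e_a, e_b, excess=127):
    # Compute both candidate result exponents in parallel; the final
    # stage selects exp + 1 when normalization or a round overflow
    # requires it, avoiding an extra adder on the critical path.
    exp = e_a + e_b - excess
    return exp, exp + 1
```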

5.2 Adder/Subtracter

A floating point adder/subtracter is more complicated than a floating
point multiplier. The basic adder architecture is shown in Figure 2. The
first step compares the operands and swaps them if necessary so that the
number with the smallest magnitude enters the path with the alignment
shifter. If the input operands are non-zero, the implicit one is also added
in this first step. In the next step, the smallest number is right shifted by
the exponent difference so that the exponents of both operands match.
After this step, an addition or subtraction of the two mantissas is per-
formed, depending on the sign bit and type of operation. A subtraction
can never cause a negative result because of the earlier comparison and
swap step.
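The compare/swap and align steps can be sketched behaviorally as follows. Operands are modeled as (exponent, mantissa) pairs with the implicit one already added; sticky bit generation is left out here, as it is described separately below:

```python
def swap_and_align(op_a, op_b):
    # Route the smaller-magnitude operand to the alignment shifter,
    # then right-shift it by the exponent difference so that both
    # operands share the larger exponent.
    ea, ma = op_a
    eb, mb = op_b
    if (ea, ma) < (eb, mb):            # compare magnitudes: exponent first
        ea, ma, eb, mb = eb, mb, ea, ma
    return ea, ma, mb >> (ea - eb)     # aligned smaller mantissa
```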
The sticky bit generation causes problems when aligning a value be-
fore the addition. The sticky bit is computed as an or-operation of all
bits shifted out to the right. To avoid a large shifter, the sticky bit is gen-
erated in parallel with the shift and then concatenated to the end of the
shifted result. In order to achieve timing closure for this operation is was
necessary to split the sticky bit generation into two steps. The first step
is performed on both mantissas, the SbP boxes in Figure 2. This step
simply generates a vector, the sticky_bit_prep vector, where each bit is the
or:ed result of four consecutive bits, i.e. bit 0 in the sticky_bit_prep vector
is the or:ed result of bits 0–3 in the mantissa, bit 1 in the sticky_bit_prep
vector is the or:ed result of bits 4–7 in the mantissa, and so on. The choice
to operate on the bits in groups of four was made since it maps well to the
four input look-up table (LUT) architecture of the Virtex-4. The correct
sticky_bit_prep vector is chosen when the result of the compare operation
is known. The sticky bit is in the end generated with the help of the
sticky_bit_prep vector and the mantissa to be shifted in the SbG box in
Figure 2.
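The two-step sticky bit generation can be modeled as follows. The mantissa width and the exact wiring are assumptions; the point is the group-of-four or-reduction (SbP) that matches the 4-input LUTs, followed by the final sticky generation (SbG):

```python
def sticky_prep(mantissa, width=28):
    # Step 1 (SbP): or-reduce each group of four consecutive bits;
    # one group maps onto one 4-input LUT.
    prep = 0
    for i in range(width // 4):
        group = (mantissa >> (4 * i)) & 0xF
        prep |= (1 if group else 0) << i
    return prep

def sticky_bit(mantissa, prep, shift):
    # Step 2 (SbG): the sticky bit is the OR of all bits shifted out.
    # Whole groups of four are checked via the prep vector; the
    # remaining 0-3 bits are checked directly in the mantissa.
    whole_groups = shift // 4
    sticky = 1 if prep & ((1 << whole_groups) - 1) else 0
    rest = shift % 4
    if mantissa & (((1 << rest) - 1) << (4 * whole_groups)):
        sticky = 1
    return sticky
```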
The actual addition operation causes no architectural problems. The
only problem here is to ensure that the addition and subtraction opera-
tion is performed using a single adder in the FPGA.

Figure 2: The overall adder architecture

The normalization step is the most complicated step and is known
to be a major bottleneck in floating point computations; this is also true
for our design. The normalization is implemented using three pipeline
stages. Figure 3 depicts the architecture of the normalizer. The following
is done in each pipeline stage:

1. The mantissa is processed in parallel in a number of modules, each
looking at four bits of the mantissa. The first module operates on
the first four bits and outputs a normalized result assuming a one
was found in these bits. An extra output signal, shown as dotted
lines in Figure 3, is used to signal if all four bits were zero. The
second module assumes that the first four bits were all zero and
instead operates on the next four bits, outputting a normalized re-
sult. This is repeated for the remaining bits of the mantissa. Each
module also generates a value needed to correct the exponent; this
is marked with gray dotted lines in Figure 3.

2. One of the previous results, both mantissa and exponent offset value,
is selected to be the final output. If all bits were zero, a zero is gen-
erated as the final result.

3. The mantissa is simply delayed to synchronize with the exponent.
The exponent is corrected with the offset selected in the previous
stage.
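The parallel find-first-one modules and the priority selection can be modeled behaviorally as below. The 28-bit width is an assumption, and the model collapses the three pipeline stages into one function:

```python
def normalize(mantissa, width=28):
    # Each 4-bit group (one parallel module) assumes the leading one
    # lies in its group and produces a shifted result plus an
    # all-zero flag; the priority decoder (modeled by loop order)
    # selects the first non-zero group's result and exponent offset.
    for group in range(width // 4):
        offset = 4 * group
        chunk = (mantissa << offset) >> (width - 4) & 0xF
        if chunk:
            lead = 4 - chunk.bit_length()    # extra shift inside the group
            shift = offset + lead
            return (mantissa << shift) & ((1 << width) - 1), shift
    return 0, 0  # all bits zero: generate a zero result
```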

Our normalization uses a rather hardware expensive approach. A
less expensive architecture could be used if a deeper pipeline is accept-
able. The modules in the first stage of the normalizer analyze four bits
each, since this maps well to the four input LUTs of the Virtex-4.
After normalization the result has to be rounded and truncated. The
round operation can in some cases result in a round overflow. When
this happens, the resulting mantissa will consist of only zeros and the
exponent will be the same or one larger than the largest input exponent,
depending on whether a subtraction or an addition is performed.
An additional exponent is thus precomputed after the swap operation
is completed. This exponent can finally be selected at the end of the
exponent pipeline if a round overflow has occurred. In order to meet
timing and not introduce an additional pipeline stage after the rounding
circuit (when it is known whether a round overflow occurred or not), the
round overflow bit is predicted in the previous pipeline stage, seen as the
ROP box in Figure 2.

Figure 3: The normalizer architecture

The final post processing stage is used to force the outputs to zero
if needed. The mantissa is forced to zero if the overall result is zero, in
case of an underflow, or in case of a round overflow. The exponent and
sign bit are forced to zero if the overall result is zero or an underflow has
occurred.

5.3 Low Level Optimizations


To achieve the performance presented in this article we had to use FPGA
specific optimizations. One optimization was to make sure that the adder/
subtracter was implemented using only one LUT per bit. Figure 4 shows
a bit cell of the optimized adder. An additional input signal is used to
zero out the mantissa from the pre-alignment step, marked with 1 in Fig-
ure 2. This is done so that the shifter in the align step only has to consider
the five least significant bits in the exponent difference, marked with 2 in
Figure 2. If one of the more significant bits is one, the mantissa should be
shifted so much that all its bits become zeroes. This is handled by the Set
to zero signal in Figure 4. The least significant bit, however, is always used
as it is, since it is the sticky bit and is needed to ensure that the result
after the rounding operation is the same as if the addition operation had
been done with infinite precision.
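The per-bit LUT function of Figure 4 can be sketched word-wide as follows; the 28-bit width is an assumption, and the per-bit LUT plus carry chain are collapsed into one integer addition:

```python
def add_sub_word(a, b, sub, set_to_zero, width=28):
    # Each bit's LUT computes a XOR (b XOR sub), with an extra input
    # that forces the aligned operand to zero; the dedicated carry
    # chain completes the addition. sub = 1 selects subtraction:
    # b is inverted and the +1 comes from the carry-in.
    mask = (1 << width) - 1
    b_eff = 0 if set_to_zero else b
    if sub:
        b_eff ^= mask  # one's complement; carry-in supplies the +1
    return (a + b_eff + sub) & mask
```

Note that with set_to_zero asserted, a subtraction still yields a - 0 = a, matching the intent of zeroing out a fully shifted-away mantissa.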

Figure 4: Combined adder and subtracter

6 Results
We have focused much of our measurements and comparisons on the adder
since it is the bottleneck module in our current design. The clock frequen-
cies reported by us assume a clock with no jitter. Xilinx' place and route
tool was used to determine the maximum clock frequency by changing
the timing constraints until timing closure could not be achieved. The
clock frequencies reported by us are the maximum frequencies for which
timing closure occurred, rounded down to the nearest integer.
Table 1 lists various performance metrics over different devices and
speed grades.

                        Clock frequency (MHz)
Unit         Latency    XC4VSX-10   XC4VSX-11   XC4VSX-12
Adder        8          290         302         377
Multiplier   8          330         383         440

Unit         Latency    XC5VLX-1    XC5VLX-2    XC5VLX-3
Adder        8          317         366         419
Multiplier   8          362         440         500

Table 1: Performance in various devices.

Table 2 compares the performance of the 23 bit format floating point
adder using the best speed grades from a number of FPGA families from
Xilinx.

XC5VLX-3    419 MHz      XC2VP-6   234 MHz
XC4VSX-12   377 MHz      XC3SE-5   199 MHz
XC2VP-7     278 MHz      XC3S-5    176 MHz

Table 2: Family comparison

To get an idea of where resources are consumed in our implementa-
tion, Table 3 lists the resource utilization, for a Virtex 4, of the steps in
Figure 2. To avoid the extra delays associated with the FPGA I/O pins,
two extra pipeline stages before and one stage after were inserted into
the top module. These extra flip-flops are not included in the resource
utilization metrics.
Tables 4, 5, 6, and 7 compare our results (DA) against some other
publications.
Please note that the figures are obtained from units with slightly dif-
ferent features in terms of IEEE 754 compliance. The number of LUTs in
the case of the Nallatech unit is an estimate since Nallatech only pub-
lishes data for how many slices their design occupies. The number of LUTs
is in this case estimated to be twice as many as the slices since there are
two LUTs per slice in a Virtex-II. Although the comparisons here are not
completely fair, they still give a good picture of how the performance of
our floating point units compares to other FPGA implementations.

                  LUT   FF
Compare/Select    111    22
Align             134    66
Add                36    29
Normalization     436   191
Round               8     0
Other             121   121
Total             846   429

Table 3: Adder resource utilization in Virtex 4

Device: XC2VP-7
                        USC    DA
Pipeline depth          19     8
LUTs                    548    760
FFs                     801    516
Clock frequency (MHz)   250    278

Table 4: Comparison with USC adder [5] in Virtex-II.

7 Discussion and Future Work


There are a number of optimization possibilities left in this design. For
example, instead of using CLBs for the shifting, a multiplier could be
used for this task by presenting the number to be shifted as one operand
and a bit vector with a single one in a suitable position as the other
operand.

Device: XC2VP-6
                        Nallatech   DA
Pipeline depth          14          8
LUTs                    < 580*      758
FFs                     ?           517
Clock frequency (MHz)   184         278

Table 5: Comparison with Nallatech adder [11] in Virtex-II.
* Value estimated from number of slices

Device: XC4VSX-10
                        Adder           Multiplier
                        Xilinx   DA     Xilinx   DA
Pipeline depth          13       8      11       8
LUTs                    578      846    116      173
FFs                     594      429    235      150
DSP48                   —        —      5        4
Clock frequency (MHz)   368      290    391      331

Table 6: Comparison with Xilinx adder and multiplier in Virtex-4 [12].

Device: XC5VLX-1
                        Adder           Multiplier
                        Xilinx   DA     Xilinx   DA
Pipeline depth          12       8      9        8
LUTs                    429      675    88       189
FFs                     561      424    117      154
DSP48E                  —        —      3        4
Clock frequency (MHz)   395      317    450      362

Table 7: Comparison with Xilinx adder and multiplier in Virtex-5 [12].

If the application of the floating point blocks is known, it is possible
to do some application specific optimizations. For example, in a butterfly
with an adder and a subtracter operating on the same operands, the first
compare stage could be shared between these. If the application can tol-
erate it, further pipelining could increase the performance significantly.
If the latency tolerance is very high, bit serial arithmetic could probably
be used as well. In this project we tried to achieve a high throughput
while still maintaining low latency. In the end, the latency tolerated for
any unit depends on the application. Whether our results are better than
others with a deeper but faster pipeline, or whether the resource utiliza-
tion is acceptable, cannot be answered without knowing the target appli-
cation.
It would also be interesting to take a closer look at the Virtex-5 FPGA.
The six input LUT architecture should reduce the number of logic levels
and routing all over the design. As an example, one could investigate
if the parallel shifting modules in the normalizer should take six bits as
input since it could map well to the six input LUT architecture of the
Virtex-5 or if the fact that a 4-to-1 mux can be constructed in a six input
LUT still favors the current four bits per module architecture. The
numbers presented for the Virtex-5 in this paper were produced using the
same design as for the Virtex-4.
Our current implementation is not fully IEEE 754 compliant as it can-
not handle Inf, NaN, denormalized numbers and some of the rounding
modes. We estimate that no extra pipeline stage is needed for the miss-
ing rounding modes, i.e. round toward zero, round toward +∞, and
round toward −∞. These can be implemented by adding a few LUTs to
a non-critical path.
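The decision logic for these three directed rounding modes can be sketched as follows (a hypothetical Python model of ours, which ignores mantissa overflow into the exponent since the existing rounding path already handles that):

```python
def round_mantissa(sign, mant, rest_nonzero, mode):
    """Directed rounding of a truncated mantissa.

    sign:         True if the number is negative
    mant:         mantissa already truncated to the target width
    rest_nonzero: True if any discarded bit was one
    mode:         'zero', 'up' (toward +inf) or 'down' (toward -inf)
    """
    if mode == "zero":            # round toward zero: plain truncation
        return mant
    if mode == "up":              # round toward +inf
        return mant + 1 if (not sign and rest_nonzero) else mant
    if mode == "down":            # round toward -inf
        return mant + 1 if (sign and rest_nonzero) else mant
    raise ValueError(mode)
```

Since each case only inspects the sign and whether any discarded bit was one, the logic maps to a few LUTs on a non-critical path.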
Inf and NaN can be handled by a parallel data path that will check for
and generate substitute values. This will cost an extra mux in the end to
choose the right value for the outputs and this extra mux will probably
require an extra pipeline stage to avoid performance degradation. De-
normalized numbers can be handled by raising an exception and letting
a processor deal with the situation. Detection of denormalized numbers
can be done in parallel with the computation and shouldn’t require much

extra hardware or any extra pipeline stage.

8 Conclusion
We have shown that it is possible to achieve good floating point perfor-
mance with low latency in modern FPGAs. Our adder and multiplier
can operate at a clock frequency of 377 MHz and 440 MHz respectively
in a Virtex 4 (speed grade -12). We have also disclosed the techniques
required for achieving the results reported.
To make maximal use of an FPGA it is important to take into ac-
count the specific architecture of the targeted FPGA. We have shown
techniques for how to do this when dealing with floating point opera-
tions. One of the most important optimizations we did was to perform
the normalization in a parallel fashion. The parallel normalization
approach proved to be efficient since it reduced the number of pipeline
stages needed to perform the normalization operation.

References
[1] Rick Mosher. FPGA Prototyping to Structured ASIC Produc-
tion to Reduce Cost, Risk & TTM. http://www.us.design-
reuse.com/articles/13550/fpga-prototyping-to-structured-asic-
production-to-reduce-cost-risk-ttm.html.

[2] Katherine Compton and Scott Hauck. Reconfigurable computing: a
survey of systems and software. ACM Comput. Surv., 34(2):171–210,
June 2002.

[3] Peter Messner and Ralph Bodenner. Accelerating scientific applica-
tion using FPGAs. Xcell, (57):70–73, 2006.

[4] P. Karlstrom, A. Ehliar, and D. Liu. High performance, low latency
FPGA based floating point adder and multiplier units in a Virtex 4. In
Norchip Conference, 2006. 24th, pages 31–34, 2006.

[5] Gokul Govindu, L. Zhuo, S. Choi, and V. Prasanna. Analysis of
high-performance floating-point arithmetic on FPGAs. In Parallel
and Distributed Processing Symposium, 2004. Proceedings. 18th Inter-
national, pages 149+, 2004.

[6] E. M. Schwarz, M. Schmookler, and S. D. Trong. Hardware imple-
mentations of denormalized numbers. In Computer Arithmetic, 2003.
Proceedings. 16th IEEE Symposium on, pages 70–78, 2003.

[7] Bryan Catanzaro and Brent Nelson. Higher radix floating-point rep-
resentations for FPGA-based arithmetic. In FCCM ’05: Proceedings of
the 13th Annual IEEE Symposium on Field-Programmable Custom Com-
puting Machines (FCCM’05), pages 161–170, Washington, DC, USA,
2005. IEEE Computer Society.

[8] M. R. Santoro, G. Bewick, and M. A. Horowitz. Rounding algo-
rithms for IEEE multipliers. In Computer Arithmetic, 1989., Proceedings
of 9th Symposium on, pages 176–183, 1989.

[9] C. Brunelli, F. Garzia, J. Nurmi, C. Mucci, F. Campi, and D. Rossi.
An FPGA implementation of an open-source floating-point computa-
tion system. In System-on-Chip, 2005. Proceedings. 2005 International
Symposium on, pages 29–32, 2005.

[10] J. Detrey and F. de Dinechin. A parameterized floating-point expo-
nential function for FPGAs. In Field-Programmable Technology, 2005.
Proceedings. 2005 IEEE International Conference on, pages 27–34, 2005.

[11] Nallatech. Nallatech Floating Point Cores. Nallatech,
www.nallatech.com, 2002.

[12] Xilinx. Floating-Point Operator v3.0. Xilinx, www.xilinx.com, 3.0 edi-
tion, September 2006.

[13] Ray Andraka. Supercharge your DSP with ultra-fast floating-point
FFTs. DSP magazine, (3):42–44, 2007.

[14] IEEE. IEEE standard for binary floating-point arithmetic. Technical
report, 1985.

[15] Hwa-Joon Oh, S. M. Mueller, C. Jacobi, K. D. Tran, S. R. Cottier,
B. W. Michael, H. Nishikawa, Y. Totsuka, T. Namatame, N. Yano,
T. Machida, and S. H. Dhong. A fully pipelined single-precision
floating-point unit in the synergistic processor element of a cell pro-
cessor. Solid-State Circuits, IEEE Journal of, 41(4):759–771, 2006.

[16] Xilinx. Virtex-4 User Guide. Xilinx, www.xilinx.com, 2.3 edition, Au-
gust 2007.

[17] Xilinx. XtremeDSP for Virtex-4 FPGAs User Guide. Xilinx,
www.xilinx.com, 2.5 edition, June 2007.

Paper VI

An ASIC Perspective on High
Performance FPGA Design

Andreas Ehliar and Dake Liu

Department of Electrical Engineering
Linköping University
Sweden
email: {ehliar,dake}@isy.liu.se
Submitted for possible publication to Field Programmable Logic and
Applications, 2009
This paper has been reformatted from double column to single column format for ease of readability.
Abstract

In this paper we discuss how various design components perform in
both FPGAs and standard cell based ASICs. We also investigate how
various common FPGA optimizations will affect the performance and
area of an ASIC port. We find that most techniques that are used to
optimize a design for an FPGA will not have a negative impact on the
area in an ASIC. The intended audience for this paper is engineers charged
with creating designs or IP cores that are optimized for both FPGAs and
ASICs.

1 Introduction

FPGAs are becoming more and more common and are used in both high
and low-end systems. In some cases it is easy to meet the performance
and area goals using non-optimized generic HDL code. This is not true
as often as designers would wish and various FPGA specific tricks are
often required to either meet timing or fit the design into the selected
FPGA. If the design is intended for a high volume ASIC product where
the FPGA version is only used for prototyping it is probably not a big
problem since such a design does not usually have to be optimized for
the FPGA.
However, when the design is intended for high volume production
using FPGAs and a future ASIC port if the FPGA based product is suc-
cessful, the ease of ASIC portability is very important indeed. This is the
scenario which the rest of this paper will investigate.
There are two parts in this paper. The first part examines common
components like adders, multiplexers, and memories and compares the
performance of these components in ASICs and FPGAs. The second part
of this paper examines a variety of different FPGA optimizations and
their impact on an ASIC port.

2 Related work
It is surprisingly hard to find information about porting FPGA designs
to ASICs, especially when considering the impact of FPGA optimization
strategies. A brief overview of how to do an ASIC port of an FPGA de-
sign is given in [1]. Some general guidelines on how to port an ASIC de-
sign to an FPGA is available from for example Xilinx [2] and Altera [3].
While the FPGA vendors would of course like us to port ASIC designs
to FPGAs, most of the advice that is given in these references are also
applicable when creating an FPGA design which will be migrated to an
ASIC.
An interesting comparison of the performance difference between
ASICs and FPGAs is given in [4] where the performance, area, and power
consumption of a 90 nm ASIC and a 90 nm FPGA are measured. It is
unfortunate that the benchmarks selected by the authors of that paper do not
seem to include designs that are specifically targeted and optimized for
FPGAs. There are also some publications that discuss structured ASICs
and similar solutions and how to port an FPGA design to such
products, for example [5], [6], and [7]. The relatively fixed structure
of these solutions means that not all of the information is applicable to a
true standard cell based ASIC port.

3 Methods
It is not our intention to crown the fastest or most area efficient FPGA
in this paper. Therefore we have decided to use performance and area
cost numbers that are relative to the performance and area cost of a 32-bit
adder in the selected technology. Another reason to use relative numbers
instead of absolute number is to protect proprietary information like the
exact size and performance of ASIC memory blocks.
One problem of this kind of comparison is that it is not really clear
what area means in an FPGA design since it can contain components
like block rams and DSP blocks in addition to LUTs and flip-flops. One

way to measure this would be to simply measure the silicon area of
the various components in the FPGA, which is basically what was done
in [4]. While this comparison is very interesting from an academic point
of view, it is not very useful to a VLSI designer (unless he is working for
an FPGA manufacturer).

3.1 FPGA Area Cost


We propose another metric for the area cost which we will use in this
paper. First of all, we assume that flip-flops and LUTs will be packed
as tightly as possible into slices. Secondly, we set the slice cost for a
block RAM as the total number of slices in the device divided by the
total number of block RAMs in the device. If a device have block RAMs
of several sizes (like in Stratix III), the large block RAM is measured as if
it was several small block RAMs.
The slice cost for a DSP block is derived similarly except that we also
have to take into account whether only part of the block is used. For
example, a DSP block in the Stratix III can be divided into independent
9, 12, 18, and 36-bit multiplier blocks where the smaller multipliers are
used to create larger multipliers.
The advantage is that this metric is easily calculated for all FPGAs
and it is easily understandable for a VLSI designer. In addition, assuming
that the designer values all features of a certain FPGA equally (or at least
roughly equally) this area cost metric will be a good indication of the
monetary value of a certain kind of design element. As we realize that
this metric could be controversial to some readers we will therefore also
note whenever an area cost figure is based upon this kind of conversion.
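As a worked example of the proposed metric, with hypothetical device figures (not taken from any datasheet):

```python
def slice_cost(total_slices, total_blocks, blocks_used):
    """Slice-equivalent area cost of hard blocks: the total number of
    slices in the device divided by the total number of blocks of that
    kind, times the number of blocks the design uses."""
    return total_slices / total_blocks * blocks_used

# A hypothetical device with 8000 slices and 40 block RAMs values each
# block RAM at 200 slices, so a design using 3 of them is charged:
assert slice_cost(8000, 40, 3) == 600.0
```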

3.2 Design Flow and Tools


The tools used to synthesize the FPGA designs in this paper were ISE 10.1
and Quartus II 8.1 for Xilinx and Altera respectively. To find the maxi-
mum performance (Fmax ) of the Xilinx designs the timing constraints
were increased until timing could no longer be met and then the fastest

time reported was selected. To find the Fmax of Altera designs, the tim-
ing were simply over constrained by setting the required frequency to 1
GHz. This approach is also recommended by Altera in [8].
Synopsys Design Compiler (A-2007.12-SP5) was used for ASIC syn-
thesis and Cadence SoC Encounter (v5.20) was used for ASIC place and
route. The selected ASIC technology is a standard cell based 130 nm
technology based on the relatively low NRE costs (this choice was made
consistent with the scenario outlined above where an FPGA based prod-
uct is ported to an ASIC for cost reasons). The timing analysis is based
upon worst case parameters.

4 Performance and area cost of important components
This section contains a comparison of the relative cost of various com-
mon constructs in an FPGA and in a 130nm ASIC process. Table 1 shows
an overview of the relative costs of selected components. The Spartan
3A (xc3s700a), Virtex 5 (xc5vlx85), Cyclone 3 (EP3C40), and Stratix 3
(EP3SL150) (labeled FPGA 1, 2, 3, and 4 in Table 1) were used in this
comparison. In all comparisons, the fastest speedgrade was used (al-
though the speedgrade shouldn’t matter for the relative numbers). As a
side note, this table alone should show how futile it is to try to estimate
the gate count in an ASIC by counting LUTs in an FPGA design.

4.1 Adders
Adders (and subtracters) are probably one of the most common compo-
nents in any digital design. It is also a component which the architecture
of most FPGAs are optimized for by the use of dedicated carry-chains.
For this reason an adder in an FPGA tends to be a pretty simple compo-
nent which is usually using one LUT per bit and there is little reason to
deviate from this template (except possibly for pipelining of very large
adders and using bit-serial adders for non-performance critical tasks).

Table 1: Relative area and performance of common components
Design Relative area cost (lower is better) Relative Fmax (higher is better)
(Note that all designs FPGAs ASIC4 FPGAs ASIC4
have registered outputs) 1 2 3 4 (130 nm) 1 2 3 4 (130 nm)
32-bit adder 1.0 1.0 1.0 1.0 1.0 ( 0.21) 1.0 1.0 1.0 1.0 1.0 ( 0.11)
32-bit adder/subtracter 1.0 1.0 2.0 1.0 1.9 ( 0.25) 0.97 0.90 0.82 0.81 0.89 ( 0.13)
32-bit 3 operand adder 1.9 1.9 2.0 1.0 1.3 ( 0.40) 0.86 0.82 0.77 0.89 0.74 ( 0.12)
32-bit 4 operand adder 2.9 2.9 3.0 2.0 1.7 ( 0.55) 0.62 0.58 0.77 0.82 0.69 ( 0.10)
32-bit 16-to-1 mux 8.0 5.0 10 5.0 0.57 ( 0.48) 1.5 0.92 1.3 1.7† 0.90 ( 0.31)
17x17 unsigned multiplier 18∗ 34∗ 10∗ 19∗ 3.7 ( 1.3) 1.3 0.64 0.81 0.61 0.44 ( 0.11)
19x19 unsigned multiplier 75∗ 35∗ 36∗ 37∗ 4.1 ( 1.6) 0.46 0.40 0.59 0.49 0.43 ( 0.10)
Plain 18x18 MAC unit 25∗ 35∗ 19∗ 26∗ 5.3 ( 2.4) 0.55 0.41 0.51 0.53 0.42 ( 0.08)
(pipelined adder) 29∗ 35∗ 21∗ 42∗ 5.4 ( 2.7) 0.79 0.82 0.77 0.62 0.49 ( 0.11)
(pipelined adder, forwarding) 25∗ 36∗ 24∗ 29∗ 4.9 ( 2.7) 0.75 0.72 0.63 0.62 0.49 ( 0.09)
2048x32 bit memory 74∗ 34∗ 79∗ 57∗ - ( 33‡ ) 1.5 0.72 0.75 0.92 - ( 0.53‡ )
RF (16x32 bit register file) 1.0 1.0 10∗ 2.1 2.7 ( 2.5) 2.1† 1.2† 0.74 1.0 0.93 ( 0.31)
RF (Ports: 1 read, 1 write) 2.0 1.0 10∗ 2.1 2.6 ( 2.5) 1.9† 1.1 0.73 1.2 0.93 ( 0.23)
RF (Ports: 2 read, 1 write) 4.0 2.0 20∗ 4.3 3.2 ( 3.0) 1.9† 1.0 0.74 1.0 0.89 ( 0.22)
RF (Ports: 4 read, 2 write) 50 40 59 21 5.8 ( 4.6) 0.97 0.66 0.90 1.1 0.91 ( 0.13)

4 Values in parentheses are from designs optimized for area ∗ Relative area cost
includes DSP or RAM blocks (See Section 3.1) † Exceeds maximum frequency of
clock net as reported in the datasheet. ‡ The ASIC memory block was only
optimized for area.

However, when using ASICs, the area of an adder can vary widely
depending on the timing constraints as seen in Table 1. It can also be
seen that an ASIC enjoys an advantage for situations which the FPGA is
not optimized for, such as multi-operand adders. (Although it is inter-
esting that the architecture of the Stratix III allows a 3-operand adder to
be created without any area penalty.)

4.2 Multiplexers
Multiplexers and similar structures are very common design components.
The performance of multiplexers in an FPGA is usually high due to the
use of specialized logic in the FPGAs such as the MUXF5-8 components
in most Xilinx FPGAs. On the other hand, the area cost for multiplexers is
very high when compared to the cost of the adders as shown in Table 1.
This means that tradeoffs that are valid in an FPGA such as avoiding
the use of crossbar based SoC interconnects may no longer be valid in
an ASIC. If a SoC system is well designed, replacing a SoC bus, such
as Wishbone or AMBA, with a crossbar may be a fast way to raise the
performance of an ASIC port without a costly redesign/reverification.

FPGA and ASIC optimization hint
Multiplexers are very expensive in an FPGA and very cheap in an ASIC.
The performance of an ASIC can sometimes be significantly enhanced at
little area cost by adding strategically placed multiplexers, such as using
crossbars instead of buses.

4.3 Multipliers and DSP blocks


If an FPGA is used which does not have any built-in multipliers, the
ASIC is clearly going to be much more resource efficient. When built-in
multipliers are added to the equation, it is possible that the performance
of an ASIC port will actually be slower than the FPGA, since the mul-
tipliers in the FPGA are well optimized and in many cases also enjoy a
technology node advantage over the ASIC based multiplier.
On the other hand, the ASIC process enjoys a huge advantage as
soon as a non-standard multiplication size is used. Going from 17×17
to 19×19 is very costly in the FPGA, whereas the area cost difference in
ASICs are low and the performance difference is negligible.
Similarly, architectures that cannot be mapped efficiently to the DSP
blocks will also gain when ported to ASICs. Take for example the multiply-
accumulate unit in Table 1 (also shown in Figure 1a). The multiplier in this
example is 16×16 bits and the 4 accumulator registers contain 48 bits.
While the multiplier fits the DSP48E block of the Virtex-5 very well, the
design will have suboptimal performance as the accumulation register in
the DSP48E block cannot be used. An alternative is shown in Figure 1b
where the adder of the MAC unit has been pipelined. This architecture is
only limited by the performance of the DSP48E block. The drawback of
this architecture is that the pipelining means that it is no longer possible
to accumulate to the same register at all times as is possible in Figure 1a.
A compromise is the solution shown in Figure 1c where result forward-
ing is used to achieve the same functionality as in Figure 1a. The perfor-
mance of this architecture is not as high as in Figure 1b, but it is substan-
tially higher than the performance of the plain MAC. On the other hand,
in the ASIC, the performance of all three options is similar. (See also the
discussion in Section 5.3.)

Figure 1: MAC units mapped to DSP48E blocks
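A cycle-level sketch (ours, not from the paper) may clarify why the forwarding variant in Figure 1c computes the same result as the plain MAC in Figure 1a, just one cycle later:

```python
def mac_plain(pairs):
    """Figure 1a style: multiply and accumulate in a single cycle."""
    acc = 0
    for a, b in pairs:
        acc += a * b
    return acc

def mac_pipelined_forwarding(pairs):
    """Figure 1c style: the product is registered for one cycle and the
    adder operand is forwarded from the accumulator output register."""
    prod_reg = 0                 # pipeline register after the multiplier
    acc = 0                      # adder output register
    for a, b in list(pairs) + [(0, 0)]:   # one extra cycle to drain
        acc += prod_reg          # adder sees the forwarded accumulator
        prod_reg = a * b         # register the new product
    return acc

data = [(1, 2), (3, 4), (5, 6)]
assert mac_plain(data) == mac_pipelined_forwarding(data) == 44
```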

FPGA and ASIC optimization hint


If the design has been specifically optimized for the DSP blocks in the FPGA,
it is likely that there will be performance problems when porting the design
to an ASIC.

4.4 Large Memories


When synthesizing a design with memory blocks for an ASIC it is nec-
essary to use specialized memories that are optimized for that particular
ASIC process. Otherwise the area and performance of the design will be
abysmal. As an example, when we synthesized a standard cell based 8
kilobyte memory the area was 10 times larger than a custom made mem-
ory block. Even though this proves that it is critical to use specialized
memories for anything but the smallest memory, there is actually sur-
prisingly little publicly available information about memory blocks for
ASICs. One datasheet which does contain both area and frequency in-

formation is available from Atmel for a 0.35 µm process [9]. According
to this datasheet, a dual port memory is roughly 60% larger than a single
port memory. However, the authors have seen memory blocks in more
recent technologies where the area difference is considerably larger than
this.
Regardless of the technology which is used, it is clear that it is much
more area expensive to use a dual port memory than a single port mem-
ory. Therefore it makes sense to avoid dual port memories in ASICs if
the same performance can be reached using single port memories. As an
example of this, FIFOs are usually implemented using dual port memo-
ries in an FPGA but a synchronous FIFO can be implemented using only
single port memories as described in for example [10]. If it is not easy to
avoid a dual port memory it is necessary to consider the cost and time re-
quired to redesign the system (if possible) and compare that against the
cost of the increased ASIC area that dual port memory usage will lead to.
For memories which contain read-only information it can also be a
very good idea to use a ROM compiler instead of an SRAM. Not only
does this avoid the problem of initialization, a ROM is also considerably
smaller than an SRAM. In [9] for example, a 1 kilobyte 8-bit wide ROM
is about 1/7 the size of a (single port) RAM with similar size.

FPGA and ASIC optimization hint


It is extremely important that memory generators are used for large mem-
ories in an ASIC. Significant area savings are also possible if some of the
memories can be created using ROM-compilers. Finally, dual port memo-
ries should not be used if it can be avoided.

4.5 Small Memories


When using small memories in an ASIC it is still possible to use special-
ized memory modules (commonly referred to as register file memories)
although this is not as critical as when using a large memory. Unless a
large amount of register files are used in the design, the increased im-
plementation and verification cost of using specialized ASIC memories
might not be worth it.

Table 2: Pipelining a design will not necessarily increase the area
Pipeline Relative area Relative Fmax
stages Spartan3 ASIC4 Spartan3 ASIC4
1 260∗ 20 (5.5) 0.38 0.30 (0.073)
2 260∗ 13 (6.1) 0.33 0.35 (0.079)
3 260 14 (6.8) 0.38 0.41 (0.10)
4 260 13 (7.0) 0.37 0.40 (0.10)
4 Values in parentheses are from designs optimized for area ∗ Relative area cost
includes DSP or RAM blocks (See Section 3.1)

As can be seen in Table 1, FPGA based designs are usually fairly effi-
cient when using small single and dual-port memories. A configuration
of two read ports and one write port is also fairly efficient. As soon as
more than one write port is used, the synthesis tools for the FPGAs are
no longer able to utilize distributed memory and have to resort to using
flip-flops with a significant area increase.

FPGA and ASIC optimization hint


Small register files with one write port are typically more area efficient in an
FPGA. If register files with more than one write port are used, they are likely
to be much more area efficient in an ASIC.

5 FPGA optimizations and their impact on an ASIC

There are many optimizations that can be done on a design to improve
the performance in an FPGA. In the end they can all be summarized as
modifying the architecture of the design to better fit a given FPGA. This
section will classify these optimizations and discuss their impact on an
ASIC.

5.1 Deep Pipelining

Perhaps the most important tool in any digital designer's toolbox is pipelin-
ing. This is even more important for an FPGA designer since flip-flops
are usually abundant in most FPGAs. Luckily pipelining is also benefi-
cial for the performance in ASICs in all but the most pathological cases.
It is not always a good idea in terms of area, although pipelining can
sometimes decrease the area of a design by enabling the use of less area
intensive circuits in for example a multiplier. As an example of how
pipelining affects the area and performance, an ASIC based 16×16 mul-
tiplier with 4 register stages was 12.6% larger and 37.9% faster
than the same multiplier with only 1 register stage. (All multipliers were
optimized for speed when we performed this experiment.) Adding a
pipeline stage is not guaranteed to increase the area though. This is seen
in Table 2 where an eight point 1D DCT pipeline has been synthesized
using different numbers of registers. The synthesis tool is clearly strug-
gling to meet timing when only one pipeline register is available.

FPGA and ASIC optimization hint


While pipelining an FPGA design will certainly not hurt the maximum
frequency of an ASIC, the area of the ASIC will often be slightly larger than
necessary, especially if the pipeline is not a part of the critical path in the
ASIC.

5.2 Utilizing Slices Efficiently

Another important task when optimizing a design for an FPGA is to
select the architecture so that it is possible to utilize the slices efficiently.
For example, in a Spartan-3, a 32-bit adder will use 32 LUTs. At the
same time it is also possible to fit a combined 32-bit adder/subtracter
or a 32-bit adder with a 2-to-1 mux in front of one of the operands us-
ing only 32 LUTs. This is exemplified in Table 3 where a 32-bit adder
in a Spartan-3 is compared with adders with extra functionality. If the
relative area of a certain Spartan-3 based design is 1.00, this means that
it is possible to combine all functionality into only one LUT / bit. The
maximum frequency is more or less the same as that of a plain adder.

Table 3: Combining an adder with other functionality
32-bit adder Relative area Relative Fmax
Spartan3 ASIC4 Spartan3 ASIC4
Plain add 1.00 1.00 (0.21) 1.00 1.00 (0.11)
One 2-to-1 mux 1.00 1.15 (0.25) 0.99 0.69 (0.14)
Two 2-to-1 mux 2.03 1.20 (0.30) 0.85 0.67 (0.14)
Two 2-input 1.97 0.82 (0.27) 0.84 0.85 (0.11)
bitwise or
Two 2-input 1.00 0.89 (0.27) 0.99 0.89 (0.11)
bitwise and
32-bit adder Relative area Relative Fmax
and subtracter Spartan3 ASIC Spartan3 ASIC
Plain add/sub 1.00 1.66 (0.25) 1.00 0.86 (0.14)
One 2-to-1 mux 2.03 1.37 (0.31) 0.82 0.69 (0.14)
Two 2-to-1 mux 2.97 1.39 (0.36) 0.85 0.64 (0.12)
4 Values in parentheses are from designs optimized for area


For the ASIC based designs, the maximum frequency is lowered in all
cases when compared to the plain adder and the area is almost always
increased. This means that if even a single adder in the design is com-
bined with extra functionality it is very unlikely that the performance of
the design will equal that of a plain adder. On the other hand, the ASIC
port will have an area advantage as soon as functionality that cannot fit
into a single LUT / bit is used. On the Spartan-3 this happens when, for
example, a 2-input bitwise or function is applied to both operands of an
adder or when a 2-to-1 mux is combined with an adder/subtracter.

FPGA and ASIC optimization hint


Careful design can allow an FPGA design to combine adders with extra
functionality without any performance or area impact. The timing budget
of an FPGA design where all adders are optimized like this can be derived
from the maximum frequency of an adder. For the ASIC port this is not
possible and the extra functionality has to be taken into account.

Table 4: Inferring and instantiating components
Design Relative area Relative Fmax
Virtex4 ASIC Virtex4 ASIC
Mux (Inferred) 4.38 0.57 1.3 0.95
(Instantiated) 4.50 0.64 1.3 1.07
Addsub (Inferred) 1.00 1.86 0.98 0.89
(Instantiated) 1.00 1.42 0.98 0.76
AU (Inferred) 1.19 1.05 0.9 0.69
(Instantiated) 1.19 1.30 0.89 0.73
MAC(Instantiated) 118∗ 16.2 1.19 0.40
(Rewritten) - 10.7 - 0.70
∗ Relative area cost includes DSP blocks (See Section 3.1)

5.3 Manual instantiation of FPGA primitives

Synthesis tools are getting better with each version but there are still
some cases where it may be necessary to instantiate slice primitives like
LUTs and flip-flops manually. One reason for doing this was mentioned
in the previous paragraph. Another reason is that the designer is not able
to get the synthesis tool to infer the desired logic. Once FPGA primitives
are manually instantiated the design is no longer directly portable to an
ASIC. It is on the other hand fairly easy to write a portability library with
synthesizable code for the FPGA primitives like lookup-tables, flip-flops,
carry chain primitives, etc. This allows such a design to be synthesized to
an ASIC with surprisingly good results. Table 4 shows the performance
of a few different constructs when inferred and instantiated. Note that
it is very important that the parts of the design that contain instantiated
LUTs are flattened before the optimization phase of the synthesis.
Otherwise the synthesis tool will hardly be able to perform any
combinational logic optimization.
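The behavioral idea behind such a compatibility library can be illustrated with a LUT model (sketched here in Python for brevity; a real library would of course be written in synthesizable HDL):

```python
def lut4(init, i3, i2, i1, i0):
    """Model of a 4-input lookup table: the 16-bit INIT value is indexed
    by the four inputs, mirroring how LUT primitives are configured."""
    index = (i3 << 3) | (i2 << 2) | (i1 << 1) | i0
    return (init >> index) & 1

# A LUT programmed as a 2-input XOR on i1 and i0 (INIT pattern 0110
# repeated for the unused upper inputs):
XOR2_INIT = 0x6666
assert lut4(XOR2_INIT, 0, 0, 1, 0) == 1
assert lut4(XOR2_INIT, 0, 0, 1, 1) == 0
```

Expressed this way, the primitive is plain combinational logic, which is why flattening lets the synthesis tool optimize across LUT boundaries.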
As can be seen from these experiments, it is not certain whether a
certain design will be faster or slower when inferred or instantiated af-
ter it has been ported to an ASIC. Even so, it is surprising that the per-

formance difference between the inferred and instantiated adder/sub-
tracter (addsub) is relatively small, considering the fact that the instan-
tiated component is ripple-carry based. The synthesis tool is obviously
able to optimize the combinational paths of the ripple-carry adder so that
the final end result is an optimized adder instead of a plain ripple-carry
adder.
The arithmetic unit (AU) and MAC component are taken from a soft-
core processor optimized for the Virtex-4 [11]. While the arithmetic unit
does fairly well when ported, the MAC doesn’t. When rewriting the
MAC unit using a pipelined DesignWare multiplier, the ASIC port gains
a distinct advantage on the other hand.

FPGA and ASIC optimization hint


If FPGA primitives are instantiated manually in the HDL source code it
is still possible to create an ASIC port by using a small compatibility li-
brary with synthesizable versions of these primitives. There may be a loss of
performance when using this method, especially when instantiating larger
components like DSP blocks and our recommendation is therefore to avoid
primitive instantiation unless the gains are huge. Nevertheless, if this ap-
proach is used, it is imperative that the design is flattened before the opti-
mization phase!

5.4 Manual Floorplanning and Routing


Although floorplanning is not commonly used in FPGA design it can be
a powerful tool. If it is done through the graphical tools it should have
no impact on an ASIC port since the HDL source code is not modified in
any way. If it is done using RLOC synthesis attributes in the HDL source
code, it is also likely that the source code will be modified to manually
instantiate FPGA primitives instead of inferring them. In that case it is
necessary to assess the impact of these manual instantiations separately.
Manually instantiating FPGA primitives can actually be a good idea even
if graphical tools are used for floorplanning, since this ensures that the
name of these primitives will stay the same if other part of the design are
changed or if different versions of the synthesis tool are used. Although

manual routing is rarely done in practice, the same reasoning is true here
as well (although this is almost exclusively done using graphical tools).

FPGA and ASIC optimization hint


Manual floorplanning will itself have no impact on an ASIC port. However,
it is likely that a design has to be modified to simplify floorplanning. In that
case these modifications have to be assessed for ASIC portability.

6 Other Porting Issues


An important issue which has not been discussed yet is the ability to
reconfigure the FPGA. This is a powerful ability which can be used to for
example reduce the area of a design, correct bugs or handle diagnostic
testing (through specially created bitstreams). It is obvious that an ASIC
port will be complicated if a design relies on the ability to reconfigure the
FPGA. If nothing else, the difficulty of fixing bugs in the ASIC may force
the designer to add functionality to the ASIC to make it possible to work
around at least some bugs.
There are also many other issues that have to be considered when
porting an FPGA design to an ASIC that are not directly related to FPGA
specific optimizations. These include design for test, I/O, and power
dissipation. A thorough treatment of these topics is out of the scope of
this paper, however.

7 Conclusions
In this paper we have discussed how important design constructs perform
in terms of area and maximum frequency in FPGAs and ASICs. We
have also discussed how various FPGA optimization techniques can be
used. We conclude that most of these techniques are either beneficial
or relatively harmless for the performance and area of an ASIC port.
The most dangerous areas are memories and DSP blocks, and extra care
must be taken to make sure that an ASIC port is efficient if a design has
been specifically optimized for these FPGA components.
