
Recent Progress in Field Programmable Gate Arrays:

Hardware, CAD software, evaluation boards, and reconfigurable computing


Marek Perkowski Chengdu, June 2008

Programmable Logic
The simplest programmable logic devices (PLDs) are PALs (see the 22V10 figure on the next page). My students use them in their first year at PSU.
What is the next step in the evolution of PLDs?
More gates!

How do we get more gates? We could put several PALs on one chip and put an interconnection matrix between them!!
This is called a Complex PLD (CPLD).

22V10 PLD

Cypress CPLD

Programmable interconnect matrix.

Each logic block is similar to a 22V10.

Any other approaches?


Another approach to building a better PLD is to place a lot of primitive gates on a die, and then place programmable interconnect between them:

FPGA Technology

1. Bird's Eye View of FPGA Technology
2. FPGAs in 2004: Virtex-4 Introduction
3. Software and Design
4. Special Problems and Solutions

Field Programmable Gate Arrays


The FPGA approach is to arrange primitive logic elements (logic cells) in rows and columns with programmable routing between them.
What constitutes a primitive logic element? Lots of different choices can be made! The primitive element must be a complete logic family. Options include:
  A primitive gate like a NAND gate
  A 2/1 mux (this happens to be a complete logic family)
  A lookup table (i.e., a 16x1 lookup table can implement any 4-input logic function)
Often one of the above is combined with a DFF to form the primitive logic element.
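The two claims about lookup tables and multiplexers can be checked with a small software model. This is only an illustrative sketch, not a description of any vendor's cell: the names `make_lut4`, `lut4_read`, `mux2`, `mux_not`, `mux_and`, and `mux_or` are invented here, and the 4-input XOR is an arbitrary example function.

```python
from itertools import product

def make_lut4(func):
    """Precompute the 16 configuration bits of a 16x1 LUT for a 4-input function."""
    return [func(a, b, c, d) & 1 for a, b, c, d in product((0, 1), repeat=4)]

def lut4_read(table, a, b, c, d):
    """Evaluate the LUT: the four inputs simply form an address into the table."""
    return table[(a << 3) | (b << 2) | (c << 1) | d]

def mux2(sel, in0, in1):
    """2:1 multiplexer: pass in1 when sel is 1, otherwise in0."""
    return in1 if sel else in0

# NOT, AND, and OR built from nothing but 2:1 muxes and the constants 0 and 1.
# Since {NOT, AND, OR} is complete, this suggests why the mux is a complete family.
def mux_not(x):
    return mux2(x, 1, 0)

def mux_and(x, y):
    return mux2(x, 0, y)

def mux_or(x, y):
    return mux2(x, y, 1)

# Example configuration: a 4-input XOR (parity) stored as 16 bits.
xor4_table = make_lut4(lambda a, b, c, d: a ^ b ^ c ^ d)
print(xor4_table)
print(lut4_read(xor4_table, 1, 0, 1, 1))
```

Any other 4-input function is obtained by changing only the 16 stored bits, which is the sense in which the LUT is programmable.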

Other FPGA features


Besides primitive logic elements and programmable routing, some FPGA families add other features:
Embedded memory
  Many hardware applications need memory for data storage. Many FPGAs include blocks of RAM for this purpose.
Dedicated logic for carry generation, or other arithmetic functions
Phase locked loops for clock synchronization, division, multiplication

Altera Flex 10K FPGA Family

Altera Flex 10K FPGA Family (cont)

Dedicated memory

16 x1 LUT

DFF

Embedded Array Block


Memory block, which can be configured as:
256 x 8, 512 x 4, 1024 x 2, or 2048 x 1
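All four configurations describe the same 2048 bits of storage, traded between depth and width. A short check, given only as a sketch (the names `EAB_CONFIGS` and `address_bits` are invented for this example):

```python
# (depth in words, width in bits) for each configuration listed on the slide
EAB_CONFIGS = [(256, 8), (512, 4), (1024, 2), (2048, 1)]

def address_bits(depth):
    """Number of address lines needed to select one of `depth` words."""
    return (depth - 1).bit_length()

for depth, width in EAB_CONFIGS:
    total_bits = depth * width
    print(f"{depth} x {width}: {total_bits} bits, {address_bits(depth)} address lines")
```

Each step that halves the data width adds one address line, so the block capacity stays constant.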

EPROM/EEPROM Technology
EPROM can be reprogrammed; no need for external storage.
EPROM cannot be reprogrammed in circuit.
EEPROM can be reprogrammed in circuit.
EEPROM consumes about twice the area of EPROM.

Erasable PLD (EPLD)


Logic array: SOP-based PAL.
I/Os: input, output, bidirectional.
Registers: configurable as D, T, JK, or SR FFs. Programmable clock to each FF.

Programming the FPGA


Configuration. Readback - design verification and debugging. Security - a security-bit to prevent readback.

Advantages and Disadvantages of FPGA


Advantages: Fast turnaround. Low NRE (non-recurring engineering) charges. Low risk. Effective design verification. Low testing cost.
Disadvantages: Chip size & cost. Slow speed.

CPLD versus FPGA


                          CPLD                                   FPGA
Interconnect style        Continuous                             Segmented
Architecture and timing   Predictable                            Unpredictable
Software compile times    Short                                  Long
In-system performance     Fast                                   Moderate
Power consumption         High                                   Moderate
Applications addressed    Combinational and registered logic     Registered logic only

Source: Altera

FPGAs
What? - Programmable logic + programmable routing = FPGAs. Why? - Zero NREs, easy bug fixes, and short time-to-market. How?

Comparison of Different Design Technologies


               Custom   Std Cells   Gate Arrays   FPGAs
Design time    Long     Short       Short         Short
Fabrication    Long     Long        Short         None
Chip area      Small    Med.        Large         Very large
Design cost    High     Med.        Low           Very low
Unit cost      Low      Low         Med.          High
Design cycle   Long     Med.        Short         Very short

Emerging FPGA-based Applications


Low-volume production. Urgent time-to-market competition. Rapid prototyping. Logic emulation. Custom-computing hardware. Reconfigurable computing.

Design Considerations
Target architecture. Fixed logic and routing resources. Fixed I/O pins. Slow signal delays.

FPGA Selection Criteria


Density. Speed. Price. Flexibility.

COSTS of Technologies
Lower Cost
Moore's Law is alive
Smaller geometries, larger wafers, and lower defect density (= higher yield) continue to achieve lower cost per function

LUT + flip-flop: $1.00 in 1990, $0.002 in 2003
State-of-the-art: 90 nm on 300 mm wafers
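The two price points for a LUT + flip-flop imply an average yearly rate of decline. The sketch below derives it, assuming a smooth exponential trend between 1990 and 2003, which the slide does not claim; the variable names are invented for this example.

```python
import math

price_1990 = 1.00    # dollars per LUT + flip-flop (from the slide)
price_2003 = 0.002   # dollars per LUT + flip-flop (from the slide)
years = 2003 - 1990

total_factor = price_1990 / price_2003            # overall cost reduction
annual_factor = total_factor ** (1 / years)       # average reduction per year
halving_time = years * math.log(2) / math.log(total_factor)

print(f"total reduction: {total_factor:.0f}x over {years} years")
print(f"average reduction per year: {annual_factor:.2f}x")
print(f"price halves roughly every {halving_time:.2f} years")
```

Under that assumption the price per function halves in well under two years.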


Spartan-3 uses this technology for lowest cost

Rapid price reductions, intense competition

Changing costs of FPGAs and technologies


More Logic and Better Features:
>100,000 LUTs & flip-flops
>200 BlockRAMs, and the same number of 18 x 18 multipliers

1156 pins (balls) with > 800 GP I/O


50 I/O standards, incl. LVDS with internal termination

16 low-skew global clock lines


Multiple clock management circuits

On-chip microprocessor(s) and Gbps transceivers
Gate count is really a meaningless metric

A Bird's Eye View


Higher Speed
Smaller and faster transistors
90 nm technology, using 193 nm ultra-violet light
Cu interconnect (instead of Al) was easily achieved
Low-K dielectric progress is disappointing

System speed: up to 500 MHz,


Mainly through smart interconnects, clock management, dedicated circuits, flexible I/O.

Integrated transceivers running at 10 Gigabits/sec

A Bird's Eye View


Better tools
Back-End Place&Route and XST synthesis

VHDL and Verilog becoming entry point


IP/Cores speed up design and verification
Embedded Software Development Tools
support architectures and merge HW and SW


Domain-Specific Languages

System Generator bridges the gap between Matlab/Simulink and FPGA circuit description

ASICs Are Losing Ground


Mask set >$1M + design + verification + risk
[Chart: mask costs (in million $, axis from 0 to 2) by technology generation: 250 nm, 180 nm, 130 nm, 90 nm, 65 nm. Source: IBM]

ASICs are only for extreme designs: extreme volume, speed, size, low power

SPGA
Allow multiple building blocks. Logic. Memory. Data path.

Applications Using SPGAs


Intellectual property (IP). Communication & networking. Graphical processing. Embedded processing.

Designing with SPGAs


A team-based approach. Understanding how to use SPGA system features will be the key to pulling the entire design into a single device.

CMOS PLD Market Share


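[Pie chart: CMOS PLD market share by vendor. Legend: Other, Cypress, AT&T, Actel, Lattice, AMD, Altera, Xilinx. Slice labels (in no particular order): 31%, 5%, 5%, 6%, 3%, 24%, 15%, 11%.]

Source: Dataquest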

CMOS Logic Market


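[Pie chart: CMOS logic market by segment. Legend: Std logic, Programmable, GA, Std cell, Custom, Chipset. Slice labels (in no particular order): 8%, 14%, 10%, 30%, 29%, 9%.]
But the market share is growing

Source: Dataquest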

FPGAs Growth
[Chart: FPGA market size in millions of US dollars, 1996 to 2000; vertical axis from 0 to 2500.]

Source: Integrated Circuit Engineering

CMOS Programmable-logic Market


[Chart: CMOS programmable-logic market in billions of US dollars, 1997 to 2000; vertical axis from 0 to 5.]

Source: Dataquest

Rapid Prototyping
What? Why? How?

What is prototyping?
Basic components: FPGAs and FPICs. Hardware : boards, boxes, and cabinets. Software: methodologies and CAD tools.

Field Programmable Gate Arrays

Field Programmable Interconnect Devices

Product Development Cycle


Market survey

Customer acceptance

Product development

Production

Pressures on Today's Product Development


Time-to-market! Design complexity!

Why Do We Need Prototyping?


Design verification. Limited production. Concurrent engineering.

This requires cooperation of engineers, computer science specialists and marketing

Design Verification
Specification Functionality & requirements

?
Final product Final functionality & performance

Design Process
Design flow: Specification -> System-level design -> RTL design -> Logic-level design -> Physical-level design -> Final chips
Verification options: Simulation, Fast prototyping, Formal verification, Logic emulation

Formal verification is just one of the options

Verification Alternatives
[Table: verification alternatives compared by modeling accuracy, system integration, preparation time, and speed. Alternatives listed: event-driven simulation, cycle-based simulation, behavioral simulation, breadboarding, hardware-accelerated simulation, emulation or prototyping.]

A Minute in the Life of a 100K Gates Design


Time to cover one minute of real operation:

Minutes   Approx. elapsed   Platform
1         one minute        Actual hardware at 50 MHz
10        ten minutes       Logic emulator or prototype at 5 MHz
2K                          HW accelerator at 250M evals/sec
50K       1 month           Cycle-based simulator at 1K insts/sec
120K      3 months          Compiled-code logic simulator at 125 MIPS
800K      1.5 years         Event-driven logic simulator at 125 MIPS
We need FPGA emulation because simulation is too slow
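The figures on this slide can be sanity-checked with simple arithmetic, assuming the scale counts minutes of wall-clock time per minute of real operation. The helper names `slowdown` and `minutes_to_days` and the dictionary `slide_minutes` are invented for this sketch.

```python
MINUTES_PER_DAY = 24 * 60

def slowdown(hardware_hz, platform_hz):
    """How many times slower a platform runs than the real hardware."""
    return hardware_hz / platform_hz

def minutes_to_days(minutes):
    """Convert a wall-clock duration in minutes to days."""
    return minutes / MINUTES_PER_DAY

# A 5 MHz emulator against 50 MHz hardware: ten minutes per minute of operation.
print(slowdown(50e6, 5e6))

# Simulator entries from the slide, in minutes per minute of real operation.
slide_minutes = {
    "cycle-based simulator": 50_000,
    "compiled-code logic simulator": 120_000,
    "event-driven logic simulator": 800_000,
}
for name, minutes in slide_minutes.items():
    print(f"{name}: about {minutes_to_days(minutes):.0f} days")
```

The results land close to the one month, three months, and one and a half years quoted for the three simulators.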

Development with Prototyping


small gap Big gap

SW:   Design, Code, Integration, Debug
HW:   Design, Build, Integration, Debug
CHIP: Design, Fab, Debug

Development with Prototyping


You speed up development through parallelism

SW:   Design, Code & SW Debug
HW:   Design, Build, HW Integration & Debug
CHIP: Design, Fab, Chip debug
Shared stages: System Integration, Final Integration

How to Develop a Prototype Using FPDs


Custom-designed prototyping board. Logic-emulation systems. Field-programmable printed-circuit-boards.

Field Programmable Devices

FPGA State of the Art 2004


90-nanometer manufacturing technology
Ten Gigahertz serial I/O (SerDes) in silicon
0.07 femtosecond asynchronous data capture window causes 1.5 ns metastable delay

Issues in FPGA Technologies


Complexity of Logic Element
  How many inputs/outputs for the logic element?
  Does the basic logic element contain a FF? What type?
Interconnect
  How fast is it? Does it offer high speed paths that cross the chip? How many of these?
  Can I have on-chip tri-state busses?
  How routable is the design? If 95% of the logic elements are used, can I route the design?
  More routing means more routability, but less room for logic elements

Issues in FPGA Technologies (cont)


Macro elements
  Are there SRAM blocks? Is the SRAM dual ported?
  Is there fast adder support (i.e., fast carry chains)?
  Is there fast logic support (i.e., cascade chains)?
  What other types of macro blocks are available (fast decoders? register files?)
Clock support
  How many global clocks can I have?
  Are there any on-chip Phase Locked Loops (PLLs) or Delay Locked Loops (DLLs) for clock synchronization, clock multiplication?

Issues in FPGA Technologies (cont)


What type of IO support do I have?
  TTL, CMOS are a given
  Support for mixed 5 V, 3.3 V IOs? 3.3 V internal, but 5 V tolerant inputs?
  Support for new low voltage signaling standards?
    GTL+, GTL (Gunning Transceiver Logic) - used on Pentium II
    HSTL - High Speed Transceiver Logic
    SSTL - Stub Series Terminated Logic
    USB - IO used for Universal Serial Bus (differential signaling)
    AGP - IO used for Accelerated Graphics Port

Maximum number of IO? Package types?
  Ball Grid Array (BGA) for high density IO

Altera FPGA Family Summaries

Now we discuss some popular families.

Altera Flex10K/10KE
  LEs (logic elements) have 4-input LUTs (look-up tables) + 1 FF
  Fast carry chain between LEs, cascade chain for logic operations
  Large blocks of SRAM available as well
Altera Max7000/Max7000A
  EEPROM based, very fast (Tpd = 7.5 ns)
  Basically a PLD architecture with programmable interconnect
  Max 7000A family is 3.3 V

Xilinx FPGA Family Summaries


Virtex Family
  SRAM based
  Largest device has 1M gates
  Configurable Logic Blocks (CLBs) have two 4-input LUTs, 2 DFFs
  Four onboard Delay Locked Loops (DLLs) for clock synchronization
  Dedicated RAM blocks (LUTs can also function as RAM)
  Fast carry logic
XC4000 Family
  Previous version of Virtex
  No DLLs, no dedicated RAM blocks

Actel FPGA Family Summaries


MXDS Family
  Fine grain logic elements that contain mux logic + DFF
  Embedded dual port SRAM
  One Time Programmable (OTP): no configuration loading on powerup, no external serial ROM
  AntiFuse technology for programming (AntiFuse means that you program the fuse to make the connection)
  Fast (Tpd = 7.5 ns)
  Low density compared to Altera, Xilinx: maximum number of gates is 36,000

Cypress CPLDs
Ultra37000 Family
  32 to 512 macrocells
  Fast (Tpd 5 to 10 ns depending on number of macrocells)
  Very good routing resources for a CPLD

Evolution
Year                        1965    1980    1995    2010 (?)
Max clock rate (MHz)        1       10      100     1000
Min IC geometries (µm)      --      5       0.5     0.05
# of IC metal layers        1       2       3       12
PC-board trace width (µm)   2000    500     100     25
# of PC-board layers        1-2     2-4     4-8     10-20

Every 5 years: system speed doubles, IC geometry shrinks 50%
Every 7-8 years: PC-board min trace width shrinks 50%

The Ever-Shrinking Circuitry


Number of LUTs + flip-flops + routing that fit on the cross section of a human hair:
2000   2 LUTs in Virtex-II (150 nm)
2002   3 LUTs in Virtex-IIPro (130 nm)
2004   4 LUTs in Virtex-4 (90 nm)
2005   8 LUTs = one CLB in 65 nm
Moore's law is alive and well in FPGAs

Middle-of-the-Road Xilinx FPGAs


Year   Device      LUTs + flip-flops
1990   XC3042      288
1994   XC4005      512
1998   XC4013XL    1,152
2000   XCV300      6,144
2002   XC2V1000    10,240
2004   XC2VP30     27,382
2005   XC4V60-LX   53,248

Same price for each: one day's engineering salary
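The first and last devices in this list give a rough doubling time for logic capacity at constant price. The sketch assumes steady exponential growth between 1990 and 2005, which is a simplification; the variable names are invented here.

```python
import math

luts_1990 = 288       # XC3042 (from the slide)
luts_2005 = 53_248    # XC4V60-LX (from the slide)
years = 2005 - 1990

growth = luts_2005 / luts_1990
doublings = math.log2(growth)
doubling_time_years = years / doublings

print(f"capacity grew about {growth:.0f}x in {years} years")
print(f"that is one doubling roughly every {doubling_time_years:.1f} years")
```

A doubling about every two years at the same price is in line with the Moore's law remark on the preceding slide.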

Thirteen Years of Progress of Xilinx Devices


[Chart: relative improvement from 1991 to 2004 on a log scale (1x, 10x, 100x, 1000x) for CLB capacity, speed, power per MHz, and price, with the ITRS roadmap for comparison. Device generations marked: XC4000, XC4000 & Spartan, Spartan-2, Virtex & Virtex-E, Virtex-II & Virtex-II Pro, Spartan-3, Virtex-4.]

200x more logic, plus memory, µP, DSP, MGT
40x faster
50x lower power per function x MHz
500x lower cost per function

Moore Meets Einstein


[Chart: clock frequency in MHz and trace length in cm per 1/4 clock period, plotted by year from 1965 to 2010 on a log scale from 1 to 2048.]

Speed Doubles Every 5 Years ...but the speed of light never changes
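The relationship behind this slide is that the distance a signal can travel in a quarter of a clock period shrinks in proportion to the clock frequency. A minimal sketch follows; the propagation velocity of 15 cm/ns (about half the speed of light, a common rule of thumb for pc-board traces) is an assumption not stated on the slide, and `quarter_period_trace_cm` is an invented name.

```python
def quarter_period_trace_cm(clock_mhz, velocity_cm_per_ns=15.0):
    """Distance a signal covers in 1/4 clock period.

    velocity_cm_per_ns is an assumed board-level propagation speed.
    """
    period_ns = 1000.0 / clock_mhz
    return velocity_cm_per_ns * period_ns / 4.0

for mhz in (1, 10, 100, 1000):
    print(f"{mhz:>5} MHz: {quarter_period_trace_cm(mhz):.2f} cm")
```

Because the velocity is fixed, every doubling of the clock rate halves the usable trace length, whatever value is assumed for the velocity.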

Higher Leakage Current


High Leakage current = static power consumption
Was <100 microamps, now > 100 mA, even amps (!)

Caused by:
  Gate leakage due to 16 Å gate thickness
  Sub-threshold leakage current: incomplete turn-off because threshold does not scale

Tyranny of numbers: 10 nA x 100 million transistors = 1 A


evenly distributed, thus no reliability problem

Sub-100 nm is not ideal for portable designs

FPGAs in 2003
1000 to 80,000 LUTs and flip-flops, millions of bits in dual-ported RAMs
Low-skew Global Clocks, Frequency synthesis, 50 ps phase control
18 Kbit BlockRAMs and 18 x 18 multipliers
FPGAs are not glue-logic anymore

FPGAs in 2003
300+ MHz system clock, 800 MHz I/O
3+ Gigabit transceivers
Embedded hard and soft microprocessors
Design security: Triple-DES encryption
VHDL/Verilog entry, synthesis, auto place and route
FPGAs are a compelling alternative to ASICs

FPGAs in 2004

Virtex-4 in September 2004


4th Generation Advanced Logic
Integrated 450 MHz PowerPC Cores
0.6 - 11.1 Gbps RocketIO
ASMBL Column-Based Architecture
500 MHz SmartRAM BRAM/FIFO
Integrated Tri-Mode Ethernet MAC Cores
Integrated System Monitor
500 MHz Xesium Clocking
500 MHz Xtreme DSP Slice
SelectIO with ChipSync Technology: 1 Gbps LVDS, 600 Mbps SE

New ASMBL Columnar Architecture


Enables Dial-In Resource Allocation Mix
Logic, DSP, BRAM, I/O, MGT, DCM, PowerPC

Made possible by Flip-Chip Packaging


I/O Columns Distributed throughout the Device

FPGA Innovation: Virtex-4


90 nm technology, triple-oxide, 1.2-V Vccint supply
General-purpose I/O up to 1 Gbps, Vcco = 1.5, 2.5, or 3.3 V
0.6 to 11.2 Gigabit/sec RocketI/O transceivers
Advanced Silicon Modular Block architecture
Three sub-families:
  V4-LX for logic-intensive applications
  V4-SX for DSP-intensive applications
  V4-FX with PPC micros and multi-gigabit transceivers

Common architecture for diverse applications

FPGA Innovation: Virtex-4


Higher Performance:
500 MHz for all sub-blocks

More Versatility
New innovative functions

Higher Level of Integration


More LUTs, flip-flops, RAMs, multipliers

Lower Cost
Smaller area = lower cost per function

Lower Power per ( Function times MHz )

FPGA Innovation: Virtex-4


Flip-chip packaging:
lower pin-inductance, stiffer Vcc distribution

Lower power per function and MHz


Triple-oxide gates, multiple thresholds, smaller size, lower Vcc, better design

Better clocking, less skew, more flexibility
Better configuration control, partial reconfiguration
Robust configuration cell, SEU tolerant like 130 nm

FPGA Innovation: Virtex-4


Improved I/O Flexibility and Performance
  Supports >50 standards, on-chip termination
  Source-synchronous and system-synchronous
  Serializer/deserializer behind each pin
  Programmable delay available for each pin
  >1 Gbps SelectI/O on each pin
  >10 Gbps transceivers on dedicated pins (-FX family only)
Source-synchronous I/O improves performance
Serial I/O saves pins and pc-board area

FPGA Innovation: Virtex-4


Faster logic and memory
500+ MHz operation of all on-chip functions

32-bit arithmetic
48-bit adders and synchronous loadable counters

Up to 72-bit wide memory
4- to 36-bit wide FIFO control in each BlockRAM
  Operates with fully independent write and read clocks
  Reliable EMPTY and FULL outputs, also ALMOST EMPTY and ALMOST FULL

FIFOs need no fabric resources and no design expertise

Advanced Clocking
Proper clocking is extremely important for performance and reliability
Most designs need many global clock lines with minimal clock delay and clock skew
Digital Clock Manager (DCM) provides:
  Four-phase outputs
  Frequency multiplication and division
  Fine phase adjustment

Advanced I/O
>50 Different Output Standards (strength, voltage, input threshold, etc)
multiple parallel output transistors which are either fully on or fully off,

Nothing is ever analog, except in LVDS
Digitally Controlled Impedance = DCI


for series-termination of transmission-line drivers
Adjusts up/down strength to be = external resistor One external pull-up and pull-down resistor per bank

V2Pro and Virtex-4 can update-only-if-necessary

System Synchronous
System-synchronous: the clock arrives simultaneously at all chips
Typically used below 200 MHz clock rate
On-chip clock-distribution DCM
Zero clock delay controls set-up time and avoids hold time requirements
The traditional design methodology

Source Synchronous
Each data bus has its own clock trace
Typically used at 200 to 800 MHz clock rate
On-chip clock-distribution DCM centers the clock in the data eye
Adds more unidirectional-only clock lines
The only way above 300 MHz

Serial Transceiver Technology


[Diagram: two Virtex-II Pro devices linked at 3.125 Gbps over each pair, each with a 32b @ 78 MHz parallel interface.]

Serial Transceiver Technology


[Diagram: two Virtex-4 devices linked at up to 11.1 Gbps over each pair, each with a 64b @ 168 MHz parallel interface.]
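The serial rates quoted on these two slides can be related to the parallel interface widths and clocks once line-coding overhead is included. In the sketch below, pairing 8b/10b with the Virtex-II Pro figures and 64b/66b with the Virtex-4 figures is an assumption made for illustration; the slides only list both encodings as options. The function name `line_rate_gbps` is invented.

```python
def line_rate_gbps(width_bits, clock_mhz, payload_bits, coded_bits):
    """Serial line rate for a parallel interface after line coding.

    Every payload_bits of data are sent as coded_bits on the wire.
    """
    payload_mbps = width_bits * clock_mhz
    return payload_mbps * coded_bits / payload_bits / 1000.0

# 32 bits at 78 MHz with 8b/10b coding (assumed pairing)
print(line_rate_gbps(32, 78, 8, 10))

# 64 bits at 168 MHz with 64b/66b coding (assumed pairing)
print(line_rate_gbps(64, 168, 64, 66))
```

Under these assumptions the results come out near 3.1 and 11.1 Gbps, consistent with the rates on the slides (78 MHz appears to be a rounded reference frequency).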

RocketIO Multi-Gigabit Transceiver


8 to 24 per device
Line rate: 622 Mb/s to 11.1 Gb/s
Reference clock (REFCLK): 78 MHz to 700 MHz

Transmitter blocks: TXDATA (8-64b wide), transmit buffer, encode, FIFO, serializer, serial out; TX clock generator with 16X/20X multiplier
Receiver blocks: serial in, deserializer, comma detect and word alignment, decode, elastic buffer, receive buffer, RXDATA (8-64b wide); RX clock generator

Programmable features:
  64b/66b or 8b/10b EnDec
  Comma detect
  Rx and Tx FIFO
  Pre-emphasis
  Receiver equalization
  Output swing
  On-chip termination
  Channel bonding
  AC & DC coupling

Virtex-4 Capabilities
Any type of design runs at >400 MHz
Pipelining provides extra performance for free
Synchronous is best, but 32 clocks are available
Gigabit serial saves pins and board area
On-chip termination for board signal integrity
I/O features support double-data rate operation and source-synchronous design

Virtex-4 Capabilities
Popular functions are hard-wired
for lower cost, higher performance, and ease-of-use: microprocessors, FIFOs, serial I/O, clock management, etc.

Many pre-tested soft cores are available


Some are free, some for a fee

One-hot state machines are preferred


But MicroBlaze and PicoBlaze may be better

Massive parallelism enhances DSP
Up to 1024 fast two's complement multipliers per chip, faster than dedicated DSP chips, but needs system rethinking

2004 Challenges
Technology moves rapidly: 130, 90, 65 nm Multiple Vcc, lower voltage - higher current
Lower Vcc makes decoupling very critical

Moore's law becomes more difficult to sustain


Leakage current has increased significantly
Triple-oxide transistors and clever design provide relief

Signal integrity on pc-boards is crucial


homebrew prototyping would waste money and time

Use Standard Evaluation Boards Instead
