
NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture

Wei Zhang, Li Shang and Niraj K. Jha
Dept. of Electrical Engineering, Princeton University
Dept. of Electrical and Computer Engineering, Queen's University

Outline

Temporal Logic Folding
Background on NRAMs
Overview of the hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006)
NanoMap: Design Optimization Flow
Experimental Results
Conclusions

[Figure: Input Design -> NanoMap -> NATURE]

Temporal Logic Folding

Basic idea: Use run-time reconfiguration to realize different functions in the same resource every few cycles

[Figure: a circuit of three LUTs (LUT1, LUT2, LUT3) with inputs a-h and output OUT, folded onto a single LUT that is reconfigured from memory (MEM) to act as LUT1, LUT2 and LUT3 in turn]

LUT1: i = abc
LUT2: l = (i + e + f)h
LUT3: OUT = dg + l
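The folding idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: juxtaposition is read as AND and "+" as OR, the shared resource is modeled as one 4-input LUT, and the names `make_lut`, `CONFIGS`, `folded_eval` and `flat_eval` are hypothetical, not part of NATURE or NanoMap.

```python
from itertools import product


def make_lut(fn, n_inputs=4):
    """Precompute the truth table of fn over n_inputs bits (one LUT configuration)."""
    return {bits: fn(*bits) & 1 for bits in product((0, 1), repeat=n_inputs)}


# One configuration per cycle; in NATURE these bits would come from the NRAM.
CONFIGS = [
    make_lut(lambda a, b, c, _: a & b & c),        # cycle 1: i = abc
    make_lut(lambda i, e, f, h: (i | e | f) & h),  # cycle 2: l = (i + e + f)h
    make_lut(lambda d, g, l, _: (d & g) | l),      # cycle 3: OUT = dg + l
]


def folded_eval(a, b, c, d, e, f, g, h):
    """Evaluate OUT on a single LUT that is reconfigured once per cycle."""
    i = CONFIGS[0][(a, b, c, 0)]  # intermediate result, held in a flip-flop
    l = CONFIGS[1][(i, e, f, h)]  # same LUT, second configuration
    return CONFIGS[2][(d, g, l, 0)]


def flat_eval(a, b, c, d, e, f, g, h):
    """Reference: the unfolded three-LUT circuit evaluated in one pass."""
    return (d & g) | (((a & b & c) | e | f) & h)
```

Both functions agree on all 256 input combinations: the folded version uses one LUT for three cycles instead of three LUTs for one cycle.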

Overview of NATURE
[Figure: NATURE and its key attributes: CMOS fabrication compatible; NRAM-based; run-time reconfiguration; temporal logic folding; design flexibility; logic density]

Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits
Fine-grain reconfiguration (even cycle-by-cycle) and logic folding
Area-delay trade-off flexibility
More than an order of magnitude increase in logic density
More than an order of magnitude reduction in area-time product
Comparisons assume NRAMs and CMOS logic are implemented in the same technology
Non-volatility: useful in low-power and secure processing

Overview of NATURE (Contd.)

Challenges in nano-circuits/architectures
  Many programmable nanofabrics proposed: Nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.
  Lack of a mature fabrication process
  Fabrication defects and run-time failures (between 1% and 10%)

Regular, reconfigurable architectures, such as an FPGA, favored
  Facilitates fabrication
  Fault tolerance through reconfiguration

NATURE: fabricatable using CMOS-compatible fabrication process

NRAM™ by Nantero

Source: http://www.nantero.com/nram.html

Non-volatile nanotube random-access memory (NRAM)


Mechanically bent or not: determines bistable on/off states
Same/opposite voltage applied to change the state
CMOS-compatible fabrication process
10 Gbit NRAMs already fabricated: ready to be commercialized in the near future

NRAMs

Properties of NRAMs
  Non-volatile
  Similar speed to SRAM
  Similar density to DRAM
  Chemically and mechanically stable

NATURE not tied to NRAMs; other candidate memories:
  Phase change RAM
  Magnetoresistive RAM
  Ferroelectric RAM


Architecture of NATURE
Island-style logic blocks (LBs) connected by various levels of interconnects
An LB contains a super macroblock (SMB) and a local switch matrix

[Figure: island-style array of LBs with connection blocks and switch blocks; interconnect types shown are direct links, length-1 wires, length-4 wires and long wires. Inset: an LB containing an SMB and a switch matrix, surrounded by S1 switch boxes]

S1: Switch box between length-1 wires
S2: Switch box between length-4 wires
Switch matrix: Local routing network

Architecture of a Super Macroblock (SMB)

n1 macroblocks (MBs) comprise an SMB: here n1 = 4


[Figure: SMB with four MBs, each paired with an NRAM. Inputs from the switch matrix pass through multiplexer groups (labeled "20 44X1 MUX") controlled by SRAM bits; MB outputs go to the interconnect; CLK, global signals and reconfiguration bits feed each MB]

Architecture of a Macroblock (MB)

n2 logic elements (LEs) comprise an MB: here n2 = 4


[Figure: MB with four LEs, each paired with an NRAM. Inputs to the MB reach each LE through a 13-to-5 crossbar configured by 65 SRAM bits; the MB has 8 outputs; CLK, global signals and reconfiguration bits feed each LE]
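The hierarchy parameters from the SMB and MB slides can be collected in a small record to check the arithmetic. This is a sketch only: the class `NatureInstance` and its field names are hypothetical, and reading "65 SRAM bits" as one bit per crosspoint of a 13-to-5 crossbar (13 x 5 = 65) is an assumption that happens to match the figure labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NatureInstance:
    n1: int = 4         # MBs per SMB (from the slide)
    n2: int = 4         # LEs per MB (from the slide)
    xbar_in: int = 13   # crossbar inputs (from the MB figure)
    xbar_out: int = 5   # crossbar outputs (from the MB figure)

    @property
    def les_per_smb(self) -> int:
        """Number of LEs available in one SMB."""
        return self.n1 * self.n2

    @property
    def sram_bits_per_crossbar(self) -> int:
        """Assumed: one SRAM bit per crosspoint."""
        return self.xbar_in * self.xbar_out


inst = NatureInstance()
```

With the default values, `inst.les_per_smb` is 16 and `inst.sram_bits_per_crossbar` is 65.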

Logic Element (Basic Configuration)

An LE implements a computation and contains:


  An m-input look-up table (LUT)
  l flip-flops
  Input to flip-flop selected between LUT output and a primary input

[Figure: LE with an m-input LUT, two D flip-flops (DFFs) clocked by CLK, and an SRAM cell]

Folding Levels

Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs
Level-p folding: LE reconfiguration after the execution of p LUT computations

Reconfiguration time: 160ps

A larger folding level typically decreases delay but increases area


[Figure: the same LUT network mapped with (a) level-1 folding and (b) level-2 folding; reconfiguration takes place between folding stages]
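The delay side of this trade-off can be sketched numerically. The model below is an assumption for illustration, not the authors' delay model: each folding cycle is taken to cost p LUT delays plus one 160 ps reconfiguration, and `lut_delay_ps` is a hypothetical parameter.

```python
import math

RECONFIG_PS = 160  # reconfiguration time quoted on the slide


def folding_stages(logic_depth, p):
    """Number of folding stages needed for level-p folding."""
    return math.ceil(logic_depth / p)


def estimated_delay_ps(logic_depth, p, lut_delay_ps):
    """Assumed model: stages * (p LUT delays + one reconfiguration)."""
    stages = folding_stages(logic_depth, p)
    return stages * (p * lut_delay_ps + RECONFIG_PS)


# Hypothetical circuit: logic depth 8, 500 ps per LUT level.
delays = {p: estimated_delay_ps(8, p, 500) for p in (1, 2, 4)}
```

Under these assumptions a larger folding level means fewer reconfigurations and therefore less delay, while each stage must hold more LUTs at once, which is the area cost.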

Design Optimization Flow: NanoMap

Optimize and implement design on NATURE
Integrate temporal logic folding
  Choose a proper folding level
  Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles
Input design specified in register-transfer level (RTL) and/or gate-level VHDL

Motivational Example
[Figure: example RTL circuit with two 4-bit inputs, level-1 registers reg1 and reg2, logic L1-L3, LUT1-LUT4 with signals s0 and s1, and level-2 register reg3. Terms illustrated: plane, logic in plane, plane cycle, folding stage, folding cycle]

Different planes should have the same number of folding stages to guarantee global synchronization
Key issue: how to achieve the optimization objective
  Choose an appropriate folding level
  Assign the logic to folding stages

Motivational Example (Contd.)


[Figure: the same example circuit annotated with resource counts. Module annotations: 8 LUTs, logic depth 4; 38 LUTs, logic depth 7. Plane depth: 9. Total: 50 LUTs and 14 flip-flops]

Example optimization objective

Minimize circuit delay under an area constraint of 32 LEs
Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops

Iterative Design Flow

Start with initial guess for folding level and iteratively refine it

Large folding level -> better circuit delay, but large area cost
Initial #folding stages: $\lceil 50/32 \rceil = 2$
Initial folding level: $\lceil 9/2 \rceil = 5$

Partition RTL modules into a series of connected LUT clusters
  Each cluster has a logic depth at most equal to the folding level
  Significantly speeds up the mapping procedure
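The initial guess for the running example (50 LUTs, 32 LEs, plane depth 9) can be reproduced with a short sketch. The function names are hypothetical, and `largest_cluster_luts` stands in for the module partitioner, which the slides do not specify; only the ceiling arithmetic comes from the example.

```python
import math


def initial_folding(total_luts, available_les, plane_depth):
    """Initial (#folding stages, folding level) from the area constraint."""
    stages = math.ceil(total_luts / available_les)
    level = math.ceil(plane_depth / stages)
    return stages, level


def stages_for_level(plane_depth, level):
    """Folding stages needed once a folding level is fixed."""
    return math.ceil(plane_depth / level)


def refine_level(level, largest_cluster_luts, available_les):
    """Decrease the folding level until the largest LUT cluster fits.

    largest_cluster_luts is a caller-supplied function (hypothetical here)
    mapping a folding level to the size of its largest LUT cluster.
    """
    while level > 1 and largest_cluster_luts(level) > available_les:
        level -= 1
    return level
```

For the example, `initial_folding(50, 32, 9)` gives 2 stages at level 5, and `stages_for_level(9, 4)` gives 3 stages once the level is reduced to 4.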

Iterative Design Flow (Contd.)

Cluster size should be smaller than the area constraint


[Figure: 4-bit multiplier (inputs a0-a3 and b0-b3, product bits P0-P7, built from full adders, FA) partitioned into Cluster 1 and Cluster 2 under (a) level-5 folding, where a cluster requires 34 LUTs > 32 LUTs, and (b) level-4 folding]

Solution for the Example


[Figure: flowchart: Choose folding level -> Module partition -> Constraint satisfied? -> FDS to balance resource usage -> Constraint satisfied? -> Solution. Decision boxes have Yes/No branches, and "Decrease folding level" feeds back into the loop. Alongside: the resulting schedule over folding cycles 1-3, showing the LEs used by the adder (add), the multiplier clusters (mul: c1, mul: c2), LUT1-4, storage for s0 and s1, and reg1-3]

Three folding stages using level-4 folding
32 LEs required for mapping the RTL circuit; area constraint satisfied
Circuit delay = 3 * folding cycle delay

NanoMap: Flow Diagram


[Figure: NanoMap flow diagram with 16 numbered steps.
Inputs: input network, optimization objective, user constraint, module library.
Phases: logic mapping, temporal clustering, temporal placement, routing.
Boxes, in approximate flow order: circuit parameter search; folding level computation; RTL module partition; perform logic folding?; schedule each LUT/LUT cluster using FDS; map each LUT/LUT cluster to SMBs; satisfy area constraints?; fast placement using modified VPR placer; placement routable?; delay estimation; satisfy delay constraints?; refine placement?; final placement using modified VPR placer; final routing using VPR router; output reconfiguration bits]

Force-Directed Scheduling

Perform FDS on RTL modules partitioned into LUTs/LUT clusters
Iteratively schedule each LUT/LUT cluster to minimize overall resource usage
Model resource usage as a force: F = Kx
  K: distribution graphs (DGs) that describe the probability of resource usage
  LE usage depends on LUT computations and register storage operations: two DGs needed
Aim of FDS: minimize force, indicating minimum increase in resource usage
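A minimal sketch of the force computation follows, in the style of classic force-directed scheduling. It is a simplification, not NanoMap's scheduler: it uses a single DG (LUT computations only, no register-storage DG), ignores predecessor/successor forces, and assumes each node comes with a precomputed time frame of feasible folding cycles. Here x is taken to be the change in a node's scheduling probability. All names are hypothetical.

```python
def distribution_graph(frames, n_steps):
    """DG[t] = expected number of LUTs in step t, with each node spread
    uniformly over its time frame (lo, hi)."""
    dg = [0.0] * n_steps
    for lo, hi in frames.values():
        width = hi - lo + 1
        for t in range(lo, hi + 1):
            dg[t] += 1.0 / width
    return dg


def self_force(dg, frame, t):
    """F = sum over the frame of K * x, with K = DG[s] and
    x = change in probability when the node is fixed at step t."""
    lo, hi = frame
    width = hi - lo + 1
    return sum(dg[s] * ((1.0 if s == t else 0.0) - 1.0 / width)
               for s in range(lo, hi + 1))


def schedule(frames, n_steps):
    """Repeatedly fix the (node, step) pair with the smallest force."""
    frames = dict(frames)
    while any(lo != hi for lo, hi in frames.values()):
        dg = distribution_graph(frames, n_steps)
        _, name, t = min(
            (self_force(dg, fr, t), name, t)
            for name, fr in frames.items() if fr[0] != fr[1]
            for t in range(fr[0], fr[1] + 1)
        )
        frames[name] = (t, t)
    return {name: lo for name, (lo, _) in frames.items()}


# Two fixed nodes in step 0 and one free node: the free node is pushed to step 1.
result = schedule({"A": (0, 0), "B": (0, 0), "C": (0, 1)}, n_steps=2)
```

In the example, placing C in step 0 has force +1.0 and placing it in step 1 has force -1.0, so the sketch picks step 1 and the per-step LUT count stays at two or fewer.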

Temporal Clustering

For each folding stage, a constructive algorithm used to assign LUTs to LEs and pack LEs into MBs and SMBs

Unpacked LUT with a maximal number of inputs selected as initial seed
New LUTs with high attractions to the seed selected and assigned to the SMB
  Attractions depend on timing criticality and input pin sharing
  Considers attractions across all the folding cycles

[Figure: LUTs (A, B, D, F) shown in folding cycle 1 and folding cycle 2]
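The seed-and-attract packing can be sketched as a greedy loop. This is a hedged illustration for a single folding stage only: the attraction formula (a weighted sum of timing criticality and shared input pins, with weight `alpha`) is an assumption in the spirit of common FPGA packers, and the function names are hypothetical. The real flow also considers attractions across all folding cycles.

```python
def attraction(cluster_nets, cand_inputs, criticality, alpha=0.5):
    """Assumed attraction: weighted sum of criticality and shared input pins."""
    shared = len(cluster_nets & cand_inputs)
    return alpha * criticality + (1 - alpha) * shared


def pack_smb(luts, capacity, crit=None, alpha=0.5):
    """Pack up to `capacity` LUTs into one SMB.

    luts: dict mapping LUT name -> set of input net names.
    crit: optional dict mapping LUT name -> timing criticality in [0, 1].
    """
    crit = crit or {}
    unpacked = dict(luts)
    # Seed: the unpacked LUT with the most inputs (ties broken by name).
    seed = max(sorted(unpacked), key=lambda n: len(unpacked[n]))
    cluster = [seed]
    nets = set(unpacked.pop(seed))
    while unpacked and len(cluster) < capacity:
        best = max(sorted(unpacked),
                   key=lambda n: attraction(nets, unpacked[n],
                                            crit.get(n, 0.0), alpha))
        cluster.append(best)
        nets |= unpacked.pop(best)
    return cluster


example = pack_smb(
    {"A": {"x0", "x1", "x2", "x3"}, "B": {"x0", "x1", "y0"},
     "C": {"z0", "z1"}, "D": {"x2", "x3", "y0", "y1"}},
    capacity=3,
)
```

Here A is the seed, B joins because it shares two inputs, and D then shares three nets with the growing cluster, so C is left for another SMB.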

Placement and Routing

VPR (U. Toronto) modified to perform placement and support temporal logic folding
  Simulated annealing approach
  Cost function computed across the folding stages
Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects

[Figure: LUTs C and D placed on SMBs (SMB 1, SMB 4) in folding cycle 1 and folding cycle 2]
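A cost function summed over folding stages can be sketched with a toy annealer. Everything here is an assumption for illustration: half-perimeter wirelength stands in for the real cost function of the modified VPR placer, the move set is a simple swap of two SMB positions, and the names are hypothetical. The key point is that one physical placement is shared by all folding stages, so the cost adds up each stage's nets.

```python
import math
import random


def hpwl(net, pos):
    """Half-perimeter wirelength of one net (tuple of block names)."""
    xs = [pos[b][0] for b in net]
    ys = [pos[b][1] for b in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def placement_cost(nets_per_stage, pos):
    """Sum wirelength over the nets of every folding stage."""
    return sum(hpwl(net, pos) for nets in nets_per_stage for net in nets)


def anneal(nets_per_stage, pos, temp=5.0, cooling=0.9, moves=50, seed=0):
    """Swap-based simulated annealing; returns the best placement seen."""
    rng = random.Random(seed)
    pos = dict(pos)
    cost = placement_cost(nets_per_stage, pos)
    best, best_cost = dict(pos), cost
    blocks = sorted(pos)
    while temp > 0.01:
        for _ in range(moves):
            a, b = rng.sample(blocks, 2)
            pos[a], pos[b] = pos[b], pos[a]
            new = placement_cost(nets_per_stage, pos)
            if new <= cost or rng.random() < math.exp((cost - new) / temp):
                cost = new
                if cost < best_cost:
                    best, best_cost = dict(pos), cost
            else:
                pos[a], pos[b] = pos[b], pos[a]  # reject: undo the swap
        temp *= cooling
    return best, best_cost


# Four SMBs on a 2x2 grid; the SMB1-SMB4 net is used in both folding stages.
start = {"SMB1": (0, 0), "SMB2": (0, 1), "SMB3": (1, 0), "SMB4": (1, 1)}
stages = [[("SMB1", "SMB4")], [("SMB1", "SMB4"), ("SMB2", "SMB3")]]
best_pos, best_cost = anneal(stages, start)
```

The starting placement puts both connected pairs on diagonals (cost 6); any placement with SMB1 next to SMB4 and SMB2 next to SMB3 reaches the minimum of 3.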

Experimental Setup

Instance of architecture:

  4 MBs in an SMB
  4 LEs in an MB
  LEs contain a 4-input LUT and 2 flip-flops

Impact of fixing k at 16 vs. allowing a high enough k to show design trade-offs
Results based on 100nm technology parameters to implement CMOS logic and NRAMs

Experimental Results (Contd.)


[Figure: two bar charts over the benchmarks ex1, ex2, FIR, c5315, Biquad, Paulin and ASPP4, each comparing no folding, k enough and k = 16. Left: delay (ns) for AT optimization (normalized to no-folding). Right: #LE * delay advantage for AT optimization (normalized to no-folding)]

Experimental Results (Contd.)


Improvement under AT optimization for RTL Benchmarks

                          k enough   k = 16
Reduction in #LEs         14.8X      9.2X
Maximum AT improvement    16.2X      9.3X
Average AT improvement    11.0X      7.8X
Circuit delay increase    31.8%      19.4%

LE utilization around 100%
50% reduced need for a deep interconnect hierarchy for level-1 folding vs. no-folding
Indicates that trading interconnect area for NRAM area is advantageous

Experimental Results (Contd.)


Flexibility in choosing the best folding level and performing area-delay trade-offs
Mapping results for typical optimizations using Paulin benchmark as an example

Typical optimizations

Case     Opt. obj.   Area const. (#LEs)   Delay const. (ns)   Folding level
Case 1   AT          No                   No                  1
Case 2   Delay       No                   No                  No
Case 3   Area        No                   27                  4
Case 4   Delay       210                  No                  3

[Figure: mapping results for typical optimizations: delay (ns) and area (#LEs) for cases 1-4, on a logarithmic scale from 1 to 10000]

Conclusions

NATURE: A new high-performance run-time reconfigurable architecture
NanoMap: an integrated optimization design flow for NATURE
Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding, leading to significant logic density and area-time product advantages
Can be very useful for cost-conscious embedded systems and improvement of future FPGAs
Non-volatility: helpful in secure and low-power processing
