NanoMap: An Integrated Design Optimization Flow for a Hybrid Nanotube/CMOS Dynamically Reconfigurable Architecture
Wei Zhang, Li Shang and Niraj K. Jha
Dept. of Electrical Engineering, Princeton University
Dept. of Electrical and Computer Engineering, Queen's University
Outline

Temporal Logic Folding
Background on NRAMs
Overview of the hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006)
NanoMap: Design Optimization Flow
Experimental Results
Conclusions
[Figure: input design -> NanoMap -> NATURE]
Temporal Logic Folding
Basic idea: use run-time reconfiguration to realize different functions in the same resource every few cycles
[Figure: a three-LUT circuit (LUT1, LUT2, LUT3) with inputs a to h, shown unfolded and folded onto a single LUT that is reconfigured as LUT1, LUT2 and LUT3 in turn, with a memory (MEM) feeding it]
i = abc
l = (i + e + f)h
OUT = dg + l
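The folding idea can be illustrated with a small simulation. This is a minimal sketch, not the NATURE hardware model: a single LUT resource is given one configuration per folding cycle, and a dictionary stands in for the storage that holds intermediate results. The names `CONFIGS`, `run_folded`, `run_flat` and `check_equivalence` are made up for illustration.

```python
from itertools import product

# One configuration per folding cycle: (output name, input names, function).
# The three functions are the LUT equations from the slide.
CONFIGS = [
    ("i",   ("a", "b", "c"),      lambda a, b, c: a & b & c),          # i = abc
    ("l",   ("i", "e", "f", "h"), lambda i, e, f, h: (i | e | f) & h),  # l = (i+e+f)h
    ("OUT", ("d", "g", "l"),      lambda d, g, l: (d & g) | l),         # OUT = dg + l
]

def run_folded(inputs):
    """Evaluate the circuit on ONE LUT resource, reconfigured every cycle."""
    mem = dict(inputs)  # stands in for storage of inputs and intermediates
    for out_name, in_names, func in CONFIGS:
        # "Reconfigure" the single LUT to func, then compute one result.
        mem[out_name] = func(*(mem[name] for name in in_names))
    return mem["OUT"]

def run_flat(a, b, c, d, e, f, g, h):
    """Reference: the same logic with three separate LUTs (no folding)."""
    i = a & b & c
    l = (i | e | f) & h
    return (d & g) | l

def check_equivalence():
    """Compare folded and unfolded results for all 2^8 input combinations."""
    return all(
        run_folded(dict(zip("abcdefgh", bits))) == run_flat(*bits)
        for bits in product([0, 1], repeat=8)
    )
```

The folded version uses one LUT for three cycles where the unfolded version uses three LUTs for one pass, which is the area-for-time exchange the slide describes.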
Overview of NATURE
Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits
Fine-grain reconfiguration (even cycle-by-cycle) and logic folding
  Area-delay trade-off flexibility
  More than an order of magnitude increase in logic density
  More than an order of magnitude reduction in area-time product
  Comparisons assume NRAMs and CMOS logic implemented in the same technology
Non-volatility: useful in low-power and secure processing
[Figure: NATURE is CMOS fabrication compatible, NRAM-based and run-time reconfigurable; temporal logic folding gives design flexibility and logic density]
Overview of NATURE (Contd.)
Challenges in nano-circuits/architectures
  Many programmable nanofabrics proposed: nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.
  Lack of a mature fabrication process
  Fabrication defects and run-time failures (between 1% and 10%)
Regular, reconfigurable architectures, such as an FPGA, favored
  Facilitates fabrication
  Fault tolerance through reconfiguration
NATURE: fabricatable using a CMOS-compatible fabrication process
NRAM™ by Nantero
Source: http://www.nantero.com/nram.html
Non-volatile nanotube random-access memory (NRAM)
  Mechanically bent or not: determines bistable on/off states
  Same/opposite voltage applied to change the state
  CMOS-compatible fabrication process
  10 Gbit NRAMs already fabricated: ready to be commercialized in the near future
NRAMs
Properties of NRAMs
  Non-volatile
  Similar speed to SRAM
  Similar density to DRAM
  Chemically and mechanically stable
NATURE not tied to NRAMs; other candidate memories:
  Phase change RAM
  Magnetoresistive RAM
  Ferroelectric RAM

Architecture of NATURE
Island-style logic blocks (LBs) connected by various levels of interconnects
An LB contains a super macroblock (SMB) and a local switch matrix
[Figure: NATURE architecture showing LBs, connection blocks, switch blocks, direct links, length-1 wires, length-4 wires and long wires; inside an LB, the SMB and its switch matrix]
  S1: switch box between length-1 wires
  S2: switch box between length-4 wires
  Switch matrix: local routing network
Architecture of a Super Macroblock (SMB)
n1 macroblocks (MBs) comprise an SMB: here n1 = 4

[Figure: SMB with four MBs, each paired with an NRAM; 44x1 MUXes controlled by SRAM bits select MB inputs from the switch matrix; MB outputs go to the interconnect; CLK, global signals and reconfiguration bits are distributed to the MBs]
Architecture of a Macroblock (MB)
n2 logic elements (LEs) comprise an MB: here n2 = 4

[Figure: MB with four LEs, each paired with an NRAM; 13-to-5 crossbars, each controlled by 65 SRAM bits, route the inputs to the MB and LE outputs to the LE inputs; 8 outputs of the MB]
Logic Element (Basic Configuration)
An LE implements a computation and contains:

  An m-input look-up table (LUT)
  l flip-flops
  Input to each flip-flop selected between the LUT output and a primary input
[Figure: LE with an m-input LUT, SRAM cells, two DFFs and CLK]
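A behavioural model can make the LE description concrete. This is a simplified sketch under stated assumptions, not the actual LE circuit: the class name `LogicElement`, its methods and the `"lut"`/`"primary"` selector values are hypothetical, and every flip-flop is assumed to latch on each clock (a real LE would presumably gate this).

```python
class LogicElement:
    """Toy model: an m-input LUT plus l flip-flops with an input selector."""

    def __init__(self, m, truth_table, num_ffs):
        if len(truth_table) != 2 ** m:
            raise ValueError("truth table must have 2**m entries")
        self.m = m
        self.truth_table = list(truth_table)
        self.ffs = [0] * num_ffs
        # Per flip-flop: latch the LUT output ("lut") or a primary input.
        self.ff_select = ["lut"] * num_ffs

    def reconfigure(self, truth_table):
        """Load a new LUT function, as run-time reconfiguration would."""
        if len(truth_table) != 2 ** self.m:
            raise ValueError("truth table must have 2**m entries")
        self.truth_table = list(truth_table)

    def lut_output(self, inputs):
        """Look up the output; the first input is the most significant bit."""
        index = 0
        for bit in inputs:
            index = (index << 1) | bit
        return self.truth_table[index]

    def clock(self, inputs, primary_input=0):
        """One clock edge: evaluate the LUT and update every flip-flop."""
        out = self.lut_output(inputs)
        for k, sel in enumerate(self.ff_select):
            self.ffs[k] = out if sel == "lut" else primary_input
        return out
```

For example, `LogicElement(2, [0, 0, 0, 1], num_ffs=2)` models a 2-input AND with two flip-flops; calling `reconfigure([0, 1, 1, 0])` turns the same LE into an XOR.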
Folding Levels

Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs
Level-p folding: LE reconfiguration after the execution of p LUT computations
Reconfiguration time: 160 ps
A larger folding level typically decreases delay and increases area

[Figure: the same LUT network with its reconfiguration points marked, mapped using (a) level-1 folding and (b) level-2 folding]
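The area-delay trade-off across folding levels can be sketched numerically. This is a rough model under stated assumptions, not the delay model used by the authors: the function `fold` is hypothetical, the per-LUT delay of 300 ps in the usage note is an invented placeholder, flip-flop storage for intermediate results is ignored, and each folding cycle is assumed to cost p LUT delays plus the 160 ps reconfiguration time.

```python
import math

def fold(luts_per_level, p, t_lut_ps, t_reconf_ps=160):
    """Estimate stages, LUTs needed and delay for level-p folding.

    luts_per_level[k] is the number of LUTs at logic level k of the circuit.
    """
    depth = len(luts_per_level)
    stages = math.ceil(depth / p)
    # Each folding stage executes p consecutive logic levels, so the LUTs
    # of those levels must all be present at once.
    luts_needed = max(
        sum(luts_per_level[s * p:(s + 1) * p]) for s in range(stages)
    )
    # Assumed folding cycle: p LUT computations plus one reconfiguration.
    delay_ps = stages * (p * t_lut_ps + t_reconf_ps)
    return stages, luts_needed, delay_ps
```

With a made-up four-level circuit `[4, 4, 2, 2]` and `t_lut_ps=300`, level-1 folding needs 4 LUTs, level-2 needs 8 and level-4 needs 12, while the estimated delay shrinks as p grows because fewer reconfigurations are paid.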
Design Optimization Flow: NanoMap
Optimize and implement design on NATURE
Integrate temporal logic folding
  Choose a proper folding level
  Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles
Input design specified in register-transfer level (RTL) and/or gate-level VHDL
Motivational Example
[Figure: example RTL circuit with two 4-bit inputs, level-1 registers reg1 and reg2, a level-2 register reg3, and LUT 1 to LUT 4 with signals s0 and s1; annotated with the terms plane, logic in plane, plane cycle, folding stage and folding cycle]
Different planes should have the same number of folding stages to guarantee global synchronization
Key issue: how to achieve the optimization objective
  Choose an appropriate folding level
  Assign the logic to folding stages
Motivational Example (Contd.)

[Figure: the same example circuit annotated with resource counts: one module with 8 LUTs and logic depth 4, one module with 38 LUTs and logic depth 7, plane depth 9, and 50 LUTs and 14 flip-flops overall]
Example optimization objective
Minimize circuit delay under an area constraint of 32 LEs
Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops
Iterative Design Flow
Start with an initial guess for the folding level and iteratively refine it
  Large folding level -> better circuit delay, but large area cost
  Initial #folding stages: $\lceil 50/32 \rceil = 2$
  Initial folding level: $\lceil 9/2 \rceil = 5$
Partition RTL modules into a series of connected LUT clusters
  Logic depth at most equal to the folding level
  Significantly speeds up the mapping procedure
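The initial guess for the example can be reproduced with two ceiling divisions, using the numbers given on the slides (50 LUTs in the circuit, 32 LUTs available, plane depth 9). The function name `initial_folding_level` is a hypothetical helper, and this sketch covers only the starting point, not the later refinement.

```python
import math

def initial_folding_level(total_luts, available_luts, plane_depth):
    """Return (number of folding stages, folding level) for a first guess."""
    # Enough stages so that the LUTs fit in the available LEs on average.
    stages = math.ceil(total_luts / available_luts)
    # Spread the plane depth evenly over those stages.
    level = math.ceil(plane_depth / stages)
    return stages, level

stages, level = initial_folding_level(total_luts=50, available_luts=32,
                                      plane_depth=9)
```

For the example this gives 2 folding stages and an initial folding level of 5, which the flow then lowers if a LUT cluster at that level does not fit the area constraint.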
Iterative Design Flow (Contd.)
Cluster size should be smaller than the area constraint

[Figure: 4-bit array multiplier built from full adders (FA), with inputs a0 to a3 and b0 to b3 and product bits P0 to P7, partitioned into Cluster 1 and Cluster 2 under level-5 folding and under level-4 folding; annotation: 34 LUTs > 32 LUTs]
Solution for the Example

[Figure: iterative flow (choose folding level, module partition, constraint satisfied?, decrease folding level if not, FDS to balance resource usage, constraint satisfied?, solution) next to the resulting schedule for folding cycles 1 to 3, showing the LEs used by add, mul (clusters c1 and c2), LUT1-4, reg1-3, s0, s1 and storage]
Three folding stages using level-4 folding
32 LEs required for mapping the RTL circuit; area constraint satisfied
Circuit delay = 3 * folding cycle delay
NanoMap: Flow Diagram

[Figure: NanoMap flow diagram; the steps below are listed with their figure numbers where the number survived extraction]
Inputs: input network, optimization objective and user constraint (1); module library
Logic mapping
  Circuit parameter search (2)
  Folding level computation (3)
  Perform logic folding?
  RTL module partition
  Schedule each LUT/LUT cluster using FDS (6)
Temporal clustering
  Map each LUT/LUT cluster to SMBs (7)
  Satisfy area constraints? (8)
Temporal placement
  Fast placement using modified VPR placer (9)
  Placement routable? (10)
  Delay estimation (11)
  Satisfy delay constraints? (12)
  Refine placement? (13)
  Final placement using modified VPR placer (14)
Routing
  Final routing using VPR router (15)
  Output reconfiguration bits (16)
Force-Directed Scheduling

Perform FDS on RTL modules partitioned into LUTs/LUT clusters
Iteratively schedule each LUT/LUT cluster to minimize overall resource usage
Model resource usage as a force: F = Kx
  K: distribution graphs (DGs) that describe the probability of resource usage
  LE usage depends on LUT computations and register storage operations: two DGs needed
Aim of FDS: minimize force, indicating minimum increase in resource usage
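The force model can be sketched with the classic force-directed scheduling formulation. This is a simplified illustration, not NanoMap's scheduler: it uses a single DG for LUT computations only (the slide calls for a second DG for register storage), it ignores predecessor and successor forces, and it treats each time frame as fixed rather than updating it from dependencies. All function names are hypothetical; `frames` maps each LUT cluster to its (earliest, latest) feasible folding cycle.

```python
def distribution_graph(frames, num_steps):
    """K in F = Kx: expected resource usage in each folding cycle."""
    dg = [0.0] * num_steps
    for first, last in frames.values():
        prob = 1.0 / (last - first + 1)  # uniform over the time frame
        for step in range(first, last + 1):
            dg[step] += prob
    return dg

def self_force(dg, frame, step):
    """Force of fixing one node to `step`; x is its change in probability."""
    first, last = frame
    old_prob = 1.0 / (last - first + 1)
    force = 0.0
    for s in range(first, last + 1):
        x = (1.0 if s == step else 0.0) - old_prob
        force += dg[s] * x
    return force

def schedule(frames, num_steps):
    """Repeatedly fix the (node, cycle) assignment with the lowest force."""
    frames = dict(frames)
    while any(first != last for first, last in frames.values()):
        dg = distribution_graph(frames, num_steps)
        _, name, step = min(
            (self_force(dg, frame, s), name, s)
            for name, frame in frames.items() if frame[0] != frame[1]
            for s in range(frame[0], frame[1] + 1)
        )
        frames[name] = (step, step)
    return {name: first for name, (first, _) in frames.items()}
```

As a usage example, `schedule({"A": (0, 0), "B": (0, 1), "C": (1, 1), "D": (0, 1)}, 2)` places two nodes in each folding cycle, which is the balanced outcome a low total force is meant to produce.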
Temporal Clustering
For each folding stage, a constructive algorithm is used to assign LUTs to LEs and pack LEs into MBs and SMBs
  Unpacked LUT with a maximal number of inputs selected as initial seed
  New LUTs with high attractions to the seed selected and assigned to the SMB
  Attractions depend on timing criticality and input pin sharing
  Considers attractions across all the folding cycles
[Figure: LUTs labelled A to F assigned across folding cycle 1 and folding cycle 2]
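The seed-and-attract packing described above can be sketched as a greedy loop. This is an illustrative simplification, not the NanoMap algorithm: it packs a single folding stage (the slides say attractions are considered across all folding cycles), and the attraction formula, the weight `alpha` and the names `attraction` and `pack_stage` are assumptions made for the example.

```python
def attraction(inputs, cluster_nets, criticality, alpha=0.5):
    """Assumed attraction: weighted timing criticality plus shared inputs."""
    shared = len(inputs & cluster_nets)
    return alpha * criticality + (1.0 - alpha) * shared

def pack_stage(luts, capacity, criticality=None, alpha=0.5):
    """Greedily pack LUTs of one folding stage into SMB-sized clusters.

    luts: dict mapping LUT name -> set of input net names.
    capacity: LUTs per SMB (16 for 4 MBs of 4 LEs with one LUT each).
    """
    criticality = criticality or {}
    unpacked = dict(luts)
    clusters = []
    while unpacked:
        # Seed: the unpacked LUT with the most inputs (ties by name).
        seed = max(sorted(unpacked), key=lambda n: len(unpacked[n]))
        cluster = [seed]
        nets = set(unpacked.pop(seed))
        while unpacked and len(cluster) < capacity:
            # Add the LUT most attracted to what is already in the cluster.
            best = max(
                sorted(unpacked),
                key=lambda n: attraction(unpacked[n], nets,
                                         criticality.get(n, 0.0), alpha),
            )
            cluster.append(best)
            nets |= unpacked.pop(best)
        clusters.append(cluster)
    return clusters
```

Sorting the candidate names before taking the maximum only makes tie-breaking deterministic; it is a convenience for the sketch, not part of the method.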
Placement and Routing
VPR (U. Toronto) modified to perform placement and support temporal logic folding
  Simulated annealing approach
  Cost function computed across the folding stages
Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects
[Figure: SMBs (SMB 1 to SMB 4) holding LUTs such as C and D in folding cycle 1 and folding cycle 2]
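The idea of a placement cost computed across folding stages can be sketched with a generic simulated-annealing loop. This is not VPR's code or API: the functions `placement_cost` and `anneal`, the half-perimeter wirelength metric, the swap move and the cooling parameters are all assumptions chosen for a small self-contained example. The key point is that one physical placement of SMBs is shared by every folding stage, so each candidate move is scored against the nets of all stages.

```python
import math
import random

def placement_cost(placement, nets_per_stage):
    """Sum half-perimeter wirelength of every net over all folding stages."""
    total = 0
    for nets in nets_per_stage:          # one net list per folding stage
        for net in nets:                 # a net is a tuple of SMB names
            xs = [placement[smb][0] for smb in net]
            ys = [placement[smb][1] for smb in net]
            total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def anneal(placement, nets_per_stage, steps=2000, t0=5.0, alpha=0.995, seed=0):
    """Swap-based simulated annealing; returns the best placement seen."""
    rng = random.Random(seed)
    current = dict(placement)
    cost = placement_cost(current, nets_per_stage)
    best, best_cost = dict(current), cost
    names = sorted(current)
    temperature = t0
    for _ in range(steps):
        a, b = rng.sample(names, 2)
        current[a], current[b] = current[b], current[a]      # try a swap
        new_cost = placement_cost(current, nets_per_stage)
        delta = new_cost - cost
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            cost = new_cost                                  # accept
            if cost < best_cost:
                best, best_cost = dict(current), cost
        else:
            current[a], current[b] = current[b], current[a]  # undo
        temperature *= alpha
    return best, best_cost
```

A placement here is a dict such as `{"SMB1": (0, 0), "SMB2": (1, 1)}`, and `nets_per_stage` is a list with one list of nets per folding stage.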
Experimental Setup
Instance of architecture:
  4 MBs in an SMB
  4 LEs in an MB
  LEs contain a 4-input LUT and 2 flip-flops
Impact of fixing k at 16 vs. allowing a high enough k to show design trade-offs
Results based on 100 nm technology parameters to implement CMOS logic and NRAMs
Experimental Results (Contd.)

[Figure: two bar charts for AT optimization over the benchmarks ex1, ex2, FIR, c5315, Biquad, Paulin and ASPP4, each comparing no folding, k enough and k = 16: delay (ns), normalized to no-folding; and #LE * delay advantage, normalized to no-folding]

Improvement under AT optimization for RTL Benchmarks
                          k enough   k = 16
Reduction in #LEs         14.8X      9.2X
Maximum AT improvement    16.2X      9.3X
Average AT improvement    11.0X      7.8X
Circuit delay increase    31.8%      19.4%
LE utilization around 100%
50% reduced need for a deep interconnect hierarchy for level-1 vs. no-folding: indicates that trading interconnect area for NRAM area is advantageous

Flexibility in choosing the best folding level and performing area-delay trade-offs
Mapping results for typical optimizations using the Paulin benchmark as an example
Typical optimizations
Case   Opt. obj.   Area const. (#LEs)   Delay const. (ns)   Folding level
1      AT          No                   No                  1
2      Delay       No                   No                  No
3      Area        No                   27                  4
4      Delay       210                  No                  3
[Figure: mapping results for typical optimizations, showing delay (ns) and area (#LEs) for cases 1 to 4 on an axis from 1 to 10000]
Conclusions

NATURE: a new high-performance run-time reconfigurable architecture
NanoMap: an integrated optimization design flow for NATURE
Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding, leading to significant logic density and area-time product advantages
Can be very useful for cost-conscious embedded systems and improvement of future FPGAs
Non-volatility: helpful in secure and low-power processing
