Wei Zhang, Li Shang and Niraj K. Jha
Dept. of Electrical Engineering, Princeton University
Dept. of Electrical and Computer Engineering, Queen's University
Outline
Temporal Logic Folding
Background on NRAMs
Overview of the hybrid NAnoTUbe/CMOS REconfigurable architecture (NATURE) (DAC 2006)
NanoMap: Design Optimization Flow
Experimental Results
Conclusions
[Figure: Input Design, NanoMap, NATURE]
Basic idea: Use run-time reconfiguration to realize different functions in the same resource every few cycles
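[Figure: a three-LUT circuit (LUT1 with inputs a, b, c; LUT2 with inputs i, e, f, h; LUT3 with inputs d, g, l; output OUT) folded onto a single LUT that is reconfigured from memory (MEM) as LUT1, LUT2 and LUT3 in turn]
i = abc
l = (i + e + f)h
OUT = dg + l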
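The folding idea can be sketched in a few lines of Python. This is only an illustrative model, not the NATURE hardware: `CONFIG_MEM`, `make_lut` and `folded_eval` are hypothetical names, and the reconfiguration memory is modeled as a plain list of truth tables. One LUT resource is loaded with a different configuration in each of three cycles to compute i, l and OUT.

```python
from itertools import product

def make_lut(fn, n_inputs):
    """Build the truth table (configuration bits) of an n-input function."""
    return {bits: fn(*bits) for bits in product((0, 1), repeat=n_inputs)}

# Hypothetical reconfiguration memory: one configuration per folding cycle.
CONFIG_MEM = [
    make_lut(lambda a, b, c: a & b & c, 3),            # LUT1: i = abc
    make_lut(lambda i, e, f, h: (i | e | f) & h, 4),   # LUT2: l = (i+e+f)h
    make_lut(lambda d, g, l: (d & g) | l, 3),          # LUT3: OUT = dg + l
]

def folded_eval(a, b, c, d, e, f, g, h):
    """Evaluate OUT on a single LUT resource that is reconfigured each cycle."""
    lut = CONFIG_MEM[0]          # cycle 1: resource acts as LUT1
    i = lut[(a, b, c)]
    lut = CONFIG_MEM[1]          # cycle 2: same resource acts as LUT2
    l = lut[(i, e, f, h)]
    lut = CONFIG_MEM[2]          # cycle 3: same resource acts as LUT3
    return lut[(d, g, l)]
```

The intermediate values i and l play the role of results held in flip-flops between cycles; the sketch trades three cycles of delay for a single LUT of area.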
Overview of NATURE
[Figure: NATURE, with labels: CMOS fabrication compatible, NRAM-based, run-time reconfiguration, design flexibility, logic density]
Distributed non-volatile nanotube RAMs (NRAMs): main storage for reconfiguration bits
Fine-grain reconfiguration (even cycle-by-cycle) and logic folding
Area-delay trade-off flexibility
More than an order of magnitude increase in logic density
More than an order of magnitude reduction in area-time product
Comparisons assume NRAMs/CMOS logic implemented in the same technology
Non-volatility: useful in low power & secure processing
Challenges in nano-circuits/architectures
Many programmable nanofabrics proposed: Nanowire PLA (DeHon, 2004), CMOL (Strukov, 2005), etc.
Lack of a mature fabrication process
Fabrication defects and run-time failures (between 1% and 10%)
Facilitates fabrication
Fault tolerance through reconfiguration
NATURE: fabricatable using CMOS-compatible fabrication process
NRAM™ by Nantero
Source: http://www.nantero.com/nram.html
Mechanically bent or not: determines bistable on/off states
Same/opposite voltage added to change the state
CMOS-compatible fabrication process
10 Gbit NRAMs already fabricated: ready to be commercialized in the near future
NRAMs
Properties of NRAMs
Non-volatile
Similar speed to SRAM
Similar density to DRAM
Chemically and mechanically stable
Phase change RAM, magnetoresistive RAM, ferroelectric RAM
Architecture of NATURE
Island-style logic blocks (LBs) connected by various levels of interconnects
An LB contains a super macroblock (SMB) and a local switch matrix
[Figure: array of LBs with connection blocks, switch blocks, direct links, length-1 wires, length-4 wires and long wires; each LB holds an SMB and a switch matrix]
S1: Switch box between length-1 wires
S2: Switch box between length-4 wires
Switch matrix: Local routing network
[Figure: SMB architecture: MBs with NRAMs, connected through 44x1 MUXes configured by SRAM bits]
[Figure: MB architecture: LEs with NRAMs, 13-to-5 crossbars with 65 SRAM bits; inputs to MB and outputs of MB]
An m-input look-up table (LUT)
l flip-flops
Input to flip-flop selected between LUT output and a primary input
[Figure: LE with an m-input LUT, SRAM cell, DFFs and CLK]
Folding Levels
Logic folding at different levels of granularity, providing flexibility to perform area-delay trade-offs
Level-p folding: LE reconfiguration after the execution of p LUT computations
[Figure: example LUT network shown under two folding levels, with a reconfiguration step between folding stages]
Choose a proper folding level
Use force-directed scheduling (FDS) technique to balance resource usage across folding cycles
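To make the area-delay trade-off of level-p folding concrete, the sketch below counts folding stages and LEs for a given folding level. It rests on simplifying assumptions that are not from the slides: each LUT executes at its logic depth (no rescheduling), one LE holds one LUT, and flip-flop storage is ignored. The function name `fold` and the sample depths are hypothetical.

```python
from collections import Counter
from math import ceil

def fold(lut_depths, p):
    """Return (number of folding stages, LEs needed) for level-p folding.

    lut_depths: 1-based logic depth of every LUT in the circuit.
    Assumes each LUT runs in the stage that covers its depth.
    """
    per_stage = Counter((depth - 1) // p for depth in lut_depths)
    n_stages = ceil(max(lut_depths) / p)
    les_needed = max(per_stage.values())   # LEs are reused across stages
    return n_stages, les_needed

# Hypothetical circuit: 8 LUTs with logic depth 4.
depths = [1, 1, 1, 1, 2, 2, 3, 4]
for p in (1, 2, 4):
    print(p, fold(depths, p))
```

In this toy circuit a smaller folding level needs fewer LEs but more folding stages, which is the trade-off the slide refers to.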
Motivational Example
[Figure: RTL example with two 4-bit inputs, registers reg1 to reg3 (level 1 and level 2 registers) and LUTs 1 to 4, annotated with the terms plane, logic in plane, plane cycle, folding stage and folding cycle]
Different planes should have the same number of folding stages to guarantee global synchronization
Key issue: how to achieve the optimization objective
[Figure: example RTL circuit with registers reg1 to reg3: 50 LUTs, 14 flip-flops]
Minimize circuit delay under an area constraint of 32 LEs
Assume each LE contains one LUT and two flip-flops: 32 LEs provide 32 LUTs and 64 flip-flops
Start with initial guess for folding level and iteratively refine it
Large folding level -> better circuit delay, but large area cost
Initial #folding stages: ⌈50/32⌉ = 2 (#LUTs / #available LEs)
Initial folding level: ⌈9/2⌉ = 5 (logic depth / #folding stages)
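The arithmetic behind the initial guess can be checked with a short sketch. It assumes that the numbers on this slide are the LUT count (50), the LE constraint (32) and a logic depth of 9, and that both quotients are rounded up; the function names are hypothetical.

```python
from math import ceil

def initial_guess(num_luts, available_les, logic_depth):
    """Initial (#folding stages, folding level) under the assumptions above."""
    n_stages = ceil(num_luts / available_les)
    folding_level = ceil(logic_depth / n_stages)
    return n_stages, folding_level

def stages_for_level(logic_depth, folding_level):
    """Number of folding stages needed to cover the logic depth."""
    return ceil(logic_depth / folding_level)

print(initial_guess(50, 32, 9))   # (2, 5)
print(stages_for_level(9, 4))     # 3
```

Under these assumptions the starting point is level-5 folding with two folding stages, and dropping to level-4 folding yields three folding stages, matching the figures quoted for this example.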
Logic depth at most equal to the folding level
Significantly speeds up the mapping procedure
[Figure: multiplier array with inputs a0 to a3 and b0 to b3, full adders (FA) and product bits P0 to P7, partitioned into Cluster 1 and Cluster 2]
[Figure: mapping under level-5 folding, then level-4 folding after decreasing the folding level; folding cycles 1 to 3 show per-cycle LE usage for the multiplier clusters, the adder and registers reg1-3]
Three folding stages using level-4 folding
32 LEs required for mapping the RTL circuit; area constraint satisfied
Circuit delay = 3 * folding cycle delay
[Figure: NanoMap flow chart. Inputs: module library, user constraint. Steps shown: logic mapping, temporal clustering, temporal placement, routing, delay estimation. Decision shown: placement routable?]
Force-Directed Scheduling
Perform FDS on RTL modules partitioned into LUTs/LUT clusters
Iteratively schedule LUT/(LUT cluster) to minimize overall resource usage
Model resource usage as a force: F = Kx
K: distribution graphs (DGs) that describe the probability of resource usage
LE usage depends on LUT computations and register storage operations: two DGs needed
Aim of FDS: minimize force, indicating minimum increase in resource usage
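As an illustration of the force model, here is a minimal sketch of classic force-directed scheduling with a single distribution graph. It is not NanoMap's implementation: it assumes each operation is equally likely in any step of its time frame, it uses one DG rather than the two mentioned above, and it ignores predecessor and successor forces. All names (`distribution_graph`, `self_force`, `best_step`, the sample `frames`) are hypothetical.

```python
def distribution_graph(frames, n_steps):
    """frames: {op: (asap, alap)}, inclusive. Returns expected usage per step."""
    dg = [0.0] * n_steps
    for asap, alap in frames.values():
        prob = 1.0 / (alap - asap + 1)      # uniform over the time frame
        for t in range(asap, alap + 1):
            dg[t] += prob
    return dg

def self_force(op, step, frames, dg):
    """F = sum over t of K(t) * x(t), with K = DG and x = change in probability."""
    asap, alap = frames[op]
    prob = 1.0 / (alap - asap + 1)
    force = 0.0
    for t in range(asap, alap + 1):
        x = (1.0 if t == step else 0.0) - prob
        force += dg[t] * x
    return force

def best_step(op, frames, dg):
    """Pick the step in the op's time frame with the smallest self-force."""
    asap, alap = frames[op]
    return min(range(asap, alap + 1),
               key=lambda s: self_force(op, s, frames, dg))

# Hypothetical time frames for four LUT computations over three folding cycles.
frames = {"A": (0, 0), "B": (0, 1), "C": (1, 1), "D": (0, 2)}
dg = distribution_graph(frames, 3)
print(best_step("D", frames, dg))   # 2
```

Operation D is pushed to the least crowded cycle, which is how minimizing force balances resource usage across folding cycles.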
Temporal Clustering
For each folding stage, a constructive algorithm used to assign LUTs to LEs and pack LEs into MBs and SMBs
Unpacked LUT with a maximal number of inputs selected as initial seed
New LUTs with high attractions to the seed selected and assigned to the SMB
Attractions depend on timing criticality and input pin sharing
Considers attractions across all the folding cycles
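The seed-and-attract procedure can be sketched as a greedy packing loop. This is a simplified illustration under stated assumptions: attraction is reduced to the number of shared input nets (timing criticality is left out), only a single folding cycle is considered, and a cluster is a flat list with a fixed capacity rather than an LE/MB/SMB hierarchy. The names `cluster`, `luts` and `capacity` are hypothetical.

```python
def cluster(luts, capacity):
    """Greedily pack LUTs into clusters of at most `capacity` LUTs.

    luts: {name: set of input nets}.
    """
    unpacked = dict(luts)
    clusters = []
    while unpacked:
        # Seed: unpacked LUT with the largest number of inputs.
        seed = max(sorted(unpacked), key=lambda n: len(unpacked[n]))
        members = [seed]
        nets = set(unpacked.pop(seed))
        while unpacked and len(members) < capacity:
            # Attraction (assumed): input nets shared with the cluster so far.
            best = max(sorted(unpacked), key=lambda n: len(unpacked[n] & nets))
            members.append(best)
            nets |= unpacked.pop(best)
        clusters.append(members)
    return clusters

luts = {
    "A": {"a", "b", "c", "d"},
    "B": {"a", "b", "e"},
    "C": {"x", "y"},
    "D": {"x", "y", "z"},
}
print(cluster(luts, 2))   # [['A', 'B'], ['D', 'C']]
```

LUTs that share inputs end up in the same cluster, which reduces the number of distinct signals that have to be routed into it.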
VPR (U. Toronto) modified to perform placement and support temporal logic folding
Simulated annealing approach
Cost function computed across the folding stages
[Figure: LUTs A to F placed in SMBs (SMB 1, SMB 4) across folding cycle 1 and folding cycle 2]
Routing using VPR router performed hierarchically, considering direct link, length-1, length-4 and global interconnects
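A toy version of annealing-based placement with a cost summed over folding stages might look as follows. This is not VPR code and does not use VPR's cost function or schedule: the half-perimeter wirelength metric, the cooling parameters and all names (`wirelength`, `cost`, `anneal`) are assumptions chosen for illustration. The one point it is meant to show is that a single placement is scored against the nets of every folding stage.

```python
import math
import random

def wirelength(placement, nets):
    """Half-perimeter bounding-box wirelength of the nets of one folding stage."""
    total = 0
    for net in nets:
        xs = [placement[block][0] for block in net]
        ys = [placement[block][1] for block in net]
        total += (max(xs) - min(xs)) + (max(ys) - min(ys))
    return total

def cost(placement, stages):
    """Cost of one placement, accumulated across all folding stages."""
    return sum(wirelength(placement, nets) for nets in stages)

def anneal(blocks, sites, stages, steps=2000, t0=5.0, alpha=0.995, seed=0):
    """Swap-based simulated annealing; assumes len(sites) == len(blocks)."""
    rng = random.Random(seed)
    placement = dict(zip(blocks, sites))
    current = cost(placement, stages)
    best, best_cost = dict(placement), current
    temperature = t0
    for _ in range(steps):
        a, b = rng.sample(blocks, 2)
        placement[a], placement[b] = placement[b], placement[a]
        new = cost(placement, stages)
        if new <= current or rng.random() < math.exp((current - new) / temperature):
            current = new
            if current < best_cost:
                best, best_cost = dict(placement), current
        else:
            placement[a], placement[b] = placement[b], placement[a]  # undo swap
        temperature *= alpha
    return best, best_cost

blocks = ["SMB1", "SMB2", "SMB3", "SMB4"]
sites = [(0, 0), (1, 1), (0, 1), (1, 0)]
# Hypothetical nets: stage 1 uses one net, stage 2 uses two.
stages = [[("SMB1", "SMB2")], [("SMB1", "SMB2"), ("SMB3", "SMB4")]]
best, best_cost = anneal(blocks, sites, stages)
print(best_cost)
```

Because the blocks stay in place while their configuration changes between folding cycles, a swap that helps one stage can hurt another; summing the cost over stages lets the annealer weigh both.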
Experimental Setup
Instance of architecture:
Impact of fixing k at 16 vs. allowing a high enough k to show design trade-offs
Results based on 100 nm technology parameters to implement CMOS logic and NRAMs
[Figure: two bar charts with bars labeled 1 and 2 for benchmarks ex1, ex2, FIR, c5315, Biquad, Paulin and ASPP4; values normalized to no-folding]
LE utilization around 100%
50% reduced need for a deep interconnect hierarchy for level-1 folding vs. no-folding: indicates that trading interconnect area for NRAM area is advantageous
Flexibility in choosing the best folding level and performing area-delay trade-offs
Mapping results for typical optimizations using Paulin benchmark as an example
Typical optimizations
                      case 1   case 2   case 3   case 4
Opt. obj.
Area const. (#LEs)    No       No       No       210
Delay const. (ns)     No       No       27       No
Folding level         1        No       4        3
[Figure: mapping results for typical optimizations: delay (ns) and area (#LEs) for cases 1 to 4, on a logarithmic axis from 1 to 10000]
Conclusions
NATURE: A new high-performance run-time reconfigurable architecture
NanoMap: an integrated optimization design flow for NATURE
Introduction of NRAMs into the architecture enables cycle-by-cycle reconfiguration and logic folding, leading to significant logic density and area-time product advantages
Can be very useful for cost-conscious embedded systems and improvement of future FPGAs
Non-volatility: helpful in secure and low power processing