(Tutorial) NoC The Next Generation of Multi-Processor SoC

EAIT, 2011
Network-on-Chip
The Next Generation of Multi-Processor System-on-Chip
Presenters
Dr. Santanu Chattopadhyay Associate Professor
Santanu Kundu Research Scholar
Dept. of Electronics and Electrical Communication Engineering, Indian Institute of Technology, Kharagpur.
18th Feb, 2011
email: {santanu, skundu}@ece.iitkgp.ernet.in
Lecture 1
Introduction
After mass-market production of dual-core and quad-core processor chips, the trend towards multi-core processing is now well established. In multi-core processing, multiple processors (e.g. CPU, DSP) along with multiple computer components (e.g. microcontroller, memory blocks, timers, etc.) are integrated onto a single silicon chip. This architecture is often called a Multi-Processor System-on-Chip (MPSoC).
[Figure: Architecture overview of Multi-Processor System-on-Chip. End nodes (device, SW interface, HW interface) attach through links to a communication medium.]
Introduction
System-on-Chip (SoC)
Each on-chip component is referred to as an Intellectual Property (IP) block. The communication medium used in modern multi-processor chips is bus based. Up to tens of cores in a single chip, the performance of these bus-based chips is satisfactory. Beyond that, performance degrades with the number of cores attached.
The communication backbone used in modern SoCs is a shared bus.
Limitation of Shared Global Bus

Communication Bottleneck: A shared bus allows only one communication at a time, and even in a hierarchical bus, a single communication can block all buses of the hierarchy.
Scalability: A bus-based SoC does not scale with the system size, and its bandwidth is shared by all the systems attached to it.
Limitation of Shared Global Bus

The intrinsic parasitic resistance and capacitance can be quite high for a long bus line. The global bus delay increases exponentially as process technology shrinks. Every additional IP block adds to the parasitic capacitance and causes increased propagation delay. In the deep sub-micron era, 80% or more of the delay of critical paths will be due to global interconnects.
Relative Evolution of wire and gate delays
Reference: International Technology Roadmap for Semiconductor (ITRS) Documents (2003), Available at: http://public.itrs.net/Files/2003ITRS/Home2003.htm.
Shared Global Bus to Segmented Bus

[Figure: Segmented bus and multi-level segmented bus, with repeaters (R)]
The shared global bus is segmented by inserting repeaters (R). In a segmented bus, delay increases linearly as process technology shrinks. There is no improvement in bandwidth, as it is still shared by all the cores attached to the bus. At the system level, this has a profound effect in changing the focus from computation to communication.
Point-to-Point Dedicated Links

Advantage: Bandwidth is higher than with the shared bus.
Drawback: Switch size increases with the number of cores. The number of links needed grows quadratically with the number of cores (N(N-1)/2 links for N fully connected cores). More metal layers are required for placement and routing.
Centralized Crossbar Switch

Components: Crossbar switch and point-to-point links.
Advantage: A crossbar switch enhances scalability to some extent.
Drawback: Connecting a large number of cores with a single switch is not very effective, as it is not ultimately scalable. It is therefore an intermediate solution.
Network-on-Chip: A Paradigm Shift

Off-Chip vs. On-Chip Networks
o The bandwidth of off-chip networks is typically much lower than that of on-chip networks.
o Off-chip networks are often affected by clock skew, whereas clock skew is less significant for on-chip networks.
o Off-chip networks have higher latency than their on-chip counterparts.
o Area is not a strong constraint for off-chip networks, but for on-chip networks it is one of the major constraints.
A NoC has only 3 components: 1. Network Interface (NI) 2. Switch (Router) 3. Point-to-Point Links
Reference: Benini, L. and Micheli, G.D. (2002) Network on chips: a new SOC paradigm, IEEE Computer, Vol. 35, No. 1, pp. 70-78.
Layers of Abstraction in Network-on-Chip

Session Layer - NoC Abstraction (Open Core Protocol Standardization)
Transport Layer - Network Interface
Network Layer - Router / Switch
Data Link Layer - Flow Control Protocol, Error Handling
Physical Layer - Physical Wire Connection
SoC to NoC: An Evolution

SoC
- Bandwidth is limited, shared
- Speed goes down as N grows
- Central arbitration
- No layers of abstraction
- However: Fairly simple.
NoC
- Aggregate bandwidth grows
- Speed unaffected by N
- Distributed arbitration
- Separate abstraction layers
- However: Complex architecture.
Design Goal of Network-on-Chip
- High throughput
- Low latency
- Scalable architecture
- Less energy consumption
- Smaller area requirements
- Reliability in communication
- Quality-of-Service support
Lecture 2 Architecture Design and Performance Evaluation of Network-on-Chip
Design Issues in Network-on-Chip

- Switching Techniques
- Topology Selection
- Routing
- Flow Control Protocol & GALS Implementation
- Buffering
- Arbitration
Switching Techniques
Circuit Switching
[Figure: Source end node and destination end node connected through switches, with buffers for request tokens and ack tokens]
1. Request for circuit establishment (routing and arbitration are performed during this step).
2. Acknowledgment and circuit establishment (as the ack token travels back to the source, connections are established).
3. Message transport (neither routing nor arbitration is required).
Drawback: While a circuit is held, other packets that need the same links are blocked (X), giving high contention, low utilization, and low throughput.
Switching Techniques
Store-and-forward Packet Switching
[Figure: Source end node sending through switches with buffers for data packets; each switch stores, then forwards]
Packets are completely stored before any portion is forwarded. Latency per router depends on the size of the packet.
Requirement: Buffers must be sized to hold an entire packet.
Drawback: 1. Larger buffer 2. More latency
Virtual Cut-Through Packet Switching
Latency per router is reduced by forwarding the header flit of a packet as soon as space for the entire packet is available in the next router. When the outgoing link is busy, the packet is completely stored at the switch.
Requirement: Buffers must be sized to hold an entire packet.
Advantage: Lower latency.
Drawback: Larger buffer.
Wormhole Packet Switching
When a link is busy, the flits of a packet remain stored along the switches on its path.
Requirement: Packets can be larger than buffers.
Advantage: Lower buffer space, lower latency.
Drawback: Throughput is lower than with virtual cut-through.
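The latency difference between the packet-switching techniques above can be illustrated with a simple zero-load model. This is only a sketch under assumptions that are not in the slides: every link moves one flit per cycle, there is no contention, and per-hop routing delay is ignored. The function names are hypothetical.

```python
def store_and_forward_latency(hops: int, packet_flits: int) -> int:
    """Cycles to deliver a packet when every router stores the whole packet first."""
    # The full packet is serialized again on each of the `hops` links.
    return hops * packet_flits


def cut_through_latency(hops: int, packet_flits: int) -> int:
    """Cycles for virtual cut-through or wormhole switching with no blocking."""
    # The header needs `hops` cycles; the remaining flits follow in a pipeline.
    return hops + packet_flits - 1


if __name__ == "__main__":
    for hops in (1, 3, 6):
        print(hops, store_and_forward_latency(hops, 64), cut_through_latency(hops, 64))
```

Under this model the store-and-forward latency grows with the product of hop count and packet size, while the pipelined techniques pay the packet size only once.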
Network Interface (NI) Module
Functions:
- Protocol Conversion
- Clock Domain Shifting
Flitization: a (64x32)-bit packet is split into flits. 1 Packet = 64 Flits, 1 Flit = 32 bits.
Flit formats (each 32 bits):
Header:    eop | bop | GT/BE | Src_add | Dest_add
Payload 1: eop | bop | GT/BE | DATA 1
Payload 2: eop | bop | GT/BE | DATA 2
Tailer:    eop | bop | GT/BE | DATA n
Deflitization: the flits are reassembled into the (64x32)-bit packet.
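Flitization and deflitization can be sketched as bit packing. The slides name the flit fields (eop, bop, GT/BE, source and destination address, data) but not their widths or positions, so the layout below is purely an assumption for illustration: bit 31 = eop, bit 30 = bop, bit 29 = GT/BE, 5-bit addresses in the header, and 29 data bits in every other flit. The function names are hypothetical.

```python
DATA_BITS = 29                      # assumed: 32 bits minus eop, bop and GT/BE
DATA_MASK = (1 << DATA_BITS) - 1


def flitize(src: int, dest: int, words: list[int], guaranteed: bool = False) -> list[int]:
    """Build a header flit followed by one flit per data word; the last is the tailer."""
    if not words:
        raise ValueError("need at least one data word")
    gt = int(guaranteed) << 29
    # Header: bop = 1, eop = 0, then 5-bit source and destination (assumed widths).
    flits = [(1 << 30) | gt | ((src & 0x1F) << 24) | ((dest & 0x1F) << 19)]
    for i, word in enumerate(words):
        eop = int(i == len(words) - 1) << 31    # only the tailer carries eop = 1
        flits.append(eop | gt | (word & DATA_MASK))
    return flits


def deflitize(flits: list[int]) -> tuple[int, int, list[int]]:
    """Recover (src, dest, data words) from a flit sequence."""
    header = flits[0]
    if not (header >> 30) & 1:
        raise ValueError("first flit is not a header (bop not set)")
    src, dest = (header >> 24) & 0x1F, (header >> 19) & 0x1F
    words = []
    for flit in flits[1:]:
        words.append(flit & DATA_MASK)
        if (flit >> 31) & 1:        # eop marks the tailer
            break
    return src, dest, words
```

With 63 data words this yields the 64-flit packet described above: one header, 62 payload flits, and one tailer.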

Topology Selection
Number of Links
A topology with a large number of links can support high bandwidth.
Diameter
Maximum shortest-path distance between two nodes in the network. Networks with small diameters are preferable.
Average Distance
The average of the distances between all pairs of nodes of the graph. A topology with a smaller average distance is preferable.
Bisection Width
Minimum number of wires that must be removed in order to bisect the network. A larger bisection width enables faster information exchange and is preferable.
Node Degree
Number of channels connecting a node to its neighbors. The lower this number, the easier it is to build the network.
[Figure: 2D mesh with 16 cores]
Topology selection is application dependent.
Reference: Interconnection Network Architectures (2001) pp. 26-49, Available at: www.wellesley.edu/cs/courses/cs331/notes/notesnetworks.pdf
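The metrics above can be computed directly from a topology graph. The sketch below does this for a 2D mesh using breadth-first search; the function names are hypothetical, and the degree it reports counts router-to-router channels only (it leaves out the link to the attached core).

```python
from collections import deque
from itertools import product


def mesh_neighbors(node, rows, cols):
    """Yield the routers directly linked to `node` in a rows x cols mesh."""
    r, c = node
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols:
            yield (nr, nc)


def hop_counts(source, rows, cols):
    """Breadth-first search: shortest hop count from `source` to every node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in mesh_neighbors(node, rows, cols):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def mesh_metrics(rows, cols):
    nodes = list(product(range(rows), range(cols)))
    all_dist = [d for s in nodes for t, d in hop_counts(s, rows, cols).items() if t != s]
    degrees = [len(list(mesh_neighbors(n, rows, cols))) for n in nodes]
    return {
        "diameter": max(all_dist),
        "average_distance": sum(all_dist) / len(all_dist),
        "links": sum(degrees) // 2,          # each bidirectional link is counted twice
        "max_router_degree": max(degrees),   # router-to-router channels only
    }


if __name__ == "__main__":
    print(mesh_metrics(4, 4))
```

Swapping `mesh_neighbors` for another neighbor function would give the same metrics for a different topology.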
Existing Topologies in NoC

2D Mesh
[Figure: 2D mesh of 16 cores]
Every switch is connected to its four closest neighboring switches and to its target resource block via two opposite unidirectional links, except for the switches on the edge of the layout.
For an M × N mesh:
Diameter: (M + N - 2)
Bisection Width: min (M, N)
No. of routers required: (M * N)
Node Degree: 3 (corner), 4 (edge), 5 (central)
CLICHÉ: Chip-Level Integration of Communicating Heterogeneous Elements
Reference: Kumar, S., Jantsch, A., Soininen, J. P., Forsell, M., Millberg, M., Oberg, J., Tiensyrja, K. and Hemani, A. (2002) A network on chip architecture and design methodology, Proc. of ISVLSI, pp. 117-124.

2D Torus
[Figure: 2D torus of 16 cores]
Wires are wrapped around from the top component to the bottom and from the rightmost to the leftmost.
For an M × N torus:
Diameter: ⌊M/2⌋ + ⌊N/2⌋
Bisection Width: 2 * min (M, N)
No. of routers required: (M * N)
Node Degree: 5
Disadvantage: The long end-around connections can yield excessive delays.
Reference: Dally, W. J. and Towles, B. (2001) Route packets, not wires: on-chip interconnection networks, Proceedings of the 38th Design Automation Conference (DAC 2001), pp. 684-689.

Solving Delay Problem of Torus
Reducing the maximum physical link length

Folded Torus
[Figure: 2D folded torus of 16 cores, shown next to a 2D torus of 16 cores]
Reference: Dally, W.J. and Seitz, C.L. (1986) The torus routing chip, Journal of Distributed Computing, Vol. 1, No. 4, pp. 187-196.

Octagon
[Figure: Octagon of 8 cores]
For a network having N IP blocks:
Diameter: 2 * ⌈N/8⌉
Drawback: For a system consisting of more than eight nodes, the network is extended to multidimensional space. Wiring complexity increases linearly with the number of nodes.
Reference: Karim, F., Nguyen, A. and Dey, S. (2002) An interconnect architecture for networking systems on chips, IEEE Micro, Vol. 22, No. 5, pp. 36-45.

Binary Tree
[Figure: Binary tree of 16 cores]
A binary tree-based network with N (a power of 2) IP cores has:
Diameter: log2 N
Bisection Width: 1
No. of Routers required: (N/2 - 1)
Node Degree: 5 (leaf), 3 (stem), 2 (root)
Drawback: Bisection width is very small.
Advantage: Smaller diameter.
Reference: Jeang, Y. L., Huang, W. H. and Fang, W. F. (2004) A binary tree architecture for application specific network on chip (ASNOC) design, IEEE Asia-Pacific Conference on Circuits and Systems, pp. 877-880.

Fat Tree
[Figure: Fat tree of 16 cores]
Every level has the same number of switches. The functional IP blocks reside at the leaves and the switches reside at the vertices. For N IP blocks, the network has:
Diameter: log2 N/4
Bisection Width: N/2
No. of Routers required: (N log2 N)/8
Node Degree: 8 (non-root node), 4 (root node)
SPIN: Scalable, Programmable, Integrated Network
Advantage: Large bisection width, smaller diameter
Drawback: High node degree
Reference: Guerrier, P. and Greiner, A. (2000) A generic architecture for on-chip packet-switched interconnections, Proceedings of Design, Automation and Test in Europe (DATE 2000), pp. 250-256.

Butterfly Fat Tree (BFT)
[Figure: BFT of 16 cores]
In the network, the IPs are placed at the leaves and the switches are placed at the vertices. For N IPs, the network has:
Diameter: log2 N/4
Bisection Width: N
No. of Routers needed: (N/2)
Node Degree: 6 (non-root), 4 (root)
Advantage
- Requires fewer switches
- Low diameter and large bisection width
Drawback
- High node degree
Reference: Pande, P. P., Grecu, C., Ivanov, A. and Saleh, R. (2003), High-throughput switch-based interconnect for future SoCs, Proc. Intl Workshop on System-on-Chip for Real Time Applications, pp. 304-310.
Mesh-of-Tree Topology
[Figure: 4 × 4 Mesh-of-Tree connecting 32 cores]
- In an M × N MoT, M denotes the number of row trees and N denotes the number of column trees. Both M and N are powers of 2.
- Number of nodes = 3*M*N - (M + N).
- Small diameter (2 log2 M + 2 log2 N).
- Large bisection width [min (M, N)].
Drawback
- Non-planar topology.
Reference: Kundu, S. and Chattopadhyay, S. (2008), Mesh-of-Tree Deterministic Routing for Network-on-Chip Architecture, ACM Great Lakes Symposium on VLSI, pp. 343-346.
Routing
Source Routing vs. Distributed Routing
Source routing
- The routing control unit in the switches is simplified; the route is computed at the source.
- Headers containing the route tend to be larger, which increases overhead.
Distributed routing
- The next hop is computed at each switch by a finite-state machine or a look-up table.
Deterministic Routing vs. Adaptive Routing
Deterministic routing
- Always follows a specified path.
- Easy to implement and supports in-order delivery.
Adaptive routing
- Chooses different paths based on congestion and faults; destroys in-order delivery.
- Decisions use historical channel load information, length of queues, and status of nodes and links.
Routing Challenges
Livelock (in adaptive routing)
- Arises from an unbounded number of allowed non-minimal hops.
- Solution: restrict the number of non-minimal hops allowed.
Deadlock
- Arises from a set of packets being blocked, each waiting only for network resources (i.e., links, buffers) held by other packets in the set.
- Probability increases with increased traffic and decreased availability.
Routing Dependent Deadlock

[Figure: Routing of packets in a 2D mesh and the corresponding channel dependency graph. ci = channel i, di = destination node i, si = source node i, pi = packet i.]
Routing Dependent Deadlock Avoidance

Deterministic Routing in 2D mesh using Dimension Ordered Routing
Establish an ordering on all resources based on network dimension.
Example: X-Y Routing. First route horizontally until the X coordinate matches the destination, then route vertically until the Y coordinate matches.
[Figure: X-Y routing]
Result: no cycle in the channel dependency graph.
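Dimension-ordered X-Y routing is short enough to write out. The sketch below assumes routers are addressed as (x, y) with x as the horizontal coordinate, and the function name `xy_route` is hypothetical.

```python
def xy_route(src, dst):
    """Return the list of (x, y) routers visited from src to dst, inclusive."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:                 # X dimension first
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:                 # then Y dimension
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

Because a packet never turns from the Y dimension back into the X dimension, no cyclic channel dependency can form.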

Deadlock Free Adaptive Routing in 2D Mesh: Turn Model
[Figure: Permitted turns for deterministic routing and for the adaptive West-First, North-Last, and Negative-First turn models]
Reference: Glass, C. J. and Ni, L. M. (1992), Turn Model for Adaptive Routing, Proceedings of International Symposium on Computer Architecture, pp. 278-287.

Deadlock Free Adaptive Routing in 2D Mesh: Odd-Even Turn Model
Rule 1. No packet is allowed to take an EN turn or an ES turn at any node located in an even column.
Rule 2. No packet is allowed to take an NW turn or an SW turn at any node located in an odd column.
Reference: Chiu, G. M. (2000), The Odd-Even Turn Model for Adaptive Routing, IEEE Transactions on Parallel and Distributed Systems, pp. 729-738.
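The two odd-even rules reduce to a small lookup. In this sketch a turn is named by the direction of travel before and after it ("EN" = travelling east, then turning north), and columns are assumed to be numbered from 0; both the naming and the function `turn_allowed` are illustrative assumptions.

```python
# Turns are named by (direction before, direction after): "EN" = east, then north.
FORBIDDEN_IN_EVEN_COLUMN = {"EN", "ES"}   # Rule 1
FORBIDDEN_IN_ODD_COLUMN = {"NW", "SW"}    # Rule 2


def turn_allowed(turn: str, column: int) -> bool:
    """True if a packet may take `turn` at a node in the given column."""
    if column % 2 == 0:
        return turn not in FORBIDDEN_IN_EVEN_COLUMN
    return turn not in FORBIDDEN_IN_ODD_COLUMN
```

An adaptive router would call such a check for each candidate output port and pick among the permitted ones, for example by congestion.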

Deterministic Routing in 2D Torus and Folded Torus by using Virtual Channels
Messages at a node numbered less than their destination node are routed on the high channels, and messages at a node numbered greater than their destination node are routed on the low channels.
[Figure: Ring of nodes n0 to n3 with routes n0→n2, n1→n3, n2→n0, n3→n1]
Reference: Dally, W. J. and Seitz, C. L., (1987) Deadlock Free Message Routing in Multiprocessor Interconnection Networks, IEEE Transactions on Computers, vol. C-36, no. 5, pp. 547-553.
Deadlock Recovery
Allow deadlock to occur, but once a potential deadlock situation is detected, break at least one of the cyclic dependencies to recover gracefully. The common techniques are:
- Regressive recovery (abort-and-retry): Remove packet(s) from a dependency cycle by killing (aborting) them and later reinjecting (retrying) them into the network after some delay.
- Progressive recovery (preemptive): Remove packet(s) from a dependency cycle by rerouting them onto a deadlock-free lane.

Flow Control Protocol
Globally Asynchronous Locally Synchronous (GALS) style of Communication
Reference: Kundu, S. and Chattopadhyay, S. (2007) Interfacing Cores and Routers in Network-on-Chip Using GALS, IEEE International Symposium on Integrated Circuits (ISIC 2007), pp.
Counter based FIFO
Binary Counter Based
Drawback: There can be considerable ambiguity when a count is read during a count transition, because several bits may change at once.
Gray Code Counter Based
Drawback: The FIFO depth must be a power of 2, so area is wasted when the required depth is not a power of 2.
Reference: Yi, Cheng, Gray code sequences, U.S. Patent 6703950, March 9, 2004.
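Both drawbacks can be seen by counting how many bits change between consecutive pointer values. The sketch below uses the standard binary-reflected Gray code; the helper names are hypothetical.

```python
def binary_to_gray(n: int) -> int:
    """Binary-reflected Gray code of n."""
    return n ^ (n >> 1)


def gray_to_binary(g: int) -> int:
    """Inverse of binary_to_gray."""
    n = 0
    while g:
        n ^= g
        g >>= 1
    return n


def bits_changed(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    # Binary counter: 3 -> 4 flips three bits, so a sample taken mid-transition is ambiguous.
    print(bits_changed(3, 4))
    # Gray counter of depth 8: every step, including the wrap 7 -> 0, flips one bit.
    print([bits_changed(binary_to_gray(i), binary_to_gray((i + 1) % 8)) for i in range(8)])
    # Depth 6 (not a power of 2): the wrap 5 -> 0 flips several bits again.
    print(bits_changed(binary_to_gray(5), binary_to_gray(0)))
```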
Gray Counter Based Dual Clock FIFO
Reference: Cummings, C. E. and Alfke, P. (2002) Simulation and Synthesis Techniques for Asynchronous FIFO Design with Asynchronous Pointer Comparisons, Synopsys Users Group Conference, vol. User Papers.
Functionality of Asynchronous Comparator
Full = ( (waddr == raddr) && (wr_dir != rd_dir) )
Empty = ( (waddr == raddr) && (wr_dir == rd_dir) )
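The comparison can be exercised with a small behavioral model. The slides do not say how `wr_dir` and `rd_dir` are produced, so this sketch assumes each is a flag that toggles whenever its pointer wraps around; the class `FifoPointers` is hypothetical and models neither the two clock domains nor the data storage.

```python
class FifoPointers:
    """Behavioral model of the address comparison only; no clocks or storage."""

    def __init__(self, depth: int):
        self.depth = depth
        self.waddr = self.raddr = 0
        self.wr_dir = self.rd_dir = 0   # assumed: toggles each time the pointer wraps

    @property
    def full(self) -> bool:
        return self.waddr == self.raddr and self.wr_dir != self.rd_dir

    @property
    def empty(self) -> bool:
        return self.waddr == self.raddr and self.wr_dir == self.rd_dir

    def write(self) -> None:
        if self.full:
            raise OverflowError("FIFO full")
        self.waddr += 1
        if self.waddr == self.depth:
            self.waddr, self.wr_dir = 0, self.wr_dir ^ 1

    def read(self) -> None:
        if self.empty:
            raise IndexError("FIFO empty")
        self.raddr += 1
        if self.raddr == self.depth:
            self.raddr, self.rd_dir = 0, self.rd_dir ^ 1
```

Equal addresses alone cannot distinguish full from empty, which is why the extra direction information is needed.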
Metastability
The Full and Empty signals are controlled by both clocks, so there is a probability of metastable states arising. Two-stage synchronizers are used to reduce the probability of metastability. The Full signal is synchronized with wr-clk and the Empty signal is synchronized with rd-clk.
Arbitration
Router Architecture
Input Channel: Input Buffer, Routing Computation Unit, Control Unit
Output Channel: Output Buffer, Arbiter, Control Unit
Reference: Kundu, S. and Chattopadhyay, S. (2008) Network-on-chip architecture design based on Mesh-of-Tree deterministic routing topology, Intl Journal of High Performance Systems Architecture, Vol. 1, No. 3, pp. 163-182.
Wormhole Router Architecture Data Path
[Figure: Wormhole router data path. Each physical channel enters through link control into an input buffer (IB). The header flit goes to the Routing Control Unit (RC), which runs the routing algorithm and passes the output port number to the Arbitration Unit (SA). The Arbitration Unit drives the crossbar control; flits cross the crossbar (ST) into the output buffer (OB) and leave through link control onto the physical channel.]
CRITICAL PATH: IB (Input Buffering) → RC (Route Computation) → SA (Switch Allocation) → ST (Switch Traversal) → OB (Output Buffering)
Flit Traversal Through Wormhole Router
[Figure: Router data path as above, with a pipeline diagram for the packet header and packet payloads 1 to 3. The header flit passes through IB, RC, SA, ST, and OB. Payload flits wait in IB, then pass through ST and OB without RC or SA.]
Stages: IB (Input Buffering), RC (Route Computation), SA (Switch Allocation), ST (Switch Traversal), OB (Output Buffering)
Performance Evaluation
Performance Metrics
Throughput:
$$TP = \frac{(\text{Maximum Accepted Packets}) \times (\text{Packet length})}{(\text{Number of IP blocks}) \times (\text{Total time})}$$
Unit: flits/cycle/IP
Latency: The time (in clock cycles) that elapses between the injection of a message header into the network at the source node and the reception of the tail flit at the destination node.
$$L_{avg} = \frac{\sum_{i=1}^{P} L_i}{P}$$
P = total number of messages, L_i = latency of message i.
Bandwidth: The maximum number of bits that can be sent successfully to the destination through the network per second. It is expressed in bps (bits/sec).
Cost Metrics
Energy dissipation: Energy consumed by routers and links at different workloads. Average energy/packet and average energy/clock cycle are measured.
Area requirements: The percentage of chip area occupied by the switches and links is taken into consideration.
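The throughput and average-latency definitions translate directly into code. This is a minimal sketch with hypothetical function names; the numbers in the usage lines are made-up inputs, not results from the slides.

```python
def throughput(accepted_packets: int, packet_length_flits: int,
               num_ip_blocks: int, total_cycles: int) -> float:
    """Throughput in flits/cycle/IP."""
    return (accepted_packets * packet_length_flits) / (num_ip_blocks * total_cycles)


def average_latency(latencies_cycles: list[int]) -> float:
    """Mean of the per-message latencies L_i over P messages."""
    return sum(latencies_cycles) / len(latencies_cycles)


if __name__ == "__main__":
    # Hypothetical run: 30,000 packets of 64 flits accepted by 32 IPs in 200,000 cycles.
    print(throughput(30_000, 64, 32, 200_000))
    print(average_latency([100, 120, 140]))
```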
Simulator Design for Performance Evaluation

Types of Simulator
1. Cycle Accurate: Samples the state of the signals at every clock edge (positive or negative). Much faster than event-driven simulation.
2. Event Driven: Most accurate, as every active signal is calculated for every device during the clock cycle as it propagates. Each signal is simulated for its value and its time of occurrence. Excellent for timing analysis and for verifying race conditions. Computation intensive (depends on the number of activities) and hence very slow.
To calculate performance metrics like throughput and latency, the delay after each and every gate is not required. In that case a cycle-accurate simulator is the best choice.
Existing NoC Simulators

Some Existing NoC Simulators and their Drawbacks
Simulator        Origin                           Drawback
NIRGAM           University of Southampton, UK    Limited to mesh topology; no power evaluation
MPARM - Xpipes   University of Bologna, Italy     Not freely available
NS2              Open source                      Packet-level transaction
Cycle Accurate Simulator for NoC Modeling

The simulator should operate at the granularity of the individual architectural components of the router. SystemC is normally preferred. Traffic generators are used for evaluating the performance of the NoC.
[Figure: Simulation flow. Traffic generation (Poisson distribution, self-similar traffic, application-specific traffic) feeds the network of routers (input channel: input buffer, routing computation unit, control unit; output channel: output buffer, arbiter, control unit), which is measured for 1. Throughput 2. Latency 3. Bandwidth.]
Traffic Generator
Application-driven traffic is best suited for performance evaluation. Because it is often unavailable, synthetic traffic source models are also used. The nature of traffic in a NoC is generally bursty.
A Poisson process
- When observed on a fine time scale, it will appear bursty.
- The burst length of a Poisson arrival process tends to be smoothed by averaging over a long enough time scale.
- A Poisson process fails to capture the actual burstiness of NoC traffic.
- Short-range dependence
Reference: Varatkar, G.V. and Marculescu, R. (2004) On-chip traffic modeling and synthesis for MPEG-2 video applications, IEEE Trans. on Very Large Scale Integration (VLSI) Systems, Vol. 12, No. 1, pp. 108-119.
Traffic Generator
A Self-Similar (fractal) process
When aggregated over a wide range of time scales, it maintains its bursty characteristic. Self-similarity manifests itself in several equivalent fashions:
- Slowly decaying variance
- Long-range dependence
- Non-degenerate autocorrelations
- Heavy tails
A self-similar process can be generated by superposing ON-OFF Pareto sources.
Reference: Park, K. and Willinger, W. (2000) Self-Similar network traffic and performance evaluation, A Wiley-Interscience Publication, John Wiley & Sons, Inc.
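A minimal sketch of superposed ON-OFF Pareto sources follows. The shape parameters (1.9 for ON periods, 1.25 for OFF periods), the one-flit-per-ON-cycle behavior, and the function names are all assumptions chosen for illustration; the slides do not give these values.

```python
import random


def on_off_pareto_source(num_cycles, alpha_on=1.9, alpha_off=1.25, rng=None):
    """Return a 0/1 list: 1 means the source injects a flit in that cycle."""
    rng = rng or random.Random()
    trace, state = [], 1
    while len(trace) < num_cycles:
        alpha = alpha_on if state else alpha_off
        # paretovariate returns values >= 1 with a heavy (power-law) tail.
        period = int(rng.paretovariate(alpha))
        trace.extend([state] * period)
        state ^= 1
    return trace[:num_cycles]


def aggregate(num_sources, num_cycles, seed=0):
    """Superpose several independent ON-OFF sources (flits offered per cycle)."""
    rng = random.Random(seed)
    traces = [on_off_pareto_source(num_cycles, rng=rng) for _ in range(num_sources)]
    return [sum(column) for column in zip(*traces)]


if __name__ == "__main__":
    load = aggregate(num_sources=8, num_cycles=40)
    print(load)
```

Because the ON and OFF durations are heavy tailed, long bursts and long silences remain visible even after averaging over many cycles.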
Traffic Parameter
Offered Load: Number of packets injected in a particular time interval.
Locality Factor: Ratio of the traffic destined to the local cluster from a core to the total traffic injected by that core. Locality Factor = 0 signifies uniformly distributed traffic.
For example, in a 4x4 mesh, the distances (d) of the destinations from one corner source (S) are d = 1, 2, 3, 4, 5, and 6. If locality factor = 0.5, then:
- 50 percent of the traffic will go to the cluster having d = 1.
- The remaining 50 percent will be distributed as:
  o 15% will go to the cluster having d = 2
  o 12.5% will go to the cluster having d = 3
  o 10% will go to the cluster having d = 4
  o 7.5% will go to the cluster having d = 5
  o 5% will go to the cluster having d = 6
If there is more than one core in a cluster, the traffic will be randomly distributed among them.
[Figure: 4x4 mesh with source S at one corner (d = 0) and clusters at d = 1 to d = 6]
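The locality-factor example can be turned into a destination picker. The sketch below hard-codes the per-distance shares listed for locality factor 0.5 and a corner source in a 4x4 mesh; the function names are hypothetical, and no general formula for other locality factors is implied.

```python
import random
from collections import defaultdict

# Share of traffic per distance for locality factor 0.5, as listed in the example.
SHARE_BY_DISTANCE = {1: 0.50, 2: 0.15, 3: 0.125, 4: 0.10, 5: 0.075, 6: 0.05}


def clusters_from(source, rows=4, cols=4):
    """Group every other node by its Manhattan distance from `source`."""
    groups = defaultdict(list)
    for r in range(rows):
        for c in range(cols):
            d = abs(r - source[0]) + abs(c - source[1])
            if d > 0:
                groups[d].append((r, c))
    return groups


def pick_destination(source, rng):
    """Pick a cluster by its traffic share, then a core uniformly within it."""
    groups = clusters_from(source)
    distances = sorted(SHARE_BY_DISTANCE)
    d = rng.choices(distances, weights=[SHARE_BY_DISTANCE[k] for k in distances])[0]
    return rng.choice(groups[d])


if __name__ == "__main__":
    rng = random.Random(0)
    print([pick_destination((0, 0), rng) for _ in range(5)])
```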
Performance of any network depends on the following network parameters.
- Topology
- Locality factor of the traffic
- Buffer position and buffer depth
- Switching techniques
- Number of cores attached
Theoretically,
Throughput ∝ (Number of Links) / (Average Distance)
Latency ∝ Average Distance
Here, the wormhole router architecture is used to form the Mesh and BFT networks with the following parameters:
- Number of cores attached = 32
- Message Length = Packet Length = 64 flits
- Each flit consists of 32 bits
- Total simulation cycles = 2 lakh (200,000) with a 10,000-cycle settling time
Throughput varies with topology and locality factor
Throughput = Maximum Accepted Traffic in flits/cycle/IP We kept buffer depth = 6 in both input and output channels of the router in all the cases
Latency decreases with increase in Locality Factor in different topologies
We kept buffer depth = 6 in both input and output channels of the router in all the cases
Power Evaluation Flow

Router Power Evaluation
Operating Condition: Process = 1, Voltage = 1 volt, Temp = 75 °C

Reference: Synopsys PrimePower, Design Vision manual (Version Y-2006.06).

Link Length Estimation: Mesh
Estimated Length of Wires: 1.25 mm, 2.5 mm

Link Length Estimation: Butterfly Fat Tree (BFT)
Estimated Length of Wires: 1.25 mm, 5.0 mm

Interconnect Modeling
Copper wire (resistivity = 17 nΩ-m) of Metal Layer 4 (semi-global) has been taken.
To reduce the wiring area, the minimum dimensions of Metal Layer 4 have been chosen:
- Width (W) = 0.2 µm
- Spacing (S) = 0.2 µm
- Pitch = W + S = 0.4 µm
- Thickness (T) = 0.5 µm
- H = 0.75 µm
- Dielectric Constant = 2.9
[Figure: Cross-section of interconnects (Layer 3, Layer 4, Layer 5)]
Link Energy Evaluation
The parasitic components (R, C, L) of a three-wire model have been extracted with the field solver tool of HSPICE. The energy consumption of the middle wire for different transitions is also obtained from HSPICE.
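As a rough cross-check on the extracted resistance, the DC resistance of a rectangular wire can be estimated by hand from the dimensions above. The sketch assumes the units are nΩ-m for resistivity and µm for width and thickness, and uses R = ρL/(W·T); it is only a first-order estimate, not a substitute for the field-solver extraction, and the function name is hypothetical.

```python
RESISTIVITY = 17e-9      # ohm * metre (copper, assumed 17 nano-ohm metre)
WIDTH = 0.2e-6           # metre (assumed 0.2 micrometre)
THICKNESS = 0.5e-6       # metre (assumed 0.5 micrometre)


def wire_resistance(length_m: float) -> float:
    """DC resistance R = rho * L / (W * T) of a rectangular wire, in ohms."""
    return RESISTIVITY * length_m / (WIDTH * THICKNESS)


if __name__ == "__main__":
    # Link lengths estimated for Mesh and BFT: 1.25 mm, 2.5 mm, 5.0 mm.
    for length_mm in (1.25, 2.5, 5.0):
        print(length_mm, "mm:", round(wire_resistance(length_mm * 1e-3), 1), "ohm")
```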

Three wire modeling
- Data rate: 32 × 200 Mbits/sec
- Driver sizes are designed based on the length of the wire.
- Load capacitance at the other end of the wire is 5 fF.
- A Look-Up Table (LUT) is made for the middle-line energy consumption.
Energy Consumption in Mesh Topology

Network Energy = Router Energy + Link Energy
Simulation runs for 2 lakh (200,000) clock cycles with a clock period of 5 ns
Internal Power Dominates
Comparison of Energy Consumption
Energy-Performance Trade-Off

Throughput Variation with FIFO Depth & Position in Mesh
FIFO_Depth_4-4 => Input Channel FIFO Depth = 4, Output Channel FIFO Depth = 4
FIFO_Depth_4-6 => Input Channel FIFO Depth = 4, Output Channel FIFO Depth = 6
FIFO_Depth_6-6 => Input Channel FIFO Depth = 6, Output Channel FIFO Depth = 6
FIFO_Depth_4-0 => Input Channel FIFO Depth = 4, No FIFO at Output Channel
FIFO_Depth_6-0 => Input Channel FIFO Depth = 6, No FIFO at Output Channel

Latency Variation with FIFO Depth & Position in Mesh

Energy Variation with FIFO Depth & Position in Mesh
Simulation runs for 2 lakh (200,000) clock cycles with a clock period of 5 ns

Trade-Off in Mesh at saturation (load = 160)
FIFO_Depth_6-0 shows the best Energy-Performance Trade-Off
Network Energy Consumption in Mesh after FIFO Optimization

Simulation runs for 2 lakh (200,000) clock cycles with a clock period of 5 ns
Internal Power Still Dominates

We kept FIFO depth = 6 in input channel and no FIFO at output channel
Comparison of Energy Consumption after FIFO Optimization

Internal Power
[Figure: Netlist view of a D-type flip-flop with synchronous clear input in Synopsys Design Vision]
- Internal power = short-circuit power + internal node switching power
- The output node of the clock buffer switches continuously with a free-running clock
- To minimize internal power: stop the clock when the network is idle
Internal Power Minimization
Netlist View of FIFO Memory
Network Energy Consumption in Mesh after Clock Gating in FIFO
Simulation runs for 2 lakh (200,000) clock cycles with a clock period of 5 ns
Comparison of Energy Consumption after Clock Gating in FIFO

Network Area Comparison
Topology   % SoC Area Overhead
BFT        2.424
Mesh       3.701
Total Core Area = (32 * 2.5 * 2.5) sq. mm. = 200 sq. mm.
Scalability Measurement
Scalability is the property of exhibiting performance proportional to the number of cores employed. As the size of a scalable system is increased, a corresponding increase in performance is obtained.
BW = [(Throughput * Number of cores attached * Number of bits in a flit) / clock period]
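The bandwidth expression is a one-line calculation. In the sketch below, the throughput value in the usage line is a made-up input for illustration; the core count, flit width, and clock period are the ones used elsewhere in the tutorial. The function name is hypothetical.

```python
def network_bandwidth_bps(throughput, num_cores, bits_per_flit, clock_period_s):
    """BW = throughput (flits/cycle/IP) * cores * bits per flit / clock period."""
    return throughput * num_cores * bits_per_flit / clock_period_s


if __name__ == "__main__":
    # Hypothetical throughput of 0.5 flits/cycle/IP, 32 cores, 32-bit flits, 5 ns clock.
    print(network_bandwidth_bps(0.5, 32, 32, 5e-9))
```

For a scalable network, evaluating this with the measured throughput at increasing core counts should show bandwidth growing roughly in proportion to the number of cores.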
Head-of-Line Blocking in Wormhole Router
[Figure: 2D mesh, no VCs, XY routing]
Introduction of Virtual Channels

Multiple virtual channels (VCs) are multiplexed on a single physical link to improve performance. Payload flits use the VC acquired by the header flit, while the tailer flit releases the VC.
[Figure: Switch A and Switch B each hold virtual channels VC0 and VC1, connected through a MUX, the physical data link, and a DEMUX, under a VC scheduler and VC control]
Reference: Dally, W. J. (1992) Virtual Channel Flow Control, IEEE Trans. on Parallel and Distributed Systems, Vol. 3, No. 2, pp. 194-205.
Virtual Channels
[Figure: 2D mesh, 2 VCs (VC0, VC1), XY routing]
VCs avoid HOL blocking.
Virtual Channels
[Figure: 2D mesh, 2 VCs (VC0, VC1), XY routing; packets are blocked (X) where no VCs are available]
VCs mitigate HOL blocking but cannot eliminate it.
Virtual Channel Based Router Architecture
[Figure: Virtual channel based router. Each physical channel enters through link control and a DEMUX into several input buffers (one per VC); MUXes feed the crossbar, which is driven by the routing control and arbitration unit; outputs leave through link control onto the physical channels.]
Reference: N. Kavaldjiev, G. J. M. Smit, and P. G. Jansen, A Virtual Channel Router for On-Chip Networks, in Proc. of IEEE Intl SOC Conference. IEEE Computer Society Press, pp. 289-293, 2004.
Determination of Number of Virtual Channels
- Throughput increases up to 4 virtual channels, but beyond that it saturates.
- Energy dissipation increases with the number of virtual channels.
- For the energy-performance trade-off, 4 virtual channels per physical channel is preferred.
Reference: Pande, P. P., Grecu, C., Jones, M., Ivanov, A. and Saleh, R. (2005) Performance evaluation and design trade-offs for MP-SOC interconnect architectures, IEEE Trans. on Computers, Vol. 54, No. 8, pp. 1025-1040.
Throughput Improvement in Mesh using Virtual Channel Architecture
No. of Virtual Channels = 4
Latency Improvement in Mesh using Virtual Channel Architecture
No. of Virtual Channels = 4
Energy Overhead in Mesh using Virtual Channel Architecture

Performance of Some Other Topologies

Reference: Pande, P. P., Grecu, C., Jones, M., Ivanov, A. and Saleh, R. (2005) Performance evaluation and design trade-offs for MP-SOC interconnect architectures, IEEE Trans. on Computers, Vol. 54, No. 8, pp. 1025-1040.
Network Area Comparison with Virtual Channel Architecture
% SoC Area Overhead
            Without VC    With VC
Mesh        3.701         6.145
BFT         2.424         3.507
Total Core Area = (32 * 2.5 * 2.5) sq. mm = 200 sq. mm
Quality of Service (QoS) Support

Conceptually, two disjoint networks:
- a network with throughput and latency guarantees (guaranteed throughput, GT)
- a network without those guarantees (best-effort, BE)
Several types of commitment in the network

Architectural Modification for Supporting QoS
Combine guaranteed worst-case behavior with good average resource usage.
Reference: Rijpkema, E., Goossens, K., Radulescu, A., Dielssen, J., Meerbergen, J. V., Wielage, P., and Waterlander, E. (2003) Trade-offs in the Design of a Router with Both Guaranteed and Best-Effort Services for Networks on Chip, IEE Proc. Computers and Digital Techniques, Vol. 150, No. 5, pp. 294-302.
Lecture 3
Application Mapping
Task of Application Mapping
Mapping Problem Formulation: Core Graph

- Directed graph $G = (V, E)$
- Each vertex $v_i \in V$ represents a core
- Each edge $e_{i,j} \in E$ represents communication between $v_i$ and $v_j$
- The weight of edge $e_{i,j}$, denoted $comm_{i,j}$, is the bandwidth requirement
Mapping Problem Formulation: NoC Topology Graph

- A directed graph $P = (U, F)$
- Each vertex $u_i \in U$ is a router
- Each edge $f_{i,j} \in F$ represents a direct communication between the vertices
- The weight of edge $f_{i,j}$, denoted $bw_{i,j}$, represents the available bandwidth across the edge
Map Function
- $map: V \rightarrow U$
- Each edge $k$ of the core graph represents a commodity $d_k$
- Each commodity has a value $vl(d_k)$ representing the bandwidth requirement of the communication from $v_i$ to $v_j$
- Bandwidth constraint: an edge in the topology graph must have enough bandwidth to accommodate all commodities passing through it
- Minimize communication cost: $\sum_k vl(d_k) \times dist(source(d_k), dest(d_k))$
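The cost term of the map function can be illustrated with a minimal sketch. It assumes a 2D mesh in which dist is the hop (Manhattan) distance between routers. The names `manhattan` and `communication_cost` and the four-core example values are hypothetical, and the bandwidth constraint is not checked here.

```python
from typing import Dict, Tuple

Core = str
Tile = Tuple[int, int]  # (row, col) of a router in a 2D mesh (assumed topology)


def manhattan(a: Tile, b: Tile) -> int:
    """Hop distance between two mesh routers under minimal routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def communication_cost(comm: Dict[Tuple[Core, Core], float],
                       mapping: Dict[Core, Tile]) -> float:
    """Sum over all commodities of bandwidth * hop distance."""
    return sum(bw * manhattan(mapping[src], mapping[dst])
               for (src, dst), bw in comm.items())


# Illustrative core graph: edge weights are bandwidth requirements.
comm = {("c0", "c1"): 100.0, ("c1", "c2"): 50.0, ("c0", "c3"): 10.0}

# Two candidate mappings onto a 2x2 mesh.
good = {"c0": (0, 0), "c1": (0, 1), "c2": (1, 1), "c3": (1, 0)}
bad = {"c0": (0, 0), "c1": (1, 1), "c2": (0, 1), "c3": (1, 0)}

good_cost = communication_cost(comm, good)
bad_cost = communication_cost(comm, bad)
```

In the first mapping every communicating pair is one hop apart, so its cost is the plain sum of the bandwidths; the second places the heaviest pair two hops apart and is penalized accordingly.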
Mapping Solution
Mapping Algorithms
- Mapping problem is intractable
- Several approaches are possible: ILP; heuristics (PMAP, GMAP, PBB, NMAP, BMAP, etc.); meta-search heuristics (GA, PSO, Simulated Annealing)
- Other variants of the problem combine: task scheduling, power consumption, alternative routing paths, etc.
Mapping with Minimum-Path Routing (NMAP)

Three phases: Initialize, Minimum path computation, Iterative improvement
Initialize:
1. The core with maximum communication demand is placed onto the node with the maximum number of neighbors
2. Select the core that communicates most with the mapped cores
3. Place the selected core onto the node that minimizes communication cost with the mapped ones
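The initialization steps can be sketched as follows. This is not the published NMAP code: it assumes a 2D mesh, measures cost as bandwidth times hop distance, and repeats steps 2 and 3 until every core is placed. The function name `nmap_initial_mapping` and the example core graph are hypothetical.

```python
from typing import Dict, Tuple

Core = str
Tile = Tuple[int, int]  # (row, col) of a mesh router


def manhattan(a: Tile, b: Tile) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nmap_initial_mapping(comm: Dict[Tuple[Core, Core], float],
                         rows: int, cols: int) -> Dict[Core, Tile]:
    cores = sorted({c for edge in comm for c in edge})
    tiles = [(r, c) for r in range(rows) for c in range(cols)]
    if len(cores) > len(tiles):
        raise ValueError("more cores than routers")

    def traffic(a: Core, b: Core) -> float:
        # Bandwidth exchanged between a and b in both directions.
        return comm.get((a, b), 0.0) + comm.get((b, a), 0.0)

    def degree(t: Tile) -> int:
        # Number of mesh neighbours of router t.
        return sum(manhattan(t, other) == 1 for other in tiles)

    # Step 1: heaviest core goes to the router with the most neighbours.
    first = max(cores, key=lambda c: sum(traffic(c, o) for o in cores if o != c))
    centre = max(tiles, key=degree)
    mapping = {first: centre}
    free = [t for t in tiles if t != centre]

    while len(mapping) < len(cores):
        unmapped = [c for c in cores if c not in mapping]
        # Step 2: core that communicates most with the already mapped cores.
        nxt = max(unmapped, key=lambda c: sum(traffic(c, m) for m in mapping))
        # Step 3: free router with the lowest cost to the mapped cores.
        best = min(free, key=lambda t: sum(
            traffic(nxt, m) * manhattan(t, mapping[m]) for m in mapping))
        mapping[nxt] = best
        free.remove(best)
    return mapping


comm = {("c0", "c1"): 100.0, ("c1", "c2"): 50.0, ("c0", "c3"): 10.0}
initial = nmap_initial_mapping(comm, rows=3, cols=3)
```

On the 3x3 mesh the heaviest core, `c1`, lands on the centre router, and its main partner `c0` is placed on an adjacent router.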
Mapping with Minimum-Path Routing

Shortest Path:
- Minimum path routing
- Commodities are sorted in descending order of flow
- For each commodity, the shortest path is identified
- As soon as a commodity's path is finalized, the cost of each edge on the path is increased by the value of the commodity
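A minimal sketch of this routing phase is given below. It assumes a 2D mesh, restricts each commodity to minimal (shortest-hop) paths by only stepping towards the destination, and uses the accumulated link load as the edge cost in Dijkstra's algorithm. The names `mesh_links`, `least_loaded_minimal_path` and `route_commodities` are hypothetical, and commodities are keyed by source and destination router for brevity.

```python
import heapq
from typing import Dict, List, Tuple

Tile = Tuple[int, int]
Link = Tuple[Tile, Tile]


def mesh_links(rows: int, cols: int) -> Dict[Link, float]:
    """Directed links of a rows x cols mesh, each starting with zero load."""
    load: Dict[Link, float] = {}
    for r in range(rows):
        for c in range(cols):
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    load[((r, c), (nr, nc))] = 0.0
    return load


def least_loaded_minimal_path(load: Dict[Link, float],
                              src: Tile, dst: Tile) -> List[Tile]:
    """Dijkstra over links that move towards dst, weighted by current load."""
    dist = {src: 0.0}
    prev: Dict[Tile, Tile] = {}
    heap = [(0.0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == dst:
            break
        if d > dist[node]:
            continue
        r, c = node
        steps = []
        if c != dst[1]:
            steps.append((r, c + (1 if dst[1] > c else -1)))
        if r != dst[0]:
            steps.append((r + (1 if dst[0] > r else -1), c))
        for nxt in steps:
            nd = d + load[(node, nxt)]
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


def route_commodities(commodities: Dict[Link, float], rows: int, cols: int):
    load = mesh_links(rows, cols)
    routes: Dict[Link, List[Tile]] = {}
    # Largest flows are routed first.
    for (src, dst), bw in sorted(commodities.items(), key=lambda kv: -kv[1]):
        path = least_loaded_minimal_path(load, src, dst)
        for a, b in zip(path, path[1:]):
            load[(a, b)] += bw  # finalized path makes its links more costly
        routes[(src, dst)] = path
    return routes, load


commodities = {((0, 0), (0, 1)): 100.0, ((0, 0), (1, 1)): 40.0}
routes, load = route_commodities(commodities, rows=2, cols=2)
```

In this example the smaller flow avoids the link already carrying the larger one and is routed through router (1, 0) instead.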
Mapping with Minimum-Path Routing

Iterative Improvement:
- Iteratively swap vertices pair-wise to obtain a better mapping
Traffic splitting:
- Multiple shortest paths may exist
- Formulate a multi-commodity flow problem to satisfy bandwidth requirements for solutions that have lower communication cost but do not satisfy all the bandwidth requirements
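The pair-wise swapping step can be sketched as a simple local search. This is an illustration only: it assumes a 2D mesh, scores a mapping by bandwidth times hop distance, swaps only cores that are already placed, and does not re-route or check bandwidth after each swap. The name `improve_by_swapping` and the example data are hypothetical.

```python
from itertools import combinations
from typing import Dict, Tuple

Core = str
Tile = Tuple[int, int]


def manhattan(a: Tile, b: Tile) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def communication_cost(comm: Dict[Tuple[Core, Core], float],
                       mapping: Dict[Core, Tile]) -> float:
    return sum(bw * manhattan(mapping[s], mapping[d])
               for (s, d), bw in comm.items())


def improve_by_swapping(comm: Dict[Tuple[Core, Core], float],
                        mapping: Dict[Core, Tile]):
    """Keep swapping pairs of cores while a swap lowers the cost."""
    best = dict(mapping)
    best_cost = communication_cost(comm, best)
    improved = True
    while improved:
        improved = False
        for a, b in combinations(sorted(best), 2):
            trial = dict(best)
            trial[a], trial[b] = trial[b], trial[a]
            cost = communication_cost(comm, trial)
            if cost < best_cost:  # accept only strict improvements
                best, best_cost, improved = trial, cost, True
    return best, best_cost


comm = {("c0", "c1"): 100.0, ("c1", "c2"): 50.0, ("c0", "c3"): 10.0}
start = {"c0": (0, 0), "c1": (1, 1), "c2": (0, 1), "c3": (1, 0)}
start_cost = communication_cost(comm, start)
final, final_cost = improve_by_swapping(comm, start)
```

Because only strictly better swaps are accepted, the loop terminates; like any local search, it may stop at a local minimum on larger instances.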
Binomial Mapping Algorithm (BMAP)

- NMAP algorithm is $O(N^4 \log N)$
- BMAP is a three-stage algorithm with complexity $O(N^2 \log N)$:
- Binomial Merging Iteration
- Topology Mapping
- Hardware Cost Optimization
BMAP: Binomial Merging Iteration

1. Calculate IP Ranking: the rank of IP core $i$ is
   $ranking(i) = \sum_{j=1}^{N} \big( requirement(i, j) + requirement(j, i) \big)$,
   where $requirement(i, j)$ is the bandwidth requirement from $i$ to $j$
2. Merge IP Set: based on ranking, merge two IP sets at a time ($\log N$ time)
3. Refreshing IP Set: ranking is recalculated. The ranking of IP set $k$ generated by merging IP set $i$ and IP set $j$ is
   $ranking(k) = ranking(i) + ranking(j) - requirement(i, j) - requirement(j, i)$
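The two ranking formulas can be written out as a short sketch. It covers only the rank computation and the rank refresh after a merge, following the formulas as stated on the slide; the rule for choosing which two IP sets to merge is not modelled. The names `ranking` and `merged_ranking` and the bandwidth values are hypothetical.

```python
from typing import Dict, Iterable, Tuple

Requirement = Dict[Tuple[str, str], float]


def ranking(req: Requirement, cores: Iterable[str]) -> Dict[str, float]:
    """ranking(i) = sum over j of requirement(i, j) + requirement(j, i)."""
    cores = list(cores)
    return {
        i: sum(req.get((i, j), 0.0) + req.get((j, i), 0.0)
               for j in cores if j != i)
        for i in cores
    }


def merged_ranking(rank_i: float, rank_j: float,
                   req_ij: float, req_ji: float) -> float:
    """Rank of the set formed by merging i and j, per the slide's formula."""
    return rank_i + rank_j - req_ij - req_ji


# Illustrative bandwidth requirements between three IP cores.
req = {("a", "b"): 30.0, ("b", "a"): 10.0, ("a", "c"): 5.0}
ranks = ranking(req, ["a", "b", "c"])
merged = merged_ranking(ranks["a"], ranks["b"],
                        req.get(("a", "b"), 0.0), req.get(("b", "a"), 0.0))
```

Here cores `a` and `b` exchange most of the traffic, so they rank highest and are natural candidates for the first merge.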
Merging: An Example
BMAP: Topology Mapping and Traffic Surface Creation

- After mapping, a traffic surface is generated
- It shows the traffic load of each router
- Minimal path routing is used
- Based on this surface, hardware can be optimized by selecting proper routers from the library
BMAP: Hardware Cost Optimization

1. Dummy Router Elimination:
- Dummy routers are added at the start to have $4^n$ routers
- BMAP puts these routers at the boundaries, hence they can be eliminated
2. Router Selection:
- Sharing a single buffer among low-bandwidth input channels
- Choice of router is made from the library
3. Unfolding:
- Add additional routers and links for larger bandwidth requirements
BMAP: Hardware Optimization - An Example
Network on Chip Synthesis: SUNMAP + xpipes
SUNMAP: Topology Mapping

- Optimizes for area, power or delay within design constraints
- Uses heuristics to perform mapping onto topologies: mesh, torus, hypercube, Clos, and butterfly
- Built-in floor-planner for area and power analysis
- Choice of different routing functions
SUNMAP: Topology Mapping

Heuristic approach with several phases:
- Initial mapping using a greedy algorithm (from communication graph)
- Compute optimal routing (using flow formulation)
1. Floorplan solution
2. Check area and bandwidth constraints
3. Compute mapping cost
- Iterative improvement loop (Tabu search)
Allows manual and interactive topology creation
System configuration
// In this topology: 8 cores, 8 memories, 4x4 torus
// ----------------------------- IP cores
// name, switch number, clock divider, buffers, type
core(core_0, switch_0, 1, 6, initiator);
core(mem_8, switch_11, 1, 6, target:0x00);
[...]
// ----------------------------- switches
// name, input ports, output ports, buffers
switch(switch_0, 5, 5, 6);
switch(switch_1, 5, 5, 6);
[...]
// ----------------------------- links
// name, source, destination
link(link0, switch_0, switch_1);
link(link1, switch_1, switch_0);
[...]
// ----------------------------- routes
// source, destination, hops
route(core_0, pm_8, switches:0,1,5,6,7,11);
route(core_1, pm_9, switches:1,5,9,8);
route(core_2, pm_10, switches:2,6,5,9);
route(core_3, pm_11, switches:3,2,6,10);
[...]
Specifies:
- NIs (I/Os, clocks, buffers)
- switches (I/Os, buffers)
- links
- routes
xpipes Compiler: Platform Generation

- Input:
- System configuration: topology, routing tables, parameters (flit width, buffering, ...)
- Component library
- Creates a class template for each type of network component, based upon component configuration (I/O ports, buffer sizing)
- Hierarchical instantiation of the platform in SystemC
Network-on-Chip Synthesis Tool: xpipes
MPARM Architecture
Reference: Bertozzi, D. and Benini, L. (2004) xpipes: A Network-on-Chip Architecture for Giga Scale Systems-on-Chips, IEEE Circuits and Systems Magazine, pp. 18-31.
Lecture 5 Conclusion and Future of Network-on-Chip
Network-on-Chip: At a Glance

Topics Covered
- Need of Network-on-Chip
- NoC Architecture Design
- Performance Evaluation
- Design Trade-Off
- Application Mapping on NoC
- Signal Integrity and Reliability Issues
Some More Topics

- Impact of Higher Communication Layers in NoC Performance
- Test and Verification of NoC
- Thermal Modeling of NoC
- Metrics and Benchmarks for NoC
- Floorplan-aware NoC architecture optimization
- Fault Tolerant Architecture in NoC
- CAD Tools for NoC
Limitation of 2D Network-on-Chip

- The conventional 2D integrated circuit (IC) has limited floor-planning choices, and consequently it limits the performance enhancements arising out of NoC architectures.
- Need for more and more bandwidth, but not at the cost of increased power consumption.
Reference: Carloni, L. P., Pande, P. P., and Xie, Y. (2009) Networks-on-Chip in emerging interconnect paradigms: Advantages and Challenges, ACM/IEEE Intl. Symp. on Networks-on-Chip, pp. 93-102.
NoC Research Groups in Foreign Universities

1. Prof. Luca Benini, University of Bologna, Italy.
2. Prof. Giovanni De Micheli, EPFL, Switzerland.
3. Prof. William J. Dally, Stanford University, USA.
4. Prof. Partha Pratim Pande, Washington State University, USA.
5. Prof. Radu Marculescu, Carnegie Mellon University, USA.
6. Prof. Bashir M. Al-Hashimi, University of Southampton, UK.
7. Prof. Chita R. Das, Pennsylvania State University, USA.
8. Prof. Niraj K. Jha, Princeton University, USA.
9. Prof. Sashi Kumar, Jonkoping University, Sweden.
10. Prof. Axel Jantsch, Royal Institute of Technology (KTH), Sweden.
11. Prof. Jari Nurmi, Tampere University of Technology, Finland.
12. Prof. Andre Ivanov, University of British Columbia, Canada.
13. Prof. Resve Saleh, University of British Columbia, Canada.
14. Prof. Israel Cidon, Technion-Israel Institute of Technology, Israel.
15. Dr. Davide Bertozzi, University of Bologna, Italy.
16. Dr. Srinivasan Murali, EPFL, Switzerland.
and many more
NoC Research in Indian Universities

1. Prof. Santanu Chattopadhyay, Indian Institute of Technology, Kharagpur.
2. Prof. S. K. Nandy, Indian Institute of Science, Bangalore.
3. Prof. Bharadwaj Amruthur, Indian Institute of Science, Bangalore.
4. Prof. M. R. Bhujade, Indian Institute of Technology, Bombay.
Journals, Conferences, and Workshops on NoC

- Microprocessors and Microsystems Journal, Elsevier (MICPRO)
- IEEE/ACM International Symposium on Networks-on-Chip
- IEEE Intl. Workshop on Network on Chip Architectures (NoCArc)
NoC Research in Industries

- Tilera Corporation
- Arteris Inc.
- Silistix Inc.
- NXP Semiconductor (Aethereal)
- IBM Corporation (Cyclops-64/Blue Gene)
Network-on-Chip Books
Bibliography
For a detailed, up-to-date reference list, the audience is directed to the following link:
http://www.cl.cam.ac.uk/~rdm34/onChipNetBib/onChipNetwork.pdf
Below we list some of our contributions to NoC research:
[1] S. Kundu and S. Chattopadhyay, Interfacing Cores and Routers in Network-on-Chip Using GALS, IEEE International Symposium on Integrated Circuits (ISIC), 2007.
[2] S. Kundu and S. Chattopadhyay, Mesh-of-Tree Deterministic Routing for Network-on-Chip Architecture, ACM Great Lakes Symposium on VLSI (GLSVLSI), 2008.
[3] S. Kundu, R. P. Dasari, K. Manna, and S. Chattopadhyay, Mesh-of-Tree based scalable Network-on-Chip Architecture, IEEE Region 10 Colloquium and International Conference on Industrial and Information Systems (ICIIS), 2008.
[4] S. Kundu and S. Chattopadhyay, Mesh-of-Tree based Network-on-Chip Architecture Using Virtual Channel based Router, IEEE VLSI Design and Test Conference (VDAT), 2008.
[5] S. Kundu and S. Chattopadhyay, Network-on-chip architecture design based on mesh-of-tree deterministic routing topology, International Journal for High Performance Systems Architecture, Vol. 1, No. 3, pp. 163-182, Inderscience Publisher, 2008.
[6] S. Kundu, R. P. Dasari, K. Manna, and S. Chattopadhyay, Performance Evaluation of Mesh-of-Tree Based Network-on-Chip Using Wormhole Router with Poisson Distributed Traffic, IEEE VLSI Design and Test Conference (VDAT), 2009.
[7] S. Kundu, K. Manna, S. Gupta, K. Kumar, R. Parikh, and S. Chattopadhyay, A Comparative Performance Evaluation of Network-on-Chip Architectures Under Self-Similar Traffic, IEEE International Conference on Advances in Recent Technologies in Communication and Computing (ARTCom), 2009.
Microprocessor Research Laboratory
Thank You
