Executive Summary
ASICs for networking applications typically contain many serial transceivers for off-chip communication. Serial line rates from 2.5 Gb/s to 12.5 Gb/s for backplane and chip-to-chip communication are common in today's designs. Advances in high-density and low-power serial transceiver design enable switching ASICs to pack more than 2 Tb/s of serial transceiver bandwidth.
While serial transceivers move data in and out of an ASIC, on-chip global interconnects move data inside the ASIC. These global interconnects include crossbar switches and busses for sharing on-chip resources. To guarantee quality of service, the on-chip global interconnects are often designed to carry several times the traffic of the serial transceivers; close to 10 Tb/s is not atypical with today's requirements. Process technology, however, is not helping. Wire delay relative to gate delay grows exponentially from one process node to the next, primarily due to ever-increasing wire resistance. Today's ASIC designers add pipeline stages and repeaters (Figure 1) to meet clock cycle requirements at the expense of power consumption, noise, and latency. Advances in on-chip communication are necessary to keep up with those in serial transceivers. This paper describes how the LSI LinkXpress On-Chip Interconnect IP, a companion IP to serial transceivers, enables high-performance on-chip global interconnects without the compromises ASIC designers are making today. The additional performance of the On-Chip Interconnect allows for more architectural options, particularly for crossbar switch designs.
ON-CHIP GLOBAL INTERCONNECT TRENDS IN NETWORKING ASICS
Integrated Serial Transceivers are Ubiquitous
High-bandwidth ASICs are a reality due to advances in high-density and low-power serial transceiver technology. Physical I/O connections to networking ASICs have gone serial. Examples include SAS/SATA, Fibre Channel, XAUI, XFI, InfiniBand, PCI Express, and proprietary serial backplanes from 1 Gb/s to 12.5 Gb/s. ASICs with a high number of serial transceivers are typically switches. The two prevalent architectures are crossbar switches and memory switches, implementing switching functions for IP, TDM, storage, and high-performance computing networks.
Figure 1. Original Wiring Architecture and Implementation Compromises. (b) Implementation compromise: 500 MHz with a double-wide datapath and an extra cycle of latency.
On-chip global interconnects are often shared among masters and slaves on a master-slave bus, which is logically equivalent to two asymmetric crossbar switches. In some cases, global signals are point-to-point, such as the arbitration signals used by a crossbar switch. In any case, these networking ASICs involve many integrated serial transceivers on a large die with many long on-chip global interconnects.

[Figure 2: metal stack from the global routing layers (M6) through the mid-layer metals down to poly and diffusion]

[Figure 3: wire delay normalized by gate delay versus process node, 180 nm through 13 nm, log scale]

ASIC designers are painfully aware of the increasing timing closure difficulty caused by wire delay. Sending a signal from one end of a chip to the other requires various workarounds. Figure 1 illustrates typical workarounds, including repeater insertion, pipelining, and slowing down the clock while widening the datapath. These workarounds compromise noise, power consumption, and latency, slow down the clock frequency, and increase silicon area. Without shielding, repeaters need to be staggered carefully to avoid crosstalk. With shielding, wiring area increases, potentially lengthening the global interconnects and making timing closure harder. In a wire-dominated design, adding repeaters and pipeline flip-flops creates another problem: via and metal congestion. Signals at high-level global routing metal layers must reach all the way down to the polysilicon layer to connect to repeater and flip-flop inputs, blocking or being blocked by routing on the metal layers between them. Likewise, the outputs of these repeaters and flip-flops must reach from the diffusion layer back up to the global routing metal layers (Figure 2).
No help is on the horizon with the traditional ASIC flow. In fact, wire delay normalized by gate delay grows exponentially from one process technology node to the next (Figure 3) [5]. Because designers try to pack more circuits and keep the die cost the same in a next-generation product, the total die size of a particular semiconductor product does not shrink from one process node to the next. The global wire length therefore remains constant from node to node, making global timing closure even more difficult.
THE LINKXPRESS ON-CHIP INTERCONNECT
The key enabling technology is low-swing differential signaling internal to the LinkXpress On-Chip Interconnects (Figure 4). Externally, the LinkXpress On-Chip Interconnects behave as a full-swing, single-ended macro and are therefore compatible with existing ASIC design flows. High performance, low-noise operation, high noise immunity, and low power consumption are some of the inherent benefits of low-swing differential signaling. Thanks to the highly sensitive differential receiver, the LinkXpress On-Chip Interconnects can drive long wires without repeaters, enhancing performance and lowering latency. The LinkXpress On-Chip Interconnect is a
physical-layer, protocol-agnostic IP. It can implement on-chip crossbar switches, master-slave buses, and point-to-point links.

[Figure: crossbar switch with inputs 1 through M and outputs 1 through N, surrounded by ingress/egress logic blocks]

Most applications using the LinkXpress On-Chip Interconnects are crossbar switches. In such applications, the crossbar switch is typically in the middle of the die, interconnecting (mostly identical) logic blocks around it. The LinkXpress Crossbar Switch embeds a clock tree whose leaf nodes can drive the clock tree roots inside the surrounding logic blocks, facilitating hierarchical place-and-route and timing closure (Figure 5). The LinkXpress Crossbar Switch is used in the LSI VC2002 SONET STS-1 Grooming Switch (Figure 6). The VC2002 is an I/O-limited design, and the distances between the SONET processing blocks are long (6 mm in both the horizontal and vertical dimensions). Two parallel LinkXpress Crossbar Switches, each able to sustain 180 Gb/s of throughput, interconnect these SONET processing blocks in a three-stage Clos network. The design uses only flip-flops guarding the single-ended primary inputs and outputs, with no intermediate pipelining. The global place-and-route tool sees the two LinkXpress Crossbar Switches as one hard macro, whose timing view consists of an array of input flip-flops for the data input and address pins, an array of output flip-flops, a clock input, and a set of clock tree outputs for driving the clock roots of the surrounding SONET processing blocks.
Crossbar Switches
An M x N crossbar switch allows any one of the inputs to connect to any one of the outputs. In other words, a crossbar switch is functionally equivalent to N M:1 multiplexors. Like the multiplexor-based equivalent, the LinkXpress Crossbar Switch can change its switching configuration every clock cycle. The crossbar switch itself is strictly non-blocking, i.e., an available input can connect to an available output without rearranging any existing connections. Used in a packet switch, however, contention arises when multiple inputs try to connect to the same output, which reduces throughput. As shown in the following subsections, practical packet switches require internal crossbar switch speed-up and/or multiple internal crossbar switches to bound delay and increase throughput. The LinkXpress Crossbar Switch enables high-performance switching, since it achieves switching throughput and latency, at low power, not possible with traditional cell-based designs.
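To make the multiplexor equivalence concrete, the following Python sketch (an illustration only; the function name and values are hypothetical, not the LinkXpress implementation) models an M x N crossbar switch as N M:1 multiplexors whose select addresses can change every clock cycle:

    def crossbar(inputs, selects):
        """Model an M x N crossbar switch as N M:1 multiplexors.

        inputs:  the M input values present in one clock cycle.
        selects: N addresses; selects[j] picks which of the M inputs
                 drives output j (lg M address bits per output).
        """
        return [inputs[sel] for sel in selects]

    # The configuration may change every cycle; here input 2 fans out
    # to outputs 0 and 1 while input 0 drives output 2.
    assert crossbar(["a", "b", "c"], [2, 2, 0]) == ["c", "c", "a"]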
Output-Queued Switches
An output-queued switch, which buffers packets only at the outputs of the crossbar switch, is non-blocking (Figure 7). Nonetheless, this property comes at a high cost. In the worst case, when all inputs try to connect to the same output, the output queue must have enough input bandwidth to avoid blocking. Each output queue in an M x N output-queued crossbar switch therefore must be able to accept M packets simultaneously, and the total input bandwidth of the output queues is therefore MN. The implementation cost of an output-queued switch is high. It is often used as a model to compare against lower-cost switching architectures coupled with particular matching algorithms and traffic patterns.
Input-Queued Switches
An input-queued switch consists of a FIFO at each crossbar switch input to buffer packets (Figure 8). Although the single-stage crossbar switch itself is non-blocking, an input-queued switch may block due to head-of-line (HOL) blocking. This occurs when packets from multiple input FIFOs attempt to connect to the same output simultaneously, blocking packets behind them that are destined for a different output port. Under independent and uniform traffic with packets of fixed and equal length (cells), an N x N input-queued switch, where N is the number of input and output ports and N is large, has throughput limited to 2 − √2 (approximately 59%) when the arbiter randomly picks which one of the contending inputs can connect to an output port [7]. The throughput gets worse with bursty traffic or traffic that favors certain ports (e.g., server-client traffic). Throughput of 100% has been demonstrated in simulation with the iSLIP scheduling algorithm, but the traffic must be independent and uniform [9].
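The 2 − √2 limit is easy to reproduce numerically. The sketch below (hypothetical code, assuming the saturated-FIFO model of [7]: every input always holds a head-of-line cell with a uniformly random destination, and each output grants one contending input at random) estimates the saturated throughput:

    import random

    def hol_throughput(n_ports, n_slots, seed=1):
        """Estimate saturated throughput of an N x N FIFO input-queued switch."""
        rng = random.Random(seed)
        hol = [rng.randrange(n_ports) for _ in range(n_ports)]  # HOL destinations
        departures = 0
        for _ in range(n_slots):
            contenders = {}
            for inp, out in enumerate(hol):
                contenders.setdefault(out, []).append(inp)
            for out, inputs in contenders.items():
                winner = rng.choice(inputs)            # random arbitration per output
                hol[winner] = rng.randrange(n_ports)   # winner's next HOL cell
                departures += 1
        return departures / (n_ports * n_slots)

    print(hol_throughput(64, 20000))  # tends toward 2 - sqrt(2), about 0.586, as N grows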
[Figures: two crossbar switch diagrams, each with N queues per port and a scheduler, illustrating the queued switch organizations discussed below]
Virtual Output-Queued Switches
Each input of an N x N virtual output-queued (VOQ) switch consists of N queues, one for each output. An arbiter selects packets from different queues to resolve HOL blocking. Maximum weight bipartite matching algorithms, such as Longest Queue First (LQF) and Oldest Cell First (OCF) [11], can achieve 100% throughput for a VOQ switch even under non-uniform traffic. Matching algorithms, however, must take time complexity and fairness into account. The following section reviews results with maximal matching algorithms, which are more practical to implement than maximum matching algorithms.
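As an illustration of the VOQ bookkeeping, the sketch below (hypothetical names; a greedy longest-queue-first pass stands in for the true maximum weight matching that LQF computes) matches inputs to outputs based on VOQ occupancy:

    def greedy_lqf_match(voq_len):
        """Greedily match inputs to outputs, longest virtual output queue first.

        voq_len[i][j] is the occupancy of input i's VOQ for output j.
        A true LQF arbiter computes a maximum weight matching; this
        single greedy pass is a cheaper stand-in for illustration only.
        """
        edges = [(voq_len[i][j], i, j)
                 for i in range(len(voq_len))
                 for j in range(len(voq_len[i])) if voq_len[i][j] > 0]
        match, used_in, used_out = {}, set(), set()
        for _, i, j in sorted(edges, reverse=True):   # longest queues first
            if i not in used_in and j not in used_out:
                match[i] = j
                used_in.add(i)
                used_out.add(j)
        return match

    # Input 0 holds 5 cells for output 2; input 1 holds 3 cells for output 0.
    assert greedy_lqf_match([[0, 0, 5], [3, 0, 2]]) == {0: 2, 1: 0}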
Combined Input-Output-Queued Switches
A combined input-output-queued (CIOQ) switch buffers packets at both the inputs and the outputs of the crossbar switch, with the crossbar running faster than the line rate by a speed-up factor. With a speed-up of two, a CIOQ switch can emulate an output-queued switch, including quality-of-service disciplines such as weighted fair queueing and strict priority [1]. When the input traffic satisfies the strong law of large numbers and no input or output is oversubscribed, a CIOQ switch with a speed-up of two can achieve 100% throughput with any maximal matching algorithm [4].
Parallel Packet Switches
An N x N parallel packet switch (PPS) [6] is a three-stage network, similar in connectivity to a three-stage Clos network [3]. In a k-layered PPS, there are N first-stage 1:k demultiplexors, k middle-stage N x N output-queued switches,
N third-stage k:1 multiplexors, and exactly one link between every element in every two adjacent stages. A PPS can emulate an output-queued switch with push-in-first-out queueing if each layer operates at a rate of 3R/k, where R is the line rate. From the previous section, a CIOQ switch can emulate an output-queued switch with a speed-up of two. So to implement a PPS with CIOQ switches as the middle-stage switches, each one of the k CIOQ switches must run at a rate of 6R/k. For instance, with a dual-layer PPS (i.e., k = 2), each CIOQ switch must have a speed-up of three.
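The rate bookkeeping above can be captured in a small helper (hypothetical names; it simply restates the 3R/k and 6R/k relations from the text):

    def pps_layer_rates(line_rate_r, k):
        """Per-layer rates for a k-layer PPS emulating an output-queued switch.

        Each output-queued middle stage must run at 3R/k; if the middle
        stages are CIOQ switches (internal speed-up of two), each must
        run at 6R/k overall.
        """
        return {"oq_middle_stage": 3 * line_rate_r / k,
                "cioq_middle_stage": 6 * line_rate_r / k}

    # Dual-layer PPS (k = 2): each CIOQ middle stage runs at 6R/2 = 3R,
    # i.e., a speed-up of three relative to the line rate R.
    print(pps_layer_rates(line_rate_r=1.0, k=2))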
Master-Slave Switched Buses
An N-master, M-slave switched bus is logically equivalent to an N x M crossbar switch plus an M x N crossbar switch. Figure 11 illustrates a master-slave bus example. The blocks on top, display1, display2, cpu_dmem, cpu_imem, and dsp, are masters, and the blocks memctrl1, memctrl2, and memctrl3 are slaves. Figure 12 shows the connectivity matrix of this 5-master, 3-slave switched bus. The rows of the matrix represent inputs and the columns outputs. A 1 in the matrix represents a possible connection between an input and an output; a 0 represents that no connection exists. Since masters do not communicate among themselves and neither do slaves, grouping the master rows and columns as shown in Figure 12 reveals that the connectivity matrix consists of two all-1 matrices and two all-0 matrices, clearly indicating two distinct crossbar switches: a 5 x 3 and a 3 x 5. If any master or slave were allowed to communicate with any other master or slave through the switched bus, then the entire connectivity matrix would consist of all 1s, and a single 8 x 8 crossbar switch could implement the interconnects. Since the LinkXpress On-Chip Interconnect is a physical wiring macro, and a master-slave on-chip switched bus is logically equivalent to crossbar switches, the LinkXpress On-Chip Interconnect can implement a master-slave on-chip switched bus much like the way it implements packet switches.
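The block structure of the connectivity matrix is easy to verify programmatically. The following sketch (illustrative only; port names taken from Figure 11) builds the 5-master, 3-slave matrix and checks that it decomposes into a 5 x 3 and a 3 x 5 crossbar:

    masters = ["display1", "display2", "cpu_dmem", "cpu_imem", "dsp"]
    slaves = ["memctrl1", "memctrl2", "memctrl3"]
    ports = masters + slaves

    # Rows are inputs, columns are outputs; a connection is possible only
    # between a master and a slave, never master-master or slave-slave.
    conn = [[1 if (r in masters) != (c in masters) else 0 for c in ports]
            for r in ports]

    # Grouping master rows and columns first exposes two all-1 blocks:
    assert all(conn[i][j] for i in range(5) for j in range(5, 8))   # 5 x 3 crossbar
    assert all(conn[i][j] for i in range(5, 8) for j in range(5))   # 3 x 5 crossbar
    assert not any(conn[i][j] for i in range(5) for j in range(5))  # all-0 block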
Barrel Shifters
The most general form of an N-bit barrel shifter rotates an N-bit input arbitrarily. Barrel shifters are useful for allocating shared on-chip resources in networking ASICs. Traditionally, barrel shifters are implemented with multiplexors: an N-bit barrel shifter can be built with lg N stages of 2:1 multiplexors. Note that the connectivity matrix of a barrel shifter is identical to that of a crossbar switch, since any one of the N inputs can connect to any one of the N outputs. The difference lies in the control logic. Each output of an N x N crossbar switch is associated with a lg N-bit address to select one of the N inputs; thus, there are N lg N address (control) bits. A barrel shifter, on the other hand, needs only lg N control bits. A barrel shifter is a logical subset of a crossbar switch: it is essentially a crossbar switch in which the values of the N address ports are correlated. A barrel shifter whose inputs and/or outputs span long distances can take advantage of the LinkXpress On-Chip Interconnects as the physical implementation.
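The lg N control bits can be seen directly in a software model. The sketch below (a hypothetical illustration, not a hardware netlist) rotates an N-bit vector through lg N stages, each stage acting as a rank of 2:1 multiplexors steered by one control bit:

    def barrel_rotate(bits, amount):
        """Rotate an N-bit vector left with lg N stages of 2:1 multiplexors."""
        n = len(bits)
        out = list(bits)
        for stage in range(n.bit_length() - 1):  # lg N stages for power-of-two N
            if (amount >> stage) & 1:            # one control bit per stage
                d = 1 << stage                   # stage s rotates by 2**s
                out = out[d:] + out[:d]          # every bit position: a 2:1 mux
        return out

    # 3 control bits (lg 8), versus 8 x lg 8 = 24 address bits for a crossbar.
    assert barrel_rotate(list(range(8)), 3) == [3, 4, 5, 6, 7, 0, 1, 2]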
[Figure 11: master-slave switched bus; masters display1, display2, cpu_dmem, cpu_imem, and dsp connect through arbiters to slaves memctrl1, memctrl2, and memctrl3]

Figure 12. Connectivity matrix of the 5-master, 3-slave switched bus (rows are inputs, columns are outputs, both ordered display1, display2, cpu_dmem, cpu_imem, dsp, memctrl1, memctrl2, memctrl3):

display1   0 0 0 0 0 1 1 1
display2   0 0 0 0 0 1 1 1
cpu_dmem   0 0 0 0 0 1 1 1
cpu_imem   0 0 0 0 0 1 1 1
dsp        0 0 0 0 0 1 1 1
memctrl1   1 1 1 1 1 0 0 0
memctrl2   1 1 1 1 1 0 0 0
memctrl3   1 1 1 1 1 0 0 0
Matching Circuits

[Figure 13: matching circuit; connection request inputs r00 through rNM enter the matching logic, which produces connection assignment outputs c00 through cMN]

Figure 14. An example maximum matching and maximal matching for a 3 x 3 connection request matrix:

        0 0 1                 0 0 1                 0 0 0
R   =   1 1 1     C_maximum = 0 1 0     C_maximal = 0 1 0
        1 0 1                 1 0 0                 0 0 1
So far, this paper has focused on the data plane of the on-chip global interconnects. The control plane determines the connectivity. Since control information, such as the destination of a packet, is embedded in the data plane for many networking ASICs, and the data ports are distributed throughout a large die, there is also a global interconnect problem to solve for the control plane. In switches and routers, a group of inputs is matched to a group of outputs to satisfy a set of connection requests, and the matched inputs and outputs are connected to each other. A set of connection requests can be represented by a binary connection request matrix R, in which the rows represent the inputs and the columns the outputs. The entry in row i and column j is 1 (rij = 1) if and only if input i requests a connection to output j. As a result, a row with multiple 1s represents a multicast request. A matching circuit determines a binary connection assignment matrix C, where cij = 1 if and only if there is a connection assigned between input i and output j. Each column in C therefore can have at most one 1; in other words, the connection assignment must be contention-free. Furthermore, each row also has at most one 1. Figure 13 shows a high-level diagram of a matching circuit.
A maximum matching is one with the greatest possible number of 1s in the connection assignment matrix. A maximal matching is one such that, to increase the number of 1s in the connection assignment, one or more assignments must be removed and the matching recomputed. Figure 14 illustrates an example maximum matching and an example maximal matching for a 3 x 3 connection request matrix. Maximum matching is hardware-intensive and slow. Commercial switches use maximal matching algorithms, such as parallel iterative matching (PIM) [1] and iSLIP [10]. In practice, these algorithms are not run to completion due to time constraints and produce submaximal matchings. Typical commercial matching algorithms consist of three phases: request, grant, and accept. Take a VOQ or CIOQ switch, for instance. In the request phase, each input signals a connection request to every output for which the input has a packet in the queue. During the grant phase, each output grants one input request. During the accept phase, each input accepts one grant among all the grants from all the outputs to form a match. For switch-on-chips, because request (grant) information from each input (output) needs to be gathered, and the inputs (outputs) are distributed over a large area of silicon, the LinkXpress On-Chip Interconnects can be used to transport the matching control signals efficiently.
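A single request-grant-accept iteration can be sketched as follows (an illustration in the style of PIM [1], with hypothetical names; this is not the LinkXpress control logic). Each output grants one requesting input at random, and each input accepts one of the grants it received:

    import random

    def request_grant_accept(R, rng=None):
        """One three-phase matching iteration on request matrix R.

        R[i][j] = 1 if input i has a queued packet for output j. Returns
        an assignment matrix C with at most one 1 per row and per column.
        """
        rng = rng or random.Random(0)
        n_in, n_out = len(R), len(R[0])
        grants = {}                          # input -> outputs that granted it
        for j in range(n_out):               # grant phase
            requesters = [i for i in range(n_in) if R[i][j]]
            if requesters:
                grants.setdefault(rng.choice(requesters), []).append(j)
        C = [[0] * n_out for _ in range(n_in)]
        for i, granted in grants.items():    # accept phase
            C[i][rng.choice(granted)] = 1
        return C

    R = [[0, 0, 1], [1, 1, 1], [1, 0, 1]]    # the request matrix of Figure 14
    print(request_grant_accept(R))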
Summary
This white paper outlines the advantages of the LinkXpress On-Chip Interconnects over the existing cell-based methodology. The key enabling technology, on-chip low-swing differential signaling, is internal to the LinkXpress On-Chip Interconnects; the primary inputs and outputs of the LinkXpress On-Chip Interconnects are guarded by full-swing, single-ended flip-flops, making the design compatible with existing cell-based methodology. The most common application for the LinkXpress On-Chip Interconnects is building on-chip crossbar switches. A number of crossbar-switch-based packet switch architectures are reviewed, many of which require speed-up and/or parallel crossbar switches to bound delay and achieve high throughput. The LinkXpress On-Chip Interconnects facilitate building such large crossbar switches efficiently. Aside from packet switches, on-chip master-slave switched buses are shown to be logically equivalent to two crossbar switches. The matching circuit for a crossbar switch also requires efficient global interconnects. Thus, the LinkXpress On-Chip Interconnects are suitable for both the data plane and the control plane of these applications.
References
1. T. Anderson, S. Owicki, J. Saxe, and C. Thacker, "High-speed switch scheduling for local-area networks," ACM Trans. Comput. Syst., vol. 11, no. 4, pp. 319-352, Nov. 1993.
2. S.-T. Chuang, A. Goel, N. McKeown, and B. Prabhakar, "Matching output queueing with a combined input output queued switch," IEEE J. Selected Areas in Communications, vol. 17, no. 6, pp. 1030-1039, June 1999.
3. C. Clos, "A study of non-blocking switching networks," Bell Syst. Tech. J., vol. 32, pp. 406-424, 1953.
4. J. G. Dai and B. Prabhakar, "The throughput of data switches with and without speed-up," Proc. IEEE INFOCOM, pp. 556-564, March 2000.
5. R. Ho, K. Mai, and M. Horowitz, "Managing wire scaling: A circuit perspective," IEEE Interconnect Technology Conference, June 2003.
6. S. Iyer and N. McKeown, "Analysis of the parallel packet switch architecture," IEEE/ACM Trans. Networking, vol. 11, no. 2, pp. 314-324, April 2003.
7. M. Karol, M. Hluchyj, and S. Morgan, "Input vs. output queueing on a space-division packet switch," IEEE Trans. Communications, vol. 35, no. 12, pp. 1347-1356, December 1987.
8. H.-I. Lee and S.-W. Seo, "Matching output queueing with a multiple input-output queued switch," IEEE/ACM Trans. Networking, vol. 14, no. 1, pp. 121-132, February 2006.
9. N. McKeown, "Scheduling algorithms for input-queued cell switches," Ph.D. dissertation, University of California at Berkeley, 1995.
10. N. McKeown, "The iSLIP scheduling algorithm for input-queued switches," IEEE/ACM Trans. Networking, vol. 7, no. 2, pp. 188-201, 1999.
11. N. McKeown, A. Mekkittikul, V. Anantharam, and J. Walrand, "Achieving 100% throughput in an input-queued switch," IEEE Trans. Communications, vol. 47, no. 8, pp. 1260-1267, August 1999.
12. N. McKeown, M. Izzard, A. Mekkittikul, W. Ellersick, and M. Horowitz, "The Tiny Tera: A packet switch core," Hot Interconnects V, Stanford University, August 1996.
For more information and sales office locations, please visit the LSI web sites at: lsi.com lsi.com/contacts
North American Headquarters Milpitas, CA T: +1.866.574.5741 (within U.S.) T: +1.408.954.3108 (outside U.S.) LSI Europe Ltd. European Headquarters United Kingdom T: [+44] 1344.413200 LSI KK Headquarters Tokyo, Japan Tel: [+81] 3.5463.7165
LSI Corporation, the LSI logo design, and LinkXpress are trademarks or registered trademarks of LSI Corporation. All other brand and product names may be trademarks of their respective companies. LSI Corporation reserves the right to make changes to any products and services herein at any time without notice. LSI does not assume any responsibility or liability arising out of the application or use of any product or service described herein, except as expressly agreed to in writing by LSI; nor does the purchase, lease, or use of a product or service from LSI convey a license under any patent rights, copyrights, trademark rights, or any other of the intellectual property rights of LSI or of third parties. Copyright 2007 by LSI Corporation. All rights reserved.