
A High-Performance Low-Power Nanophotonic On-Chip Network

Zheng Li, Jie Wu, Li Shang, Alan Mickelson, Manish Vachharajani, Dejan Filipovic, Wounjhang Park, and Yihe Sun

TNLIST, Inst. of Microelectronics, Tsinghua University, Beijing 100084, China
Department of Electrical, Computer, and Energy Engineering, University of Colorado, Boulder, CO 80309, U.S.A.

{zheng-li,wujie07}@mails.tsinghua.edu.cn {li.shang, mickel, manish.vachharajani, dejan, wpark}@colorado.edu sunyh@tsinghua.edu.cn

ABSTRACT

On-chip communication, including short, often-multicast, latency-critical coherence and synchronization messages, and long, unicast, throughput-sensitive data transfer, limits the power efficiency and performance scalability of many-core chip-multiprocessor systems. This article presents Iris, a CMOS-compatible high-performance low-power nanophotonic on-chip network. Iris' linear-waveguide-based throughput-optimized circuit-switched subnetwork supports throughput-sensitive data transfer. Iris' planar-waveguide-based WDM broadcast/multicast subnetwork optimizes latency-critical traffic and supports the circuit setup of circuit-switched communication. Overall, the proposed design delivers an on-chip communication backplane with high power efficiency, low latency, and excellent throughput.

Categories and Subject Descriptors
C.1.2 [Processor Architectures]: Multiple Data Stream Architectures (Multiprocessors) Interconnect Architectures

General Terms
Design, Performance

Keywords
Networks-on-Chip, Silicon Photonics

This paper was supported in part by the NSF under awards CCF0829950, in part by the National Natural Science Foundation of China (NSFC) under grant #60236020 and the Specialized Research Fund for the Doctoral Program of Higher Education (SRFDP) #20050003083. Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISLPED'09, August 19-21, 2009, San Francisco, California, USA. Copyright 2009 ACM 978-1-60558-684-7/09/08 ...$10.00.

1. INTRODUCTION

While fabrication technology scaling has steadily reduced the size and improved the performance and power efficiency of transistors, communication remains a key challenge for multi-billion-transistor many-core systems. The main considerations are absolute data throughput and message latency versus the total power dissipation of the interconnect. These pressures have pushed designers from bus-based or point-to-point interconnects to packet-switched on-chip electrical interconnect fabrics [1, 2, 3]. These fabrics offer significant performance and power advantages over the aforementioned prior approaches. However, the electrical interconnect fabric is far from ideal in terms of power efficiency and performance scalability. Furthermore, these fabrics introduce non-deterministic latency because of router-by-router data buffering and resource arbitration [4].

On-chip optics promise far better performance and power dissipation [5]. Kirman et al. [6] propose a hierarchical snoopy bus design consisting of global optical loops and local electrical interconnect for chip multiprocessors. Shacham et al. [7] present a hybrid network-on-chip (NoC) for future 22 nm technology. The proposed design combines a high-bandwidth circuit-switched torus photonic network and a packet-switched electrical control network. Vantrease et al. [8] present a nanophotonic interconnect targeting throughput optimization in many-core systems for future 16 nm technology. The proposed design leverages wavelength-division multiplexing (WDM) and a novel all-optical arbitration scheme. Petracca et al. [9] explore topologies of photonic network design for a single real-world application. Batten et al. [10] propose using monolithic silicon photonics for many-core processor-to-DRAM interconnect.

However, there are still challenges that these optical interconnects do not overcome, both in terms of architecture and fabrication. In many-core systems, on-chip traffic can be classified into two categories. The first are short, often-multicast, latency-critical coherence and synchronization messages. The second are throughput-bound, often-unicast data transfers, such as cache line flushes. Recent studies show that even though latency-critical traffic is only a small portion of the total bits transferred on the network, system performance can degrade significantly if it is not properly handled [11]. Unfortunately, design decisions centered around fabrication limitations of optical components mean that these messages are not handled well. In particular, it is difficult to fabricate optical buffers in a CMOS-compatible manner [12], which means that the more scalable optical approaches are circuit-switched, often relegating circuit setup to a latency-optimized electrical network. This increases the latency of short synchronization messages. Worse, these messages are often broadcast or multicast and thus incur the cost of setting up multiple circuits and retransmitting messages over these multiple circuits, further driving up latency.

In this work, we propose Iris, a CMOS-compatible, high-performance, low-power nanophotonic on-chip network. As shown in Figure 1, Iris is a multi-layer design, consisting

[Figures 1 and 2 omitted. Recoverable labels: SiO2; Si planar waveguide; antennas; ring resonators; Si linear waveguides (vertical, horizontal); silicon-silicon interconnect; metal wires; CMOS devices; through-silicon via; Si package; laser power; electrical power pad; wavelengths W0 to W7 versus time (0 to 7); arbitration; status update; coherence packets; broadcast coherence messages; state update for segments of linear waveguides.]

Figure 1: Schematic depiction of Iris

of a low-latency planar-waveguide-based WDM broadcast/multicast nanophotonic subnetwork and a linear-waveguide-based throughput-optimized circuit-switching nanophotonic subnetwork. This dual-network design delivers exceptional performance with low power dissipation by:

1. Optimizing the delivery of short latency-bound multicast messages on the low-latency multicast subnetwork.
2. Minimizing circuit setup time for the circuit-switched throughput-optimized network by using multicast messages sent over the low-latency multicast subnetwork.
3. Minimizing power dissipation by using efficient optical communication for circuit setup, multicast, and high-throughput data transfer.

Moreover, each of Iris' subnetworks is carefully designed to address fabrication challenges for nanophotonic interconnects. In particular, we have successfully fabricated the key novel structure, the antennas for broadcasting messages in a planar waveguide, in a CMOS-compatible manner. The net result is an interconnect fabric with excellent throughput, low latency, and low power dissipation. Our evaluation shows that, compared with other recently proposed electrical and photonic alternatives, Iris improves performance by 63.3% and 52.6%, and power efficiency by 56.1% and 72.6%, respectively.

Figure 2: Planar waveguide arbitration and usage


[Figure 3 omitted. Recoverable labels: light source; laser source; source waveguides; transmitter; signals; channel select; modulators; intensive modulation; transmission antennas; nanophotonic waveguides; planar waveguide; linear waveguides; receiver; receiving antennas; demultiplexers stack; filters; detectors; frequency response in transceiver side; frequency response in receiver side.]

Figure 3: Photonic network structure

wavelengths in the planar waveguide to arbitrate for one shared resource. The arbitration scheme starts with each node being assigned a unique dynamic priority. Each node broadcasts a "1" signal on the wavelength corresponding to its dynamic priority. As the node sends this one-bit message, it also listens for the combined messages broadcast across all wavelengths. If multiple nodes try to access the same resource, they all see the same bit vector, which reflects the dynamic priorities of all the requesting nodes. Each node is equipped with checking logic that examines whether any bit in the bit vector has a higher priority than its own. If so, the node knows that it has lost the arbitration. After each data transmission, each node changes its priority using a deterministic random number generator. Since all nodes share the same random seed, they agree on the same global priority.

To arbitrate for K resources, i.e., multiple broadcast channels, W nodes can still leverage the W wavelengths in the broadcast network. The only difference is that each node is pre-assigned a resource id for arbitration. The scheme then becomes K parallel (W/K):1 arbitrations. This design sacrifices some arbitration efficiency for reduced arbitration latency and simplified arbitration logic.
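The arbitration scheme described above can be sketched in a few lines of Python. This is a minimal behavioral model, not the authors' hardware logic: the function names (`arbitrate`, `next_priorities`, `arbitrate_k`) are hypothetical, the convention that a larger priority value wins is an assumption, and the shared deterministic generator is modeled with Python's seeded `random.Random`.

```python
import random


def arbitrate(requesters, priorities):
    """One W:1 arbitration round on the shared broadcast medium.

    requesters: set of node ids requesting the resource this cycle.
    priorities: dict node id -> unique dynamic priority, which is also the
                index of the wavelength that node drives.
    Assumption: a larger priority value wins.
    Returns the winning node id, or None if there are no requesters.
    """
    # Each requester drives a "1" on its own wavelength, so every node
    # observes the same combined bit vector.
    bit_vector = {priorities[n] for n in requesters}
    # A node loses if it sees any set bit with higher priority than its own.
    winners = [n for n in requesters
               if not any(p > priorities[n] for p in bit_vector)]
    return winners[0] if winners else None


def next_priorities(nodes, rng):
    """Redraw unique priorities from a shared, seeded generator."""
    perm = list(range(len(nodes)))
    rng.shuffle(perm)
    return dict(zip(nodes, perm))


def arbitrate_k(requesters, priorities, resource_of):
    """K parallel arbitrations; resource_of pre-assigns each node a resource id."""
    groups = {}
    for n in requesters:
        groups.setdefault(resource_of[n], set()).add(n)
    return {res: arbitrate(group, priorities) for res, group in groups.items()}
```

Because every node runs the same check on the same bit vector, and every node redraws priorities from an identically seeded generator, all nodes reach the same decision without exchanging any further messages.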

2. IRIS: NETWORK LOGICAL DESIGN

In Iris, the broadcast and throughput nanophotonic subnetworks operate in tandem to provide low-latency, high-throughput communication at low power. The interaction is straightforward. Short multicast messages are transmitted via WDM on the planar-waveguide-based broadcast network. Since the network is inherently broadcast-based, multicast requires no additional effort. However, since it is a broadcast medium, arbitration for the WDM channels becomes the primary logical design challenge. For larger, throughput-sensitive messages, the linear-waveguide-based throughput-optimized subnetwork is used, as it can deliver a higher total cut-through bandwidth. To allocate a channel, we use the broadcast network to reserve the appropriate wavelength and the physical waveguides, so that circuits are set up and torn down with very low latency.
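As a rough illustration of this division of labor, the following sketch routes a message to one of the two subnetworks. The `Message` type, the `select_subnetwork` function, and the 16-byte cutoff are all hypothetical; the paper does not give a size threshold.

```python
from dataclasses import dataclass
from enum import Enum


class Subnetwork(Enum):
    BROADCAST = "planar-waveguide WDM broadcast"
    CIRCUIT = "linear-waveguide circuit-switched"


@dataclass
class Message:
    size_bytes: int
    multicast: bool


# Hypothetical cutoff between "short" and "long" messages (not from the paper).
SHORT_MESSAGE_BYTES = 16


def select_subnetwork(msg):
    """Pick the subnetwork for a message under the assumptions above."""
    # Short or multicast traffic goes over the inherently broadcast medium.
    if msg.multicast or msg.size_bytes <= SHORT_MESSAGE_BYTES:
        return Subnetwork.BROADCAST
    # Long unicast transfers use a circuit, which is itself set up by
    # messages sent over the broadcast subnetwork.
    return Subnetwork.CIRCUIT
```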

2.2 Throughput-Optimized Circuit-Switched Subnetwork


The throughput-optimized subnetwork is a circuit-switched optical network with horizontal and vertical waveguides between each row and each column of processors. The aforementioned broadcast subnetwork provides a low-latency means of allocating circuits for long data messages. The same arbitration protocol used for sending broadcast messages is used to allocate resources when establishing a circuit. Each node keeps a local linear-waveguide occupation table, which records the occupation status of all segments in all linear waveguides. Sending a long data message takes three steps. First, the node checks resource availability locally. Second, if the resources are available, it broadcasts a request for one horizontal and one vertical waveguide. The arbitration scheme described in the previous section resolves conflicts between concurrent requests. Finally, if the resource allocation succeeds, data is forwarded through the established circuit as follows. The data message is modulated onto the horizontal waveguide, switches to and travels through a vertical waveguide, and then arrives

2.1 Low-Latency Broadcast Subnetwork

The broadcast subnetwork is used for short, latency-critical, often-multicast messages. In addition, because each broadcast operation has global reach with low latency, all global resource arbitrations (for both the broadcast subnetwork and the circuit-switched subnetwork) are conducted over this network. Another advantage of this approach is that we can leverage the broadcast nature of the subnetwork to provide a global order for events. The key logical design consideration for the broadcast network is how to perform arbitration. To do this, we leverage the broadcast nature of the subnetwork and WDM. As we will see, the net result is a distributed global arbitration scheme that enables one-cycle arbitration. Figure 2 illustrates the arbitration process. W nodes can leverage W

[Figure 4 panels: (a) Antenna model [13]; (b) SEM image]

Figure 4: Antenna model and scanning electron microscope images

at the destination. In addition, the source node broadcasts a message to all other nodes, each of which updates its local resource occupation table to reflect the newly established circuit. Coordination, arbitration, and state update messages all travel through the broadcast network.

To minimize network latency and power consumption, the proposed design uses a low-radix mesh topology. A high-radix design can effectively minimize the average number of routing hops. However, given a fixed on-chip communication bandwidth, increasing the radix increases serialization latency. For a photonic interconnect operating at the speed of light, latency depends on bandwidth rather than on the number of hops. Therefore, the proposed network uses a low-radix mesh design. This design decision also considers power optimization. Since bends and crossings introduce intersection loss, crosstalk, and back reflection, which in turn cause high power consumption, they should be avoided whenever possible. The proposed switch design can effectively minimize optical power loss. In addition, since the switches operate at the packet level, their power consumption is negligible compared to that of the transceivers, which operate at the bit level.

off the indiffused boundaries. The antenna structure that we attempted to produce is the ideal structure of the Pistolkors [13] diffraction antenna, as depicted in Figure 4(a). To fabricate the antennas, a thin layer (~20 nm) of Au is vacuum deposited on the silicon layer of an SOI wafer. A set of square Au islands in a two-dimensional square pattern is then produced by photolithographically processing this Au layer. The lithography process replicates the mask detail reasonably well, apart from some loss of edge sharpness, which is undesirable for antenna operation. The optical antenna structures are then formed with a focused ion beam (FIB), which focuses gallium ions onto the gold surface and etches through the deposited layer down to the surface of the silicon. Figure 4(b) shows a scanning electron micrograph of an antenna structure written in 20 nm Au on top of the SOI layer.

4. EXPERIMENTAL RESULTS
In this section, we evaluate Iris, the proposed nanophotonic on-chip network. The following experiments are conducted on a 4 × 4 CMP, using a trace-driven cycle-accurate cache-network simulator. Network traffic traces are gathered using the M5 full-system simulator [14] running several SPLASH2 [15] and ALPBench [16] multithreaded benchmarks. The performance and power efficiency of Iris are evaluated against the following recently proposed electrical and photonic alternatives.

ELE: a packet-switched electrical mesh network equipped with a latency-optimized three-pipeline-stage router design with speculative virtual channel allocation [17].

PLinear: a linear-photonic-waveguide-based circuit-switched mesh network, an approximation of a recently proposed optical network design [7]. It uses a latency-optimized electrical network to set up the photonic circuit-switched path.

PPlanar: a hybrid packet-switched network consisting of a planar-photonic-waveguide-based subnetwork and an electrical mesh subnetwork. Coordination messages are sent via the planar network, as in Iris, but the large data packets are sent via the electrical network.

Figure 5(a) shows the power dissipation breakdown of Iris and the other three alternatives. The electrical network power consumption of each alternative is calculated by synthesizing an RTL router design using Synopsys Design Compiler with the TSMC 65 nm low-power technology library. The power consumption of the electrical network comes from routers (labeled as router) and link circuitry (labeled as link). The power consumption of the nanophotonic network comes from planar waveguide loss (labeled as p loss), linear waveguide loss (labeled as l loss), and transceiver power for multicasting/broadcasting (labeled as p TR) and for unicasting (labeled as l TR). Detector power is negligible compared to the other power sources and is therefore not shown in the results.

This study shows that Iris is the most power-efficient on-chip interconnect solution. Compared to ELE, the electrical alternative, Iris reduces power consumption by 56.1%. PLinear and PPlanar, on the other hand, show no power benefit over ELE, for the following reasons. With current nanophotonic technology, on-chip waveguides, modulation, and crossings are lossy and power consuming. PLinear's power characterization shows that transmitting small, often-multicast messages over the linear waveguides is not power efficient, due to the high waveguide loss, the transceiver power consumption, and the power consumption of the electrical network used to set up the circuit-switched path of the photonic network. Compared to

3. IRIS: NETWORK PHYSICAL DESIGN

As shown in Figure 3, Iris' nanophotonic subnetworks consist of transmitters, waveguides, and receivers. The nanophotonic components are fabricated in separate silicon layers and integrated with the CMOS silicon die through three-dimensional integration. Therefore, the design and fabrication of each layer can be optimized independently. As shown in Figure 1, the broadcast waveguide is a planar SOI waveguide built on the top layer of the nanophotonic stack. Within the planar SOI waveguide layer, an array of nanophotonic antennas, which broadcast and receive optical signals, is printed nanoscopically on the surface of the silicon. These antennas are one of the unique features of Iris, but they are also the key unproven nanophotonic component; however, we discuss how we have successfully fabricated them. The antenna feeds extend through several layers of nanophotonic devices, i.e., ring resonators, that relay messages between the send/receive electronics and the planar waveguide. The layer of linear SOI waveguides is underneath these devices. The linear waveguides form a mesh-like circuit-switched network, which delivers off-chip laser power to on-chip processing cores, transfers large data messages between on-chip cores, and also communicates with off-chip DRAMs via chip-to-chip fiber. The bottom layer consists of germanium-doped silicon photon detectors, amplifiers, and other electrical components. We choose SOI for both the planar and linear waveguides because the index of refraction contrast between silicon (n_Si ≈ 3.475) and silicon dioxide (n_SiO2 ≈ 1.45) at the working bands (1530 nm to 1625 nm) is one of the largest that can be achieved. This large index differential can be used to obtain tight confinement in silicon layers with minimum scattering

[Figure 5 omitted. Panel (a), On-chip power dissipation decomposition: power (W) for ELE, PLinear, PPlanar, and Iris on each benchmark, broken down into link, routers, p loss, p TR, l loss, and l TR. Panel (b), Average local L2 miss latency decomposition: L2 miss latency (clock cycles) for the same four networks on each benchmark, broken down into request, protocol, memory, and ack. Benchmarks: cholesky, fft, fmm, lu, mpgenc, ocean, radix, SPECJbb, TPC-H, TPC-W, waternsq.]

Figure 5: Performance and power comparison

PLinear, Iris improves the overall network power efficiency by 72.6%. For PPlanar, since most of the throughput-hungry messages are still transferred via the electrical subnetwork, the power consumption of the electrical network is significant. Compared to PPlanar, Iris improves the overall network power efficiency by 64.0%. As technology scales further, the power efficiency of nanophotonic devices and components is expected to improve. The power consumption of the electrical network, on the other hand, is expected to increase. Therefore, using a nanophotonic on-chip interconnect becomes increasingly beneficial in terms of power.

Next, we evaluate the performance of Iris. Figure 5(b) shows the average L2 cache miss latency (read and write) of Iris compared to the three alternatives. In this study, latency is decomposed into the following four components: cache miss request latency (labeled as request), protocol transaction latency (labeled as protocol), memory access latency (labeled as memory), and acknowledgment or data reply latency (labeled as ack). This study shows that the planar waveguide efficiently delivers coherence messages and simplifies protocol transactions, while the linear waveguide network improves the network throughput for large data packets. Therefore, compared with ELE and PLinear, Iris, which is equipped with the new planar waveguide, can significantly reduce the protocol transaction latency and request latency. In addition, the planar waveguide can effectively reduce the setup latency of the circuit-switched network. Overall, Iris provides a high-performance on-chip communication solution with a 63.3% improvement over ELE, a 52.6% improvement over PLinear, and a 44.1% improvement over PPlanar.

5. CONCLUSION

Emerging many-core on-chip systems call for power-efficient, high-performance on-chip communication solutions. In this work, we propose Iris, a nanophotonic on-chip network, which consists of a low-latency planar-waveguide-based WDM broadcast/multicast subnetwork and a throughput-optimized linear-waveguide-based circuit-switching subnetwork. Together, the proposed design provides power-efficient support for both latency-critical and throughput-critical on-chip communication traffic of many-core systems. Experimental study demonstrates that Iris improves power efficiency and performance by 56.1% and 63.3% over the electrical alternative, and by 72.6% and 52.6% over the linear-waveguide-based nanophotonic alternative.

6. REFERENCES

[1] P. Gratz, et al., "On-chip interconnection networks of the TRIPS chip," IEEE Micro, vol. 27, no. 5, pp. 41-50, Sept. 2007.
[2] "Tilera TILE64 chip-multiprocessor," http://www.tilera.com.
[3] S. Vangal, et al., "An 80-tile 1.28 TFLOPS network-on-chip in 65nm CMOS," in Proc. Int. Solid-State Circuits Conf., Feb. 2007, pp. 98-589.
[4] A. Kumar, et al., "Express virtual channels: towards the ideal interconnection fabric," in Proc. Int. Symp. Computer Architecture, June 2007, pp. 150-161.
[5] R. Beausoleil, et al., "Nanoelectronic and nanophotonic interconnect," Proceedings of the IEEE, vol. 96, no. 2, pp. 230-247, Feb. 2008.
[6] N. Kirman, et al., "On-chip optical technology in future bus based multicore designs," IEEE Micro, vol. 27, no. 1, pp. 56-66, 2007.
[7] A. Shacham, K. Bergman, and L. Carloni, "Photonic networks-on-chip for future generations of chip multiprocessors," IEEE Transactions on Computers, vol. 57, no. 9, pp. 1246-1260, Sept. 2008.
[8] D. Vantrease, et al., "Corona: System implications of emerging nanophotonic technology," in Proc. Int. Symp. Computer Architecture, 2008, pp. 153-164.
[9] M. Petracca, et al., "Design exploration of optical interconnection networks for chip multiprocessors," in Proc. Symp. High Performance Interconnects, Aug. 2008, pp. 31-40.
[10] C. Batten, et al., "Building manycore processor-to-DRAM networks with monolithic silicon photonics," in Proc. Symp. High Performance Interconnects, Aug. 2008, pp. 21-30.
[11] N. E. Jerger, L.-S. Peh, and M. Lipasti, "Virtual circuit tree multicasting: A case for on-chip hardware multicast support," in Proc. Int. Symp. Computer Architecture, June 2008.
[12] M. Haurylau, et al., "On-chip optical interconnect roadmap: challenges and critical directions," in Proc. Int. Conf. on Group IV Photonics, Sept. 2005.
[13] A. A. Pistolkors, "Theory of the circular diffraction antenna," Proceedings of the IRE, vol. 36, no. 1, pp. 56-60, 1948.
[14] N. L. Binkert, et al., "The M5 simulator: Modeling networked systems," IEEE Micro, vol. 26, no. 4, pp. 52-60, 2006.
[15] "SPLASH2 website," http://www-flash.stanford.edu/apps/SPLASH/.
[16] M.-L. Li, et al., "The ALPbench benchmark suite for complex multimedia applications," in Proc. Int. Symp. Workload Characterization, Oct. 2005, pp. 34-35.
[17] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann Pub., 2003.
