
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 19, NO. 11, NOVEMBER 2011

Application-Aware Topology Reconfiguration for On-Chip Networks


Mehdi Modarressi, Student Member, IEEE, Arash Tavakkol, Student Member, IEEE, and Hamid Sarbazi-Azad
Abstract: In this paper, we present a reconfigurable architecture for networks-on-chip (NoC) on which arbitrary application-specific topologies can be implemented. When a new application starts, the proposed NoC tailors its topology to the application traffic pattern by changing the inter-router connections to a predefined configuration corresponding to the application. It addresses one of the main drawbacks of the existing application-specific NoC optimization methods, i.e., optimization of NoCs based on the traffic pattern of a single application. Supporting multiple applications is a critical feature of an NoC when several different applications are integrated into a single modern and complex multicore system-on-chip or chip multiprocessor. The proposed reconfigurable NoC architecture supports multiple applications by appropriately configuring itself to a topology that matches the traffic pattern of the currently running application. This paper first introduces the proposed reconfigurable topology and then addresses the problems of core to network mapping and topology exploration. Further on, we evaluate the impact of different architectural attributes on the performance of the proposed NoC. Evaluations consider network latency, power consumption, and area complexity.

Index Terms: Application-specific systems-on-chip (SoCs), multi-application-based design, networks-on-chip (NoC), performance, power consumption, reconfigurable systems.

I. INTRODUCTION

APPLICATION-SPECIFIC optimization is one of the most effective approaches to bridge the existing gap between the current and the ideal network-on-chip (NoC) power/performance metrics in application-specific multi-core systems-on-chip (SoCs) [1]. This class of optimization methods tries to customize the architecture of an NoC for a target application, when the application and its traffic characteristics are known at design time, which is the case for most embedded applications running on multi-core SoCs. Most state-of-the-art NoC architectures and their design flows for application-specific multi-core SoCs provide design-time

Manuscript received January 19, 2010; revised May 26, 2010; accepted July 14, 2010. Date of publication September 07, 2010; date of current version September 14, 2011.
M. Modarressi is with the Department of Computer Engineering, Sharif University of Technology, Tehran 11155-9517, Iran (e-mail: modarressi@ce.sharif.edu).
A. Tavakkol is with the School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran 19295-5746, Iran (e-mail: arasht@ipm.ir).
H. Sarbazi-Azad is with the Department of Computer Engineering, Sharif University of Technology, Tehran 11155-9517, Iran, and also with the School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran 19295-5746, Iran (e-mail: azad@sharif.edu; azad@ipm.ir).
Digital Object Identifier 10.1109/TVLSI.2010.2066586

NoC optimization for a single application [2]. In other words, they try to generate and synthesize an optimized NoC based on the traffic pattern of a single application. However, recent multi-core SoCs (mainly programmable multi-core SoCs) have become highly complex and cost-effective, and as technology advances, integrating several different applications into a single SoC becomes more cost-effective. As a result, such NoC architectures should closely match the traffic characteristics and performance requirements of different target applications. However, since different applications have different functionalities, the inter-core communication characteristics can be very different across the applications. Consequently, an NoC that is designed to run exactly one application does not necessarily meet the design constraints of other applications. Prior work [3] shows that, in simulations of over 1500 different NoC configurations (with different topology, buffer size, and/or bit-width), no single NoC provides optimal performance across a range of applications.

Optimizing the network topology and the core to network mapping are two important application-specific NoC customization methods which dramatically affect the network's performance-related characteristics, such as average inter-core distance, total wiring length, and the distribution of communication flows. These characteristics, in turn, determine the power consumption and average network latency of the NoC architecture. Topology determines the connectivity of the NoC nodes, while mapping determines on which node each processing core should be physically placed. Mapping algorithms generally try to place the processing cores that communicate more frequently near each other; note that when the number of intermediate routers between two communicating cores is reduced, the power consumption and latency of their communications decrease proportionally.
Like other application-specific optimization methods, the existing application-specific mapping and topology generation methods generate a customized topology and mapping for a single application, and the physical placement of the cores cannot be changed once the mapping is performed. In this paper, we introduce an NoC with reconfigurable connections to dynamically change the connectivity among cores. It enables the network topology to dynamically match the communication pattern of the currently running application. The reconfiguration of the proposed architecture is achieved by inserting several simple switches in the network, allowing the network to dynamically change the inter-node connections and implement the topology that best matches the communication pattern of the running application. In other words, we try to reduce the hop count (or number of routers) between the source and destination

1063-8210/$26.00 © 2010 IEEE

MODARRESSI et al.: APPLICATION-AWARE TOPOLOGY RECONFIGURATION FOR ON-CHIP NETWORKS


nodes of high volume communication flows by bypassing the intermediate routers. This can lead to considerable performance improvement since the latency (power) of the router pipeline stages makes a significant contribution to the total NoC latency (power). For example, in Intel's 80-core TeraFlops, more than 80% of the on-chip communication power is consumed by the routers [4].

The topologies proposed for on-chip networks vary from regular tile-based [5], [6] to fully customized [7], [8] structures. Since fully customized NoCs are designed and optimized for some specific applications, they give the best performance and power results for those applications. On the other hand, regular NoC architectures provide standard structured interconnects which ensure well-controlled electrical parameters. In these topologies, designers can solve usual physical design issues like crosstalk tolerance, timing closure, and wire routing for a specific regular topology and reuse it in several designs. This reusability alleviates the predictability problem in deep submicrometer technologies [9] and can effectively reduce the design time. Our proposed NoC architecture can be placed between these two extremes of NoC design schemes and benefit from both worlds. While this NoC architecture is designed and optimized like regular NoCs, it can be dynamically reconfigured to a topology that best matches the traffic pattern of the currently running application. In other words, this architecture realizes application-specific topologies over structured and regular components.

The main contributions of this work are as follows.
1) First, we propose a novel reconfigurable NoC architecture on which a range of regular and application-specific topologies can be implemented.
2) Second, with a set of different applications presented as input, we propose a two-phase algorithm which first maps the tasks of the applications onto the NoC nodes and then implements a suitable topology for each input application by appropriately configuring the NoC connections.

The rest of this paper is organized as follows. In Section II, related work is addressed. Section III presents the proposed reconfigurable NoC architecture, followed by Section IV, which deals with the mapping and topology selection algorithms developed for the proposed reconfigurable NoC. The experimental results appear in Section V, where the performance and power consumption of the proposed NoC architecture are evaluated. In Section VI, simulation results under cost constraints are reported. Finally, concluding remarks are given in Section VII.

II. RELATED WORK

The need for scalable on-chip communication architectures has been addressed in a number of studies [10], [11]. Many researchers have then proposed different optimization methods for on-chip networks to reduce the power consumption and message latency. Some of these methods focus on reducing the power consumption and latency within a network hop by optimizing the micro-architecture of the router and switching mechanism [12], [15], [16]. To reduce the message latency, most of these (often general-purpose) optimization methods try to

cut down the router's critical path delay by parallelizing multiple pipeline stages [12]–[14]. Similarly, the power reduction is mostly achieved by reducing the router activity and the total capacitance switched per cycle [17], [18]. These methods are all orthogonal to our reconfigurable topology, which aims to reduce the hop count rather than per-hop energy and latency. Mapping, routing, and topology selection mechanisms, on the other hand, try to decrease the average number of hops that messages take in the network, knowing the traffic pattern of an application. The problems of topology selection and core mapping on NoC nodes have been explored by many researchers [19]–[22], all limited to optimizing an NoC based on a set of communication constraints obtained from a single application.

Most NoCs implement regular topologies that can be laid out on a 2-D plane [5], [6], [23]. Despite the advantages of meshes for on-chip implementation, some packets may suffer from long latencies due to the lack of short paths between remotely located nodes. Furthermore, the communication traffic characteristics of multi-core SoCs used in embedded systems are nonuniform and can usually be obtained statically. As a result, an application-specific NoC with a custom topology which satisfies the design objectives and constraints of the target application is more appropriate [7], [8], [21], [24].

Compared to other research areas in the field of NoC, few works have considered optimizing an NoC for multiple applications [3], [25]–[28]. In some previous work, we investigated the effect of a reconfigurable topology on NoC power consumption [27], [29]. This work extends our previous work [27] by proposing a more flexible and efficient structure and providing extensive evaluations to thoroughly investigate the power, latency, and area of the reconfigurable NoC. In [28], a worst-case NoC design flow based on the delay and throughput constraints of a set of input applications is proposed.
By applying Dynamic Voltage Scaling (DVS) and Dynamic Frequency Scaling (DFS) at run time, the power consumption of the NoC is optimized while the performance constraints of the currently running application are met. Kim et al. [3] introduce a polymorphic NoC with a configurable set of buffers, crossbars, and links on which an arbitrary network can be constructed. The network can be configured to offer the same performance as a fixed-function network, while incurring 40% area overhead on average. However, the authors have not analyzed the power consumption of their NoC architecture. In [25], reconfiguration is achieved by wrapping the NoC routers with logic called a topology switch. In this architecture, the topology can be dynamically reconfigured by bypassing some routers through these switches. The authors report a 56% reduction in power consumption and 10% area overhead for a multimedia application. However, they do not present a specific algorithm to generate a topology for a given application and do not deal with possible deadlocks. In addition, in this architecture, the latency of the slowest (longest) link is not allowed to exceed the clock period of the NoC. Compared to this work, our proposal offers higher reconfigurability and is more suitable for large applications, especially when the task graph of the application is well-connected. Our proposed architecture also differs


Fig. 1. (a) Reconfigurable NoC architecture and (b) three possible switch configurations.

Fig. 2. Reconfigurable NoC configured as (a) a mesh and (b) a binary tree. The black node is the root of the binary tree. (c) The configuration switches on which the flits are buffered are shown as solid circles.

from other works mentioned above and can offer a more appropriate tradeoff between design flexibility and area overhead.

III. PROPOSED NOC ARCHITECTURE

A. Proposed NoC Architecture

The system under consideration is composed of nodes arranged as a 2-D mesh network. In the proposed NoC architecture, however, the routers are not connected directly to each other, but connected through simple switch boxes, called configuration switches (see Fig. 1). Each square box in Fig. 1 represents a network node which is composed of a processing element and a router, whereas each circle represents a configuration switch. Fig. 1(a) also shows the internal structure of a configuration switch. It consists of some simple transistor switches that can establish connections between incoming and outgoing links. In this figure, for the sake of simplicity, only a single connection is depicted between each two ports of a configuration switch. However, there are two connections between each two ports of a configuration switch in order to route the incoming and outgoing sub-links of bidirectional links independently. For example, the incoming sub-link of the N (north) port can be connected to the outgoing sub-link of one port while the outgoing sub-link of the N port is connected to the incoming sub-link of a different port. Actually, the internal connections are implemented by a multiplexer at each output port of the switch. Since a connection coming through an input port does not loop back, each multiplexer is connected to three input ports. Fig. 1(b) displays three possible switch configurations.

This NoC can be configured based on arbitrary topologies, including some standard topologies, if the configuration switches are set properly. For example, Fig. 2 displays a 5 × 5 network configured as a 2-D mesh [see Fig. 2(a)] and a 4 × 4 network configured as a binary tree [see Fig. 2(b)].
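The multiplexer-per-output structure described above can be modeled in a few lines. The following is a minimal sketch rather than the authors' implementation: the class name `ConfigSwitch`, the port labels `N`, `E`, `S`, `W`, and the restriction that an input drives at most one output are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

PORTS = ("N", "E", "S", "W")  # assumed port labels


@dataclass
class ConfigSwitch:
    """Hypothetical model of one configuration switch.

    Each output port owns a 3-input multiplexer that can select any of the
    other three ports as its source; a port never loops back to itself.
    """
    # select[out_port] is the input port feeding it, or None if unconfigured
    select: Dict[str, Optional[str]] = field(
        default_factory=lambda: {p: None for p in PORTS})

    def connect(self, in_port: str, out_port: str) -> None:
        if in_port not in PORTS or out_port not in PORTS:
            raise ValueError("unknown port")
        if in_port == out_port:
            raise ValueError("a connection may not loop back to its input port")
        if self.select[out_port] is not None:
            raise ValueError(f"output {out_port} is already configured")
        # Assumption of this sketch: one input drives at most one output.
        if in_port in self.select.values():
            raise ValueError(f"input {in_port} already drives an output")
        self.select[out_port] = in_port

    def route(self, in_port: str) -> Optional[str]:
        """Return the output port that in_port is forwarded to, if any."""
        for out_port, src in self.select.items():
            if src == in_port:
                return out_port
        return None
```

Calling `connect("N", "E")` and then `connect("W", "N")` on the same switch mirrors the independent routing of sub-links: traffic arriving on the north port leaves through the east port, while the outgoing north sub-link is fed from the west port.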
In general, dynamic hardware reconfiguration can only be implemented on reconfigurable devices; hence, most of the reconfigurable architectures are implemented using field-programmable gate arrays (FPGAs). However, since the

reconfigurable part of the proposed NoC is limited to some simple configuration switches, this NoC can be implemented on both FPGA and application-specific integrated circuit (ASIC) platforms.

An important consideration in the proposed topology is the long links that may be generated by merging a number of channel segments between configuration switches. Such long links in a conventional NoC may decrease the NoC clock frequency when flits cannot pass through them during a single NoC cycle. Even with repeaters inserted, the delay in the wires can exceed one clock period. This problem can be solved by segmenting the long links into fixed-length links connected by a register (1-flit buffer) and sending the data over them in a pipelined fashion. Since the connection between two adjacent nodes (on which flits travel in a single NoC cycle) consists of two channel segments (see Fig. 1), the registers should be distributed among the configuration switches in such a way that each flit is latched after passing through two channel segments. Fig. 2(c) shows the switches in which the flits are latched. For example, the bold wire segments in the figure form a long link between nodes A and B. Here, each flit passes over this link in five cycles in a pipelined manner. Different issues regarding pipelining long links have been addressed in [30] and [31].

Although we used the mesh topology as the basis of our reconfigurable NoC architecture, the proposed reconfiguration mechanism can be applied to other well-known topologies, such as torus, hypercube, and k-ary n-cubes. It can also be used in some modern topologies, such as concentrated mesh (or X-mesh) and WK-recursive networks, to further improve their performance when dealing with multiple applications.

B. Implementation Issues

Although the proposed NoC is not restricted to specific switching and routing schemes, the NoC routers in this study adopt a wormhole switching mechanism; note that wormhole switching best suits the limited buffering resources and low-latency communication requirements of on-chip networks. Like most application-specific optimized NoCs, this NoC applies a table-based routing scheme. This allows the NoC to support any static routing algorithm and is a suitable choice due to the irregular nature of application-specific topologies. It also allows designers to exploit their understanding of the application traffic characteristics and avoid network congestion by appropriately allocating paths to traffic flows. In [20], Hu and Marculescu compared the advantages and drawbacks of the dynamic and static routing schemes, and concluded that static routing is more suitable for NoC-based systems due to its lower cost, ease of implementation, and power efficiency.

Static power is often considered a major problem in deep-submicrometer circuits. To tackle this problem, we use a dynamic power management scheme that selectively turns off the components (links, routers, configuration switches) which are not used in a configuration.

As will be shown in Section V, in most cases, input channels of a configuration switch are connected to output channels in the dimension to which the input channel belongs. To exploit this behavior in order to decrease the configuration switch delay and power consumption, we use a multiplexer structure


Fig. 3. Area overhead of the reconfigurable NoC for different (a) buffer depths (in flits) when link width is fixed at 64 bits in a 6 × 6 NoC, (b) network sizes (number of nodes) when buffer depth is fixed at 16 flits and link width is assumed to be 64 bits, and (c) link widths (bits) when buffer depth is fixed at 16 flits in a 6 × 6 NoC.

in which the capacitive load between two ports of the configuration switch in the same dimension is small. This structure is inspired by the decomposed crossbar architecture [32]. More details of the switch internal structure can be found in [33]. To further exploit straight connections in configuration switches (i.e., E-W and N-S connections), our topology generation algorithm selects, among all possible paths between two nodes, the paths with the greater number of straight connections.
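The preference for straight connections can be illustrated with a short sketch. The path representation (one `(in_port, out_port)` pair per configuration switch crossed) and the function names are assumptions made here for illustration; the actual selection procedure is the one detailed in [33].

```python
from typing import List, Tuple

Hop = Tuple[str, str]  # (in_port, out_port) at one configuration switch

# E-W and N-S connections, in both directions
STRAIGHT = {("N", "S"), ("S", "N"), ("E", "W"), ("W", "E")}


def straight_count(path: List[Hop]) -> int:
    """Number of switches that the path crosses without turning."""
    return sum(1 for hop in path if hop in STRAIGHT)


def prefer_straight(paths: List[List[Hop]]) -> List[Hop]:
    """Among candidate paths, return one with the most straight connections."""
    return max(paths, key=straight_count)
```

Such a tie-breaker would only be applied among candidates that are otherwise equally good, since a straight connection changes the delay and power of a switch traversal but not the hop count.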

C. Cost of the Proposed NoC Architecture

Like FPGAs, the proposed reconfigurable network pays for its flexibility with additional area overhead. In this section, we evaluate the area of the proposed NoC architecture and compare it to a conventional packet-switched NoC. The conventional NoC is considered as a mesh-connected network of input-buffered routers with five input and five output ports. The same structure is considered for the reconfigurable NoC, but it includes some additional area due to the configuration switches.

The area estimation is done based on the hybrid synthesis-analytical area model employed in some tools (such as CACTI [34] and Orion [35]) and research papers [3], [36], [37]. In this method, the area of the router's building blocks (e.g., the area of a memory cell) is calculated for a specific technology point (65 nm, for example) and then analytically combined to estimate the total router area. Since the proposed NoC adopts conventional routers and links, we use the analytical area model of Balfour and Dally [36]. The parameter values in 65 nm are taken from CACTI [34] and Orion [35]. Although the exact area of an NoC in 65-nm technology can only be determined after synthesis and place-and-route, analytical models can give good insight into the area of NoCs, especially when comparing the area of NoCs with different configurations [3].

A typical NoC is composed of three types of components: routers, network interfaces (NI), and links. Our proposed NoC uses an extra component, i.e., the configuration switch. Router area is estimated based on the area of the input buffers and the crossbar switch, since the router area is dominated by these two components. Network interfaces generally include packetization and depacketization queues and logic. Each configuration switch is composed of four 3 × 1 multiplexers to implement internal connections.
As mentioned before, some switches have four additional registers, each of which latches incoming flits from one of the switch ports. The area of a configuration switch is much less than the area of a router

since it has small buffering capacity, no controller logic (for virtual channel and switch allocator, and routing logic), no network adapter, and uses a smaller switching fabric.

The area overhead due to the additional inter-router wires is analyzed by calculating the number of channels in a conventional and a reconfigurable mesh-based NoC. We consider two channel segments (see Fig. 1) in the reconfigurable NoC (which together have the same size as a channel in a conventional NoC) as one channel. We assume square-shaped cores of 1 mm × 1 mm. Consequently, each NoC channel is 1 mm long. The channel area model is taken from the Orion 2.0 area calculation function (since it offers more precise channel models) with repeaters optimized for delay.

Fig. 3 shows the effects of different architectural parameters, including buffer depth, network size, and link width, on the area overhead of the proposed reconfigurable NoC over a conventional NoC. In Fig. 3(a), the effect of buffer depth (in flits) on the area overhead in a 6 × 6 NoC with 64-bit links is shown. As can be seen in the figure, increasing the buffer depth reduces the area overhead. In this analysis, the depth of the network interface packetization (depacketization) queue is assumed to be as large as two 8-flit packets, which is a small size for the network interface queue. Obviously, by increasing the network interface queue size, the ratio of the network node area over the configuration switch area will increase, leading to further reduction in the area overhead of the proposed architecture.

Fig. 3(b) reports the sensitivity of the area overhead to the NoC size. Here, we consider networks with 64-bit links and 16-flit buffers. As can be seen in the figure, the effect of network size on the area overhead is more noticeable for small and medium sized NoCs. When the network size increases beyond 100 nodes, the area overhead grows very slowly.
Therefore, using the proposed architecture in future on-chip systems, with an ever increasing number of cores, will not impose huge changes in the design issues and implementation complexities.

Fig. 3(c) shows the sensitivity of the area overhead to the link bit-width of the NoC. Here, we have fixed the buffer depth at 16 flits in a 6 × 6 NoC. Obviously, by increasing the bit-width, the area of the data-path components (crossbars, links, and buffers) is increased. If the bit-width is scaled by a factor of $k$, the buffer and link area is scaled by $k$, while the crossbar area is scaled by $k^2$. As a result, by increasing the bit-width, the area overhead increases due to the additional crossbar-based switches we used. However, the variation in area overhead is small for common NoC bit-widths. As can be seen in the figure, the difference between the naive implementation of 8-bit links and the aggressive


implementation of 512-bit links is about 6%. This means that the area overhead of the proposed technique scales well with respect to the link width, which is desirable in future high-performance complex SoCs.

Another source of overhead is the storage space required to keep the configuration information of the switches. Each configuration switch is configured by eight bits divided into four two-bit parts, each of which controls one of the internal multiplexers of the switch. As a result, the configuration data for several applications can be kept in the switch with moderately low storage requirements. In this work, we assume that the NoC reconfiguration process is initiated by a configuration manager (which may be implemented in the application layer) whenever a switch takes place between two applications.

In addition to the area overhead discussed above, dynamic reconfiguration of the NoC may generally cause some performance overhead due to the time needed to load a new configuration to the NoC. However, in most SoC designs, the application switching time is on the order of a few milliseconds, which consists of the time needed to load the data and code of a new application into the SoC, send control signals to different parts of the SoC, and shut down the old application [28]. Switching between network configurations is done in parallel with application switching. As mentioned before, the configuration data of the proposed NoC architecture is small and can be stored in configuration switches and routers. Thus, the NoC configuration-switching time is far smaller than the time needed to switch between applications and does not impose any additional delay on the application switching procedure. It has been shown in [28] that even if the configuration data is stored in an off-chip memory, it can be loaded and distributed around the NoC in a few microseconds, which is still shorter than the application switching time.
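To make the eight-bit configuration word concrete, the sketch below packs four two-bit multiplexer selects into one byte. The field order (N, E, S, W), the numbering of each multiplexer's three inputs, and the use of code 3 for an unconnected output are all assumptions of this sketch; the text only states that each two-bit part controls one internal multiplexer.

```python
PORTS = ("N", "E", "S", "W")  # assumed field order, least significant first
UNUSED = 3                    # assumed code for an unconnected output


def mux_inputs(out_port):
    """The three candidate inputs of the multiplexer at out_port."""
    return [p for p in PORTS if p != out_port]


def encode(select):
    """Pack {out_port: in_port or None} into one byte, two bits per output."""
    word = 0
    for i, out_port in enumerate(PORTS):
        src = select.get(out_port)
        # index() raises ValueError for a loopback (src == out_port)
        code = UNUSED if src is None else mux_inputs(out_port).index(src)
        word |= code << (2 * i)
    return word


def decode(word):
    """Inverse of encode: recover the per-output source ports."""
    select = {}
    for i, out_port in enumerate(PORTS):
        code = (word >> (2 * i)) & 0b11
        select[out_port] = None if code == UNUSED else mux_inputs(out_port)[code]
    return select
```

Under this encoding, storing the configurations of several applications costs one byte per configuration switch per application, which is consistent with the moderately low storage requirement stated above.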
The energy dissipated for configuration switching can also be ignored due to infrequent switching and the small amount of data transition during each switching event.

IV. ENERGY AND PERFORMANCE-AWARE MAPPING

In this section, we address the mapping and routing problems in the proposed reconfigurable NoC. Simply stated, for the given set of input applications using a specific set of IP cores, our objective is to:
1) map the cores into different nodes of a reconfigurable NoC;
2) find a customized topology for each application, based on the mapping in the previous step and the application traffic characteristics; and
3) find a route for the traffic flows of each application, based on the topology found for the application.

We develop a two-step algorithm for this problem where core to network mapping is done at the first step and then topology and route generation are done concurrently at the next step. The idea behind splitting the procedure into two steps is that, in our proposal, the reconfigurability only changes the connectivity among the cores and not their physical placement; so, our system comprises some nodes with fixed placement and a reconfigurable set of links. Consequently, mapping and topology selection should be done per MPSoC and per application, respectively. This suggests that our procedure must be carried out in two subsequent steps: the first

step handles core-to-network mapping by considering all applications and the second step works on each individual application and forms a proper topology for it.

Each input application is described as a communication task graph (CTG). The CTG is a directed graph $G(V, E)$, where each vertex $v_i \in V$ represents a task, and a directed edge $e_{i,j} \in E$ represents the communication flow from $v_i$ to $v_j$. The communication volume (bits per second) corresponding to every edge $e_{i,j}$ is also provided and is denoted by $t(e_{i,j})$. It is assumed that each task is nonmigratory and already mapped onto an IP core; hence, each CTG node represents its corresponding core, as well.

A. Core to Network Mapping

At the first step, our objective is to figure out how to physically map the cores required by input applications onto different tiles of a mesh network such that the distances between the communicating cores are minimized. We assign a weight to each task graph based on its criticality, e.g., the percentage of time that the corresponding application is run on the NoC. Assigning weights enables the designer to bias the mapping for major or critical applications. This step is performed by constructing a synthetic task graph (the average task graph) from the task graphs of the given set of input applications. This average graph includes all the nodes of all task graphs of the input applications. For the edge between every pair of nodes, the weighted average of the volumes of the corresponding edges across all task graphs is calculated and used in the average task graph. If an edge does not exist in a task graph, its volume is considered to be zero. More formally, the weight (communication volume) of each edge is calculated as

$$t_{avg}(e_{i,j}) = \frac{\sum_{k=1}^{N} w_k \, t_k(e_{i,j})}{\sum_{k=1}^{N} w_k}$$

where $w_k$ represents the weight of the $k$th task graph and $t_k(e_{i,j})$ and $t_{avg}(e_{i,j})$ denote the volume of $e_{i,j}$ (which defines the edge between nodes $v_i$ and $v_j$) in the $k$th task graph and the average task graph, respectively. $N$ is the number of input task graphs.

The mapping problem can be formulated as follows. Given a synthetic average CTG $G(V, E)$ constructed from the task graphs of the input applications and a reconfigurable NoC (satisfying $|V| \le |U|$), find a mapping $map()$ from the CTG to the NoC nodes that minimizes

$$\sum_{e_{i,j} \in E} t(e_{i,j}) \times dist(map(v_i), map(v_j))$$

such that for every pair of distinct nodes $v_i$ and $v_j$ of the CTG, we have $map(v_i) \neq map(v_j)$. The constraint states that each core should be mapped to exactly one NoC node and no node can host more than one core. $dist(a, b)$ is the Manhattan distance between nodes $a$ and $b$ in the network and $map(v_i)$ is the network node to which CTG node $v_i$ is mapped. $|V|$ and $|U|$ denote the number of CTG and NoC nodes, respectively. Again, $t(e_{i,j})$ denotes the volume of $e_{i,j}$, which represents the edge between vertex $v_i$ and vertex $v_j$.

Core to network mapping is an NP-hard problem [38]; therefore, rather than searching for an optimal solution, it has been solved heuristically in prior works [5], [24], [39]. As the focus


of this paper is the topology reconfiguration, we perform mapping for the average graph using NMAP, a well-known and popular heuristic method presented in [5]. NMAP uses a heuristic algorithm for power-aware mapping of task graph nodes on a mesh-based network and selects a route for each task graph edge. We only use the mapping algorithm of NMAP and then, in the next step, propose a topology and route selection algorithm based on the reconfigurable network links. In the NMAP methodology, all cores are initially unmapped. Then, the core mapping is accomplished as follows.

Step 1) Map the core with the maximum communication demand onto one of the mesh nodes with the maximum number of neighbors.

Step 2) Select the core that communicates the most with the already mapped cores and examine all unallocated mesh nodes for placement. Select the node which minimizes the communication cost between the current core and the cores already mapped. The communication cost of mapping vertex $v_i$ of the CTG onto node $n$ of the NoC is given by $\sum_{v_j} t(e_{i,j}) \times dist(n, map(v_j))$, summed over the already mapped vertices $v_j$, where $dist(n, map(v_j))$ is the Manhattan distance between $n$ and the node to which CTG vertex $v_j$ is mapped. The process is repeated until all cores are mapped.

We refer interested readers to [5] for more details on NMAP.

B. Topology and Route Generation

Once the mapping is obtained from the average task graph, a suitable topology is constructed for each individual application. The reconfigurability of the proposed NoC architecture can be exploited for different objectives, such as guaranteeing the quality-of-service (QoS) level required by an application. However, the algorithms presented in this section aim to reduce the NoC average power consumption and message latency of a specific application being processed at a given time. To achieve this goal, we implement a topology for each application in which the number of hops between the source and destination nodes of high-volume communication flows is as small as possible.
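The core-to-network mapping step of Section IV-A can be sketched as follows. This is a simplified greedy placement in the spirit of the two mapping steps listed above, not the NMAP implementation of [5]. The function names, the dictionary-based graph format (`{(src, dst): volume}`), the normalization of the weighted average by the sum of the weights, and the treatment of traffic as undirected during placement are all assumptions of this sketch.

```python
from collections import defaultdict
from itertools import product


def average_graph(task_graphs, weights):
    """Weighted average of edge volumes; an absent edge counts as volume 0.

    Normalizing by the sum of the weights is an assumption of this sketch.
    """
    total = float(sum(weights))
    avg = defaultdict(float)
    for graph, w in zip(task_graphs, weights):
        for edge, volume in graph.items():
            avg[edge] += w * volume / total
    return dict(avg)


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def mapping_cost(avg, placement):
    """Sum of volume x Manhattan distance over all edges of the average graph."""
    return sum(vol * manhattan(placement[s], placement[d])
               for (s, d), vol in avg.items())


def greedy_map(avg, rows, cols):
    """Greedy placement of cores on a rows x cols mesh."""
    traffic = defaultdict(float)  # undirected volume between two cores
    demand = defaultdict(float)   # total volume entering or leaving a core
    for (src, dst), vol in avg.items():
        traffic[frozenset((src, dst))] += vol
        demand[src] += vol
        demand[dst] += vol
    cores = sorted(demand)
    if len(cores) > rows * cols:
        raise ValueError("more cores than mesh nodes")
    free = sorted(product(range(rows), range(cols)))

    def degree(node):
        r, c = node
        return sum(0 <= r + dr < rows and 0 <= c + dc < cols
                   for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)))

    def pull(core, placed):
        return sum(traffic.get(frozenset((core, m)), 0.0) for m in placed)

    # Step 1: heaviest core goes to a node with the most neighbors.
    first = max(cores, key=lambda c: demand[c])
    start = max(free, key=degree)
    placement = {first: start}
    free.remove(start)

    # Step 2: repeatedly place the core that talks most to the mapped cores
    # on the free node that minimizes its communication cost to them.
    while len(placement) < len(cores):
        unmapped = [c for c in cores if c not in placement]
        core = max(unmapped, key=lambda c: pull(c, placement))
        best = min(free, key=lambda n: sum(
            traffic.get(frozenset((core, m)), 0.0) * manhattan(n, pos)
            for m, pos in placement.items()))
        placement[core] = best
        free.remove(best)
    return placement
```

A typical use is `placement = greedy_map(average_graph(graphs, weights), rows, cols)`, after which `mapping_cost` evaluates the objective of the mapping formulation for that placement. Because the placement is shared by all applications, it is computed once from the average graph, whereas the topology is then derived separately for each application.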
The main idea is to choose the heaviest communication flow that is not yet assigned a route and find a path with the minimum possible hop count for it. Finding this route may involve configuring switches that are not yet configured in order to bypass some intermediate routers and make a shorter connection (in terms of hop count) between the end nodes. As a result, route selection and topology construction are done in parallel, within the same procedure. The algorithm can configure the unconfigured internal connections of the configuration switches, but not the connections that have been configured at previous iterations of the algorithm (based on the edges with higher volumes).

Initially, in the topology selection algorithm, all edges of an application task graph are sorted in decreasing order of their communication volumes and the internal connections of all configuration switches are unconfigured. Then, for each edge in that order, a branch-and-bound algorithm chooses the path with the least cost between its source and destination nodes.

We calculate the cost of a path based on the routers and configuration switches it includes. From the packet latency perspective, each router processes a flit in four cycles (two cycles for body flits), while it takes one cycle for a flit to pass through two configuration switches. From a power perspective, on the other hand, the power consumption of a configuration switch is 4 to 5 times lower than that of a router. As a result, we assign a cost of 1 to a configuration switch and a cost of 5 to a router. More details about the implementation of these algorithms, as well as the respective pseudocode, can be found in [33].

The algorithm searches for the optimal path by alternating the following branch and bound steps.

1) Branch: Starting from the source router of the selected edge, the algorithm makes a new branch by adding a router/configuration switch adjacent to the current node of the partial path. The current node is defined as the last node added to a partial path, through which the path will be extended. The added node must be located within the shortest-path area, i.e., between the source and destination nodes of the edge. The shortest-path area is defined by the nodes and configuration switches located along one of the shortest paths between the source and destination nodes, as well as their adjacent configuration switches. If the current node is a router, the path is extended by including its neighboring configuration switches along the shortest path towards the destination node. If the current node is a configuration switch, the path is extended by adding the neighboring routers or configuration switches along the shortest path. However, if the current configuration switch has already been configured (at previous steps of the algorithm) in such a way that the port through which the partial path reaches the switch is connected to another port, the path can only be extended along the direction determined by the current switch configuration, provided that this direction leads to a router along the shortest path towards the destination.

2) Bound: A branch (a partial path) is bounded (discarded) under the following conditions.
First, it is bounded if adding the new node violates the bandwidth constraint of the newly added link. More formally, the bandwidth constraint of each NoC link must be satisfied as

$$\sum_{e_{i,j}} X^{k}(e_{i,j}) \le BW(l_k), \quad \forall\, l_k$$

where $BW(l_k)$ is the bandwidth of link $l_k$ and $X^{k}(e_{i,j})$ is obtained by

$$X^{k}(e_{i,j}) = \begin{cases} vol(e_{i,j}), & \text{if } l_k \in path(e_{i,j}) \\ 0, & \text{otherwise} \end{cases}$$

where $path(e_{i,j})$ represents the set of links on which task graph edge $e_{i,j}$ (with volume $vol(e_{i,j})$) is mapped.

In addition, if the cost of a partial path that reaches a node is larger than the minimum cost of the existing partial paths already reaching that node, the path is discarded. The minimum cost of the already completed paths is also kept by the algorithm, and a partial path is bounded when its cost exceeds this value. Finally, we perform a connectivity check to verify that there is at least one path between the source and destination nodes of all edges that are not mapped yet. If the current partial path configures the switches in such a way that all possible paths between the source and destination nodes of at least one unmapped edge are blocked, the partial path is bounded.
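The cost model and the first two bounding conditions above can be illustrated with a small sketch. It is not the pseudocode of [33]: the helper names, the `("R", id)` / `("S", id)` hop encoding, and the dictionaries used for link loads and best-known costs are assumptions made for illustration. Only the cost weights (5 per router, 1 per configuration switch) come from the text, and the connectivity check is left out.

```python
ROUTER_COST = 5   # cost assigned to a router in the text
SWITCH_COST = 1   # cost assigned to a configuration switch in the text


def path_cost(path):
    """Cost of a (partial) path given as hops ("R", id) or ("S", id)."""
    return sum(ROUTER_COST if kind == "R" else SWITCH_COST
               for kind, _ in path)


def violates_bandwidth(link_load, link, volume, bandwidth):
    """True if adding `volume` to `link` would exceed the link bandwidth.

    link_load maps each link to the volume already routed over it.
    """
    return link_load.get(link, 0) + volume > bandwidth


def should_bound(partial_cost, node, best_to_node, best_complete):
    """Cost-based bounding of a partial path that has just reached `node`.

    best_to_node: minimum cost of partial paths already reaching each node.
    best_complete: minimum cost of the completed paths found so far.
    """
    if partial_cost > best_to_node.get(node, float("inf")):
        return True
    return partial_cost > best_complete
```

Under these assumed weights, a path crossing two routers and two configuration switches costs 12, whereas a path between the same end points through three routers costs 15, so the branch that bypasses the intermediate router is preferred.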


After a path with the minimum cost is found for an edge, it is established in the NoC by configuring all corresponding configuration switches within the path. Then, the algorithm continues with the next edge. The algorithm is repeated for all the edges of the application task graph. Once all task graph edges are mapped to a path in the NoC, the paths are analyzed to detect potential deadlocks. To this end, all cyclic dependencies among paths are broken by adding a virtual channel in one of the nodes of the cycle.

In this procedure, we assume that the applications are run on the NoC one at a time. Nonetheless, simultaneous execution of multiple applications is a likely scenario in MPSoCs. We can easily support simultaneous execution of multiple applications by the same procedure. To this end, we combine the task graphs of the applications which are allowed to run simultaneously, or within overlapping time intervals, into a single task graph called the compound task graph. The compound task graph includes all the nodes of all task graphs of the involved applications, and the weight of each edge is obtained as $vol_c(e_{i,j}) = \sum_{k} vol_k(e_{i,j})$, where $vol_k(e_{i,j})$ and $vol_c(e_{i,j})$ denote the volume of $e_{i,j}$ (which defines the edge between nodes $i$ and $j$) in the $k$th task graph and the compound task graph, respectively. We perform the topology generation and route selection steps for the new task graph. The selected configuration will then be loaded into the network when the applications are run simultaneously.
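As a minimal sketch of the compound task graph construction, the following assumes that each task graph is a dictionary from (source, destination) pairs to volumes and that the compound weight of an edge is the sum of its volumes over the simultaneously running applications. The summation and both function names are assumptions made for illustration. The second helper orders edges by decreasing volume, as the topology selection algorithm requires.

```python
from collections import Counter


def compound_task_graph(task_graphs):
    """Merge task graphs {(src, dst): volume} by summing per-edge volumes."""
    compound = Counter()
    for graph in task_graphs:
        compound.update(graph)  # Counter.update adds values key by key
    return dict(compound)


def edges_by_volume(graph):
    """Edges of a task graph sorted by decreasing communication volume."""
    return sorted(graph, key=graph.get, reverse=True)
```

For two hypothetical graphs `{("a", "b"): 10, ("b", "c"): 5}` and `{("a", "b"): 3, ("c", "d"): 7}`, the compound graph carries a volume of 13 on edge `("a", "b")`, and the routing order is `("a", "b")`, `("c", "d")`, `("b", "c")`.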

V. EXPERIMENTAL RESULTS

To evaluate the performance of the proposed NoC architecture and the mapping and topology generation methodology, we perform simulations for some benchmark applications. The benchmark application set includes some synthetic task graphs generated by TGFF [40] together with task graphs of some existing SoC designs which have been widely used in the literature, including multi-window display (MWD) [41], video object plane decoder (VOPD) [5], and multi-media system (MMS) [6]. The MMS benchmark contains H.263 decoder, H.263 encoder, MP3 encoder, and MP3 decoder applications. However, for the first two SoCs, which have a single application task graph, we generate two additional task graphs for each of them, called X-50% and X-25%, where X is the name of the application. The X-25% and X-50% task graphs are generated by replacing the source and destination nodes of the edges of task graph X with other randomly chosen nodes (i.e., moving the edges of the task graph to other positions) with a probability of 25% and 50%, respectively. We assign a weight of 0.5 to the base task graph X and a weight of 0.25 to each of the X-25% and X-50% task graphs, and then integrate the three task graphs into a single NoC, according to the design flow described in the previous section. The conventional NoC used for comparison applies the same mapping as its reconfigurable counterpart, but its topology is fixed during execution of the applications. The communication flows of the conventional NoC are directed by the X-Y deterministic routing scheme.

The task graphs of different MMS applications are depicted in Fig. 4, where the edge tags represent the communication volume between the source and destination nodes of the edge in megabits per second. We set the input rate of the MMS programs in order to scale the rate of their communication flows to several megabits per second. The applications use the same set of IP cores, but the traffic pattern among the cores is different for each application. In Fig. 4, the edges belonging to each individual application are bolded in the task graph corresponding to that application. Different tasks of this application suite are mapped on 12 cores. The physical mapping is accomplished based on the average graph of the four input task graphs. The algorithm then finds a suitable topology for each application, which is illustrated next to the corresponding task graph in Fig. 4.

We have evaluated the proposed NoC architecture using Xmulator, a fully parameterized simulator for interconnection networks [42]. The simulator is augmented with the Orion power library [35] to calculate the power consumption of the networks. Simulation experiments are performed for a 64-bit wide system with speculative four-stage pipelined wormhole routers [43], 16-flit buffers, and 8-flit packets. The power results reported by Orion are based on an NoC implemented in 65-nm technology, and the working frequency of the NoC is set based on the bandwidth requirements of each benchmark (5 GHz for MMS applications, and 200 MHz for the other applications). In simulation experiments, packets are generated with exponential distribution and the communication rates between any two nodes are set to be proportional to the communication volume between them in the task graph.

Table I displays the average packet latency and power consumption of the reconfigurable NoC and its equivalent conventional NoC for the described multi-core SoC benchmarks. The results show considerable power and performance improvements over a conventional NoC using NMAP.
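The X-25% and X-50% variants described earlier in this section can be produced along the following lines. This is a hedged sketch, not the generator used for the reported experiments: the function name, the dictionary representation, the choice to draw a new distinct (source, destination) pair for each moved edge, and the merging of colliding edges by adding their volumes are all assumptions.

```python
import random


def perturb_task_graph(edges, nodes, probability, seed=None):
    """Move each edge to random end points with the given probability.

    edges: {(src, dst): volume}; nodes: list of candidate node identifiers.
    """
    rng = random.Random(seed)
    result = {}
    for (src, dst), volume in edges.items():
        if rng.random() < probability:
            # Assumption: a moved edge gets two distinct random end points.
            src, dst = rng.sample(nodes, 2)
        # Assumption: edges landing on the same pair are merged by summing.
        result[(src, dst)] = result.get((src, dst), 0) + volume
    return result
```

Calling `perturb_task_graph(base, nodes, 0.25, seed=1)` and `perturb_task_graph(base, nodes, 0.5, seed=2)` on a base graph would give two variants in the spirit of X-25% and X-50%; the total communication volume is preserved in both.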
As the table indicates, the reconfiguration can adapt the topology to different applications and effectively reduce the power consumption and average packet latency of the NoC by up to 28% (16% on average) and 26% (10% on average), respectively. As the results for MMS applications show, the power and performance gains obtained by the proposed architecture for the MP3 encoder/decoder are higher than the gains obtained for the H.263 encoder/decoder. The reason is that the volumes of the communication flows of the H.263 applications, which are larger than the communication volumes of the MP3 applications, bias the mapping towards the H.263 encoder and decoder. As a result, the source and destination nodes of the H.263 applications are mapped near each other, while the nodes required by the MP3 applications are placed at greater distances. However, as Fig. 4(c) and (d) show, we can configure the NoC to establish a direct connection between almost all communicating nodes of the MP3 encoder and decoder. This leads to power and performance values close to the case when the communicating cores are mapped into nearby nodes. Similarly, for the MWD and VOPD benchmarks, the weight of the original application is larger than that of the two synthetic applications, which biases the mapping towards the original application. As a result, reconfiguration brings about more power


Fig. 4. Task graph and corresponding topology for (a) H.263 decoder, (b) H.263 encoder, (c) MP3 decoder, and (d) MP3 encoder.

and performance gains for the synthetic applications (X-25% and X-50%). The reported power includes both dynamic and static parts. As mentioned in Section III, a dynamic power management

scheme is used to deactivate unused router and switch ports in order to decrease the static power consumption of the reconfigurable NoC. The idle routers and links of the conventional NoC are also deactivated. The VOPD benchmarks, however, require


TABLE I POWER CONSUMPTION (WATTS) AND MESSAGE LATENCY (FOR 8-FLIT PACKET) IN A CONVENTIONAL AND A RECONFIGURABLE NOC. THE CONVENTIONAL NOC USES NMAP

Fig. 5. (a) Average message latency and (b) power improvement of the reconfigurable NoC over a conventional NoC versus the difference between the traffic patterns of the task graphs for different NoC sizes.

more connections compared to other benchmarks, and thus involve more active switches, routers, and links. Consequently, the static power consumption of the reconfigurable NoC, when running the VOPD applications, is higher than the static power of a conventional NoC. This higher static power consumption may not be compensated by the obtained dynamic power saving, as reported for the VOPD original application in Table I. Nonetheless, the proposed reconfigurable NoC always reduces the NoC dynamic power consumption, as illustrated in the last column of Table I.

To investigate the effect of the applications' traffic patterns on the obtained improvements more precisely, we perform the mapping and topology generation procedure for some synthetic task graphs generated by TGFF [40]. For each experiment, we generate a task graph by TGFF with communication volumes randomly assigned to its edges. Then, two other task graphs are generated by randomly changing the source and destination nodes of the edges belonging to the base task graph with a specific probability. The experiments are repeated for different network sizes to evaluate the scalability of the proposed reconfigurable NoC by studying the effect of the network size on the NoC performance improvement. The mapping and topology generation is performed based on the design flow of Section IV. The NoC parameters are the same as those used earlier in this section.

Fig. 5 reports the power and performance improvements of the reconfigurable NoC compared to a conventional NoC when the difference between the traffic patterns of the three task graphs integrated into a single NoC varies from 20% to 100% for different NoC sizes. Each point in Fig. 5(a) represents $\frac{P_{conv} - P_{reconf}}{P_{conv}} \times 100$, where $P_{reconf}$ and $P_{conv}$ are the average power consumption of the three task graphs integrated into a reconfigurable NoC and a conventional NoC, respectively. Fig. 5(b) displays the same results for the average message latency. The figure confirms the conclusion drawn previously and shows that when the difference among the traffic patterns of the applications integrated into the NoC increases, the proposed reconfiguration method reduces the power consumption and message latency more effectively. Moreover, Fig. 5 shows that higher improvements can be obtained by increasing the NoC size, since a larger NoC can result in larger average path lengths (in terms of hop count) for communication flows, which can be effectively reduced in our reconfigurable NoC.

A notable point in Fig. 5(a) is the behavior of the 8 × 8 NoC. In this case, when the difference among the traffic patterns increases over 60%, the conventional NoC soon reaches the saturation state. This effect is mainly due to the fact that the mapping is optimal for none of the task graphs, resulting in longer average message hop counts. Therefore, the network load, especially in central nodes, increases beyond the network bandwidth. The proposed reconfigurable NoC consistently provides short-cut paths to handle different traffic patterns; thus, in addition to providing lower latency, it saturates at a higher traffic rate compared to the conventional NoC. This is because packets traveling over long channels do not need to win switch ports (and virtual channels) while bypassing intermediate routers, which is equivalent to 100% resource allocation efficiency. This reduction of contention for network resources leads to a higher throughput.

The power consumption improvement of the 8 × 8 NoC, illustrated in Fig. 5(b), continues to increase as expected until the network gets saturated. In this situation, the power of the conventional NoC does not increase due to blocked packets, while its reconfigurable counterpart is still working normally and transferring more packets, which in turn consume more power.

We have also compared our proposal with ReNoC, the reconfigurable NoC architecture proposed in [25]. As ReNoC considers no specific algorithm for topology generation, we consider the VOPD application, for which the topology is provided [25] on a 4 × 3 mesh. We assume that the packets are sent over the long links in a pipelined fashion. We use the same simulation parameters used earlier in this section for the VOPD application. Our area model (described earlier in Section III) shows that the area of a 64-bit 4 × 3 ReNoC is 12% less than the area of our proposed reconfigurable NoC of the same size. Simulation results, reported in Table II, show that our reconfigurable NoC outperforms the ReNoC and reduces the power consumption by 10% and the average packet latency by 7%. For a fair comparison, Table II also reports the power and latency results of same-sized NoCs. To this end, we increase the ReNoC bit-width. Based on our area model, a 72-bit ReNoC has almost the same area as our 64-bit reconfigurable NoC. Simulation results show that the proposed NoC with 64-bit links still outperforms the ReNoC with 72-bit links. Note that the main advantage of our reconfigurable NoC, with respect to the ReNoC, is that it can form more direct connections for the communication flows. Therefore, for the task graphs with more communication flows (i.e., well-connected task graphs), for which ReNoC can provide direct paths for only a limited number of communication flows, the difference between our proposed NoC and ReNoC will be more noticeable.

TABLE II COMPARISON OF THE PROPOSED NOC AND RENOC FOR VOPD WORKLOAD. THE AREA NUMBERS ARE NORMALIZED TO THE AREA OF THE PROPOSED NOC

VI. PERFORMANCE EVALUATION WITH COST CONSTRAINT

In this section, we compare the proposed NoC with the conventional NoC under a cost constraint. That is, the extra logic used in our reconfigurable NoC can instead be invested in other network parts.
For example, one can consider increasing the bit-width or the buffer depth of the conventional NoC. Another possible case is to invest the area overhead of the reconfigurable NoC to make the processing cores faster. Our area model reveals that the areas of the reconfigurable NoC and a conventional NoC with the parameters mentioned above are 1.75 and 1.37 mm$^2$, respectively. Assuming 1 mm × 1 mm cores, the reconfigurable NoC imposes less than 3% overhead to the entire area used by the cores. According to Pollack's rule [44], the increase in processor performance is roughly proportional to the square root of the increase in its logic, i.e., its area. As a result, investing the area overhead to increase the performance (or area) of the cores in the SoC with a conventional NoC leads to about a 1% increase in the performance of each core (since $\sqrt{1.03} \approx 1.015$). Therefore, investing this area overhead to increase the network performance would be a better choice, especially in cases where the network lies within the critical path of the system.

Fig. 6. Reconfigurable NoC with the corridor width of 2.

Fig. 7. Area of the selected NoC configurations (numbers are normalized with respect to the area of the first bar).

A. Comparison With NoCs With Larger Buffering Capacity

We consider NoC configurations with various numbers of virtual channels per physical channel and buffer depths. More precisely, we aim to use the extra logic to provide the NoC with additional buffering capacity (by increasing the number of virtual channels and/or the buffer depth). We also introduce another reconfigurable structure in which more flexibility is obtained by increasing the number of configuration switches between two adjacent routers, i.e., the NoC corridor width. An NoC with corridor width of 2 is depicted in Fig. 6. The routers in this architecture are the conventional five-port routers. Obviously, this extra flexibility comes at the price of larger network area due to the additional configuration switches and channels. When using reconfigurable NoCs with larger corridor width, where the number of configuration switches between every two adjacent routers is increased, more improvements can be obtained in terms of packet energy and delay, especially for applications with a large number of inter-core communication flows. Fig. 7 demonstrates the area of different NoC configurations with different buffer sizes and number of virtual channels. The


Fig. 8. (a) Average message latency (cycles for 8-flit packet) and (b) power (Watts) of the three NoC configurations of Config-Set1 under the MMS traffic and its variants.

Fig. 9. (a) Average packet latency (cycles for 8-flit packet) and (b) power (Watts) of the five NoC configurations of Config-Set2 under the MMS traffic and its variants.

area values are normalized with respect to the area of the conventional NoC with one virtual channel per physical channel and 8-flit buffers. As shown in the figure, the evaluated configurations are grouped in two sets. The first set, referred to as Config-Set1, includes a conventional NoC with two virtual channels and 8-flit buffers (Conv. 2VC, 8-Flit Buffers), a conventional NoC with one virtual channel and 16-flit buffers (Conv. 1VC, 16-Flit Buffers), and a reconfigurable NoC with corridor width of one, one virtual channel, and 8-flit buffers (Reconfig-C1, 1VC, 8-Flit Buffers), all with almost equal area. The bars corresponding to these configurations are colored in light grey. The second set includes a conventional NoC with three virtual channels and 8-flit buffers (Conv. 3VC, 8-Flit Buffers), a conventional NoC with one virtual channel and 24-flit buffers (Conv. 1VC, 24-Flit Buffers), a reconfigurable NoC with corridor width of one, two virtual channels, and 8-flit buffers (Reconfig-C1, 2VC, 8-Flit Buffers), a reconfigurable NoC with corridor width of two, one virtual channel, and 8-flit buffers (Reconfig-C2, 1VC, 8-Flit Buffers), and a reconfigurable NoC with corridor width of one, one virtual channel, and 16-flit buffers (Reconfig-C1, 1VC, 16-Flit Buffers), all with almost the same area. Let us refer to this set of configurations as Config-Set2. To consider the cost constraint, Figs. 8 and 9 compare the power and latency results for the Config-Set1 and Config-Set2 NoC configurations. In this experiment, one instance of the H.263 encoder and decoder applications together with five instances of

MP3 decoder and encoder applications are run simultaneously on the NoC. Increasing the number of MP3 instances changes the traffic pattern over the base MMS application set which is used for core to network mapping. This helps to evaluate the effect of corridor width on NoC adaptability, which is the focus of this paper. We also evaluate the impact of increasing the on-chip traffic on the performance improvement of the proposed architecture with different corridor widths. For this purpose, we generate two other task graphs by randomly adding new edges to the MMS task graph in such a way that the volume of the MMS inter-core traffic is increased by 10% and 20%, respectively. We set the NoC parameters as in the previous section, but the network bit-width is increased to 128 bits so that it can cope with the excessive on-chip traffic generated by the cores. In order to fairly compare the selected NoC configurations, we set the working frequency of the NoCs to 6 GHz. When working at lower frequencies, the conventional NoCs soon reach the saturation point while the proposed reconfigurable NoCs work normally.

Fig. 8 shows that the reconfigurable NoC provides smaller latency and power consumption values than the two equivalent conventional configurations of Config-Set1. Moreover, the conventional NoC with two virtual channels provides lower latency and larger power consumption compared to the equivalent conventional NoC with deeper buffers and one virtual channel under the MMS workload.


Fig. 10. (a) Average packet latency (for 8-flit packet) and (b) power of the three NoC configurations of Config-Set1 and an NoC with wider bit-width under the MMS.

As Fig. 9 shows, the reconfigurable NoC with corridor width of 2 outperforms its equivalent counterparts in power consumption by 7% (up to 10% over a conventional NoC with three virtual channels) and in latency by 24% (up to 29% over a conventional NoC with 24-flit buffers). For the reconfigurable NoC with corridor width of 1, increasing the number of virtual channels to 2 while the buffer depth is halved (i.e., the total buffering space is kept constant) reduces the packet latency and increases the NoC power consumption. The power and latency of the two reconfigurable NoCs with corridor width of 1 are better than those of their conventional counterparts with the same area.

A notable point in Fig. 9(a) is that in the reconfigurable NoC with corridor width of 2, the average packet delay increases only slightly (less than 0.7 cycles) when the application changes from MMS to the higher-traffic task graph, while the average increase in packet delay is 2.7 cycles when the corridor width is 1. The reason is that by increasing the corridor width, the NoC still has enough resources to customize the topology for additional traffic flows when the on-chip traffic is increased; hence, it can keep the communication delay steady compared to other NoCs.

B. Comparison With NoCs With Larger Bit-Width

In this section, we investigate the performance improvement when the area overhead is invested to increase the NoC link width. By increasing the NoC bit-width, the area of the NoC data-path components (buffers, links, and crossbars) is increased. According to our area model, the area of a 128-bit reconfigurable NoC with corridor width of one, one virtual channel, and 8-flit buffers (Reconfig-C1, 1VC, 8-Flit Buffers, in Fig. 7) is almost the same as the area of a conventional NoC with a bit-width of 158. So, we add this configuration to Config-Set1. Fig. 10 shows the power and average message latency of these NoCs. The simulation parameters are exactly the same as those used in the previous section.
As can be seen in the figure, our reconfigurable NoC efficiently exploits the area overhead to provide shorter paths for the communication flows, leading to better performance and power results compared to the conventional NoC with equal area using wider links.

VII. CONCLUSION

We proposed a reconfigurable architecture for NoCs on which arbitrary application-specific topologies can be implemented.

Since entirely different applications may be executed on an SoC at different times, the on-chip traffic characteristics can vary significantly across different applications. However, almost all existing NoC design flows and the corresponding application-specific optimization methods customize NoCs based on the traffic characteristics of a single application. The reconfigurability of the proposed NoC architecture allows it to dynamically tailor its topology to the traffic pattern of different applications. In this paper, we first introduced a reconfigurable NoC architecture and evaluated its implementation cost in terms of area overhead over a conventional NoC. We then addressed the two problems of core to network mapping and topology exploration, in which the cores of a given set of input applications are physically mapped to the network and then a suitable topology is found for each individual application. Experimental results, using some multi-core SoC workloads, showed that this architecture effectively improves the performance of NoCs by 29% and reduces the power consumption by 9%, over one of the most efficient and popular mapping algorithms proposed for conventional NoCs. We then investigated the impact of different NoC parameters and application traffic characteristics on power, latency, and area. Compared to previous reconfigurable proposals, and regarding the imposed area overhead and power/performance gains, the proposed NoC introduces a more appropriate tradeoff between cost and flexibility.

REFERENCES
[1] J. Owens, W. J. Dally, R. Ho, D. N. Jayasimha, S. W. Keckler, and L. S. Peh, "Research challenges for on-chip interconnection networks," IEEE Micro, vol. 27, no. 5, pp. 96–108, May 2007.
[2] T. Bjerregaard and S. Mahadevan, "A survey of research and practices of network-on-chip," ACM Comput. Surveys, vol. 38, no. 1, pp. 1–51, 2006.
[3] M. Kim, J. Davis, M. Oskin, and T. Austin, "Polymorphic on-chip networks," in Proc. ISCA, 2008, pp. 101–112.
[4] Y. Hoskote, S. Vangal, A. Singh, N. Borkar, and S. Borkar, "A 5-GHz mesh interconnect for a Teraflops processor," IEEE Micro, vol. 27, no. 5, pp. 51–61, May 2007.
[5] S. Murali and G. De Micheli, "Bandwidth-constrained mapping of cores onto NoC architectures," in Proc. Des. Autom. Test Euro. (DATE), 2004, pp. 896–901.
[6] J. Hu and R. Marculescu, "Energy- and performance-aware mapping for regular NoC architectures," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 24, no. 6, pp. 551–562, Apr. 2005.
[7] U. Ogras and R. Marculescu, "Energy- and performance-driven customized architecture synthesis using a decomposition approach," in Proc. Des. Autom. Test Euro. Conf., 2005, pp. 352–357.


[8] J. Chan and S. Parameswaran, "NoCOUT: NoC topology generation with mixed packet-switched and point-to-point networks," in Proc. Asia South Pacific Des. Autom. Conf., 2008, pp. 256–270.
[9] F. Angiolini, L. Benini, P. Meloni, L. Raffo, and S. Carta, "Contrasting a NoC and a traditional interconnect fabric with layout awareness," in Proc. Des. Autom. Test Euro. (DATE), 2006, pp. 1–6.
[10] L. Benini and G. De Micheli, "Networks on chip: A new paradigm for systems on chip design," IEEE Comput., vol. 35, no. 1, pp. 70–78, Jan. 2001.
[11] A. Jantsch and H. Tenhunen, Networks on Chip. Norwell, MA: Kluwer, 2003.
[12] R. Mullins and S. Moore, "Low-latency virtual-channel routers for on-chip networks," in Proc. Int. Symp. Comput. Arch., 2004, pp. 188–197.
[13] L. S. Peh and W. J. Dally, "A delay model for router microarchitectures," IEEE Micro, vol. 2, no. 1, pp. 26–34, Jan. 2001.
[14] K. Kim, S. Lee, K. Lee, and H. J. Yoo, "An arbitration look-ahead scheme for reducing end-to-end latency in networks-on-chip," in Proc. Int. Symp. Circuits Syst. (ISCAS), 2005, pp. 2357–2360.
[15] P. Abad, V. Puente, J. Gregorio, and P. Prieto, "Rotary router: An efficient architecture for CMP interconnection networks," in Proc. Symp. Comput. Arch. (ISCA), 2007, pp. 116–125.
[16] A. Kumar, L. S. Peh, P. Kundu, and N. K. Jha, "Express virtual channels: Towards the ideal interconnection fabric," in Proc. Int. Symp. Comput. Arch. (ISCA), 2007, pp. 150–161.
[17] J. Kim, C. Nicopoulos, D. Park, V. Narayanan, M. Yousif, and C. Das, "A gracefully degrading and energy-efficient modular router architecture for on-chip networks," in Proc. Int. Symp. Comput. Arch. (ISCA), 2006, pp. 4–15.
[18] P. Meloni, S. Murali, S. Carta, M. Camplani, L. Raffo, and G. De Micheli, "Routing aware switch hardware customization for networks on chips," in Proc. NanoNet, 2006, pp. 1–5.
[19] J. Hu and R. Marculescu, "Energy-aware mapping for tile-based NoC architectures under performance constraints," in Proc. Asia South Pacific Des. Autom. Conf., 2003, pp. 233–239.
[20] J. Hu and R. Marculescu, "Exploiting the routing flexibility for energy/performance aware mapping of regular NoC architectures," in Proc. Des. Autom. Test Euro. (DATE), 2003, pp. 668–693.
[21] S. Murali and G. De Micheli, "SUNMAP: A tool for automatic topology selection and generation for NoCs," in Proc. Des. Autom. Conf. (DAC), 2004, pp. 914–919.
[22] D. Bertozzi, A. Jalabert, S. Murali, R. Tamahankar, S. Stergiou, L. Benini, and G. De Micheli, "NoC synthesis flow for customized domain specific multiprocessor systems-on-chip," IEEE Trans. Parallel Distrib. Syst., vol. 16, no. 2, pp. 113–129, Feb. 2005.
[23] E. Salminen, A. Kulmala, and T. Hamalainen, "Survey of network-on-chip proposals," OCP-IP White Paper, 2008.
[24] G. Ascia, V. Catania, and M. Palesi, "Multi-objective mapping for mesh-based NoC architectures," in Proc. ISSS-CODES, 2004, pp. 182–187.
[25] M. Stensgaard and J. Sparsø, "ReNoC: A network-on-chip architecture with reconfigurable topology," in Proc. Int. Symp. Networks-on-Chip (NoCS), 2008, pp. 55–64.
[26] S. Vassiliadis and I. Sourdis, "Flux networks: Interconnects on demand," in Proc. Int. Conf. Embed. Comput. Syst.: Arch., Model. Simulation (IC-SAMOS), 2006, pp. 160–167.
[27] M. Modarressi and H. Sarbazi-Azad, "Power-aware mapping for reconfigurable NoC architectures," in Proc. Int. Conf. Comput. Des. (ICCD), 2007, pp. 417–422.
[28] S. Murali, M. Coenen, R. Radulescu, K. Goossens, and G. De Micheli, "A methodology for mapping multiple use-cases onto networks on chips," in Proc. Des. Autom. Test Euro. (DATE), 2006, pp. 118–123.
[29] M. Modarressi, H. Sarbazi-Azad, and A. Tavakkol, "An efficient dynamically reconfigurable on-chip network architecture," in Proc. Des. Autom. Conf. (DAC), 2010, pp. 310–313.
[30] U. Ogras and R. Marculescu, "Application-specific network-on-chip architecture customization via long-range link insertion," in Proc. Des. Autom. Conf. (DAC), 2005.
[31] L. P. Carloni, L. McMillan, and K. Sangiovanni, "Theory of latency-insensitive design," IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst., vol. 20, no. 9, pp. 1059–1076, Sep. 2001.
[32] Y. Choi and T. Pinkston, "Evaluation of crossbar architectures for deadlock recovery routers," J. Parallel Distrib. Comput., vol. 61, no. 1, pp. 49–78, 2001.
[33] M. Modarressi, "A reconfigurable topology for NoCs," Tech. Rep. TR-HPCAN10-2, 2010.

[34] S. Thoziyoor, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi, "CACTI 5.1," HP Laboratories, USA, Tech. Rep. HPL-2008-20, 2008.
[35] A. Kahng, B. Li, L. Peh, and K. Samadi, "ORION 2.0: A fast and accurate NoC power and area model for early-stage design space exploration," in Proc. Des. Autom. Test Euro. (DATE), 2009, pp. 423-428.
[36] J. Balfour and W. J. Dally, "Design tradeoffs for tiled CMP on-chip networks," in Proc. Int. Conf. Supercomput., 2006, pp. 178-189.
[37] M. Kim, D. Kim, and E. Sobelman, "NoC link analysis under power and performance constraints," in Proc. ISCAS, 2006, pp. 4163-4166.
[38] G. Ascia, M. Catania, and M. Palesi, "An evolutionary approach to network-on-chip mapping problem," in Proc. IEEE Congr. Evolutionary Computation, 2005, pp. 112-119.
[39] K. Srinivasan, K. Chatha, and G. Konjevod, "Linear programming-based techniques for synthesis of network-on-chip architectures," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 14, no. 4, pp. 407-420, Apr. 2006.
[40] R. P. Dick, D. L. Rhodes, and W. Wolf, "TGFF: Task graphs for free," in Proc. Int. Workshop Hardw./Softw. Codes., 1998, pp. 97-101.
[41] K. Srinivasan and K. Chatha, "A low complexity heuristic for design of custom network-on-chip architectures," in Proc. Des. Autom. Test Euro. (DATE), 2006, pp. 130-135.
[42] HPCAN Laboratory, Sharif Univ. Technol., Tehran, Iran, "Xmulator NoC Simulator," 2009. [Online]. Available: http://www.xmulator.org
[43] W. J. Dally and B. Towles, Principles and Practices of Interconnection Networks. New York: Morgan-Kaufmann, 2004.
[44] S. Borkar, "Thousand core chips: A technology perspective," in Proc. 44th Des. Autom. Conf. (DAC), 2007, pp. 746-749.

Mehdi Modarressi (S'08) received the B.Sc. degree from Amirkabir University of Technology, Tehran, Iran, in 2003, and the M.Sc. degree from Sharif University of Technology, Tehran, Iran, in 2005, both in computer engineering. He is currently pursuing the Ph.D. degree in the Department of Computer Engineering, Sharif University of Technology. His research focuses on different aspects of high-performance computing and CAD, with a particular emphasis on network-on-chip architectures. He is currently with Parallel Systems Architecture Laboratories (PARSA), EPFL, Lausanne, Switzerland, as a visiting student.

Arash Tavakkol (S'08) received the B.Sc. and M.Sc. degrees in computer engineering from Sharif University of Technology, Tehran, Iran, in 2005 and 2008, respectively. Since February 2008, he has been with the Supercomputing Center, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran. His research interests include networks-on-chip, interconnection networks, and performance evaluation.

Hamid Sarbazi-Azad received the B.Sc. degree in electrical and computer engineering from Shahid-Beheshti University, Tehran, Iran, in 1992, the M.Sc. degree in computer engineering from Sharif University of Technology (SUT), Tehran, Iran, in 1994, and the Ph.D. degree in computing science from the University of Glasgow, Glasgow, U.K., in 2002. He is currently an Associate Professor of computer engineering at SUT, and heads the School of Computer Science, IPM, Tehran, Iran. He became a Distinguished Researcher of SUT in 2004, 2007, and 2008. His research interests include high-performance computer architectures, NoCs and SoCs, parallel and distributed systems, performance modeling/evaluation, graph theory and combinatorics, and wireless/mobile networks, on which he has published more than 250 refereed conference and journal papers. Dr. Sarbazi-Azad was a recipient of a Khwarizmi International Award in 2006 and a TWAS Young Scientist Award in engineering sciences in 2007.
