
IEEE TRANSACTIONS ON COMPUTERS, VOL. 54, NO. 5, MAY 2005, pp. 603-615

Dynamic Reconfiguration in Computer Clusters with Irregular Topologies in the Presence of Multiple Node and Link Failures
Dimiter Avresky, Senior Member, IEEE, and Natcho Natchev
Abstract: Component failures in high-speed computer networks can result in significant topological changes. In such cases, a network reconfiguration algorithm must be executed to restore the connectivity between the network nodes. Most contemporary networks use either static reconfiguration algorithms or stop the user traffic in order to prevent cyclic dependencies in the routing tables. The goal of this paper is to present NetRec, a dynamic network reconfiguration algorithm for tolerating multiple node and link failures in high-speed networks with arbitrary topology. The algorithm updates the routing tables asynchronously and does not require any global knowledge about the network topology. Certain phases of NetRec are executed in parallel, which reduces the reconfiguration time. The algorithm suspends the application traffic in small regions of the network only while the routing tables are being updated. The message complexity of NetRec is analyzed and the termination, liveness, and safety of the proposed algorithm are proven. Additionally, results are presented from validation of the algorithm in Distant, a distributed network-validation testbed based on the MPI 1.2 features for building arbitrary virtual topologies.

Index Terms: Dynamic reconfiguration, multiple node and link failures, fault tolerance, clusters of workstations, irregular topologies.

1 INTRODUCTION
High-speed local and system area networks [1], [2], [3] may change their topology due to failures of network components. In such cases, a reconfiguration algorithm must be executed to restore the connectivity between the network nodes. In recent years, a significant amount of work was devoted to these issues. However, most of the solutions are based either on redundant network paths or on regular network topologies. Currently, static reconfiguration techniques with predefined alternative paths, based on redundant hardware, are used in many contemporary high-speed networks [4], [5], [6]. Another approach is to use dynamic reconfiguration to establish new paths around the failures. However, reconfiguration-induced deadlock and livelock problems may arise when the routing tables are updated dynamically because additional dependencies may be established before old ones are removed [7]. To avoid this, various adaptive routing algorithms have been proposed for both regular [8], [9], [10], [11], [12], [13], [14], [15], [16] and irregular topologies [17]. These algorithms either ensure that the communication queue graph is acyclic or perform misrouting whenever the channel which leads to a message's destination is not available, as in chaotic routing [18] and deflection routing [8]. In the case of chaotic routing, the router waits until the internal queue is full before performing misrouting. Deflection routing algorithms misroute messages whenever the next channel toward the destination cannot be used.

Deadlock prevention techniques have also been applied to wormhole routing [19]. In [20], it was shown that assigning an order to the channels and restricting the routing to a decreasing channel order is a sufficient condition to prevent deadlocks. Adaptive routing methods were shown for n-cubes [21] and irregular topologies [22], [23]. In [7], a deadlock-free distributed reconfiguration algorithm for virtual cut-through networks is proposed which is able to asynchronously update the routing tables without stopping the user traffic in the entire network. It divides the network into regions and broadcasts the regional routing information to all nodes in the corresponding region. Hence, all intraregion nodes know the entire regional topology. Once each region is reconfigured, the topology information is broadcast to the neighboring regions. Another dynamic reconfiguration scheme, Partial Progressive Reconfiguration, is proposed in [24]. The algorithm performs sequences of partial routing table updates and involves a very complex synchronization on every step, which results in significant implementation complexity and reconfiguration time. It is only applicable to virtual cut-through networks with Up*/Down* routing [3] and requires two routing tables to be maintained to avoid reconfiguration-induced dependencies. In [25], several dynamic reconfiguration schemes are presented, applicable to networks with arbitrary switching and routing. However, they require each router to monitor the routing tables of each of its neighbors and, moreover, they require several sets of virtual or physical channels to be dedicated/used for reconfiguration.

. The authors are with the Electrical and Computer Engineering Department, Northeastern University, 440 Dana Research Center, Northeastern University, 360 Huntington Ave., Boston, MA 02115. E-mail: {avresky, nnatchev}@ece.neu.edu. Manuscript received 16 Apr. 2004; revised 18 Aug. 2004; accepted 19 Nov. 2004; published online 16 Mar. 2005. For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number TC-0132-0404.


In [26], the authors proposed a dynamic reconfiguration algorithm for tolerating a single node failure in high-speed networks with arbitrary topology. The goal of this paper is to present and validate an extension of the algorithm for tolerating multiple simultaneous node and link failures. The proposed algorithm is distributed and does not require global knowledge about the network topology. Unlike many of the above-mentioned algorithms, it is applicable for arbitrary switching and routing techniques in both regular and irregular topologies and does not require additional virtual or physical channels. NetRec requires only a small number of messages between the failure-affected nodes because the nodes involved in each reconfiguration are contained within only one region and no interregional routing information is exchanged. It avoids potential deadlocks by performing sequences of routing table updates and by suspending the application traffic in small regions while the routing tables are being updated. The algorithm was validated in a distributed testbed, Distant, which is based on MPI 1.2 features for building arbitrary virtual topologies.

The rest of the paper is organized as follows: In Section 2, NetRec is presented. In Section 3, some examples of NetRec's operation are given. Section 4 presents complexity analysis and proves the correctness of the algorithm, and Section 5 provides results from validation of NetRec. Finally, Section 6 concludes the paper. Pseudocode of the algorithm is provided in the Appendix, which can be found on the Computer Society Digital Library at http://computer.org/tc/archives.htm.

Fig. 1. Irregular topology with three failures.

2 DYNAMIC RECONFIGURATION ALGORITHM NETREC

We focus our attention on networks with distributed, deterministic packet routing, based on routing tables in each node. Additionally, the nodes can use explicit-path routing for packets that specify each hop of the path to the final destination. NetRec does not use global knowledge about the topology; it only requires that each network node have a unique ID and know the IDs of its neighbors and of all nodes that are two hops away from it through each of its links. As a result of any failure, some of the network paths are broken. The objective of NetRec is to dynamically find restoration paths, which is done in several major phases: restoration leader election, restoration tree construction, multiple failures synchronization, and routing information update.

Consider the topology in Fig. 1 with 10 nodes, where nodes 3, 7, and 8 have failed. In the text below, they are called faulty nodes. The immediate neighbors of a faulty node F are referred to as FIN_F's; e.g., nodes 1, 6, and 5 are FIN_7's. When a failure F is detected, NetRec is triggered asynchronously by all FIN_F's. For each failure, the dashed line encircles the nodes that will participate in NetRec. These regions contain all neighbors of the faulty node and the nodes that lie on the restoration paths between them, determined as specified below. The clouds C1-C10 represent the rest of the network, which is completely unaffected by the reconfiguration algorithm and may contain any combination of switches and end devices.
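To make this local-knowledge requirement concrete, the following minimal sketch (in Python, with names of our own choosing rather than the paper's) shows the per-node state that the description above implies; nothing beyond the node's ID, its per-port neighbor and two-hop-neighbor sets, and its routing table is assumed.

from dataclasses import dataclass, field
from typing import Dict, Set

@dataclass
class NodeState:
    # Only local knowledge is kept: NetRec never stores the global topology.
    node_id: int
    neighbors: Dict[int, int] = field(default_factory=dict)     # port -> ID of the directly attached node
    two_hop: Dict[int, Set[int]] = field(default_factory=dict)  # port -> IDs of nodes two hops away via that port
    routing_table: Dict[str, int] = field(default_factory=dict) # destination set (e.g., "C4") -> output port

# Example: node 4 of Fig. 1 would know its own ID, its neighbors on each port, and,
# per port, which IDs lie two hops away; the concrete numbers here are illustrative only.
node4 = NodeState(node_id=4, neighbors={0: 3, 1: 2}, two_hop={0: {1, 2}, 1: {1, 3}})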

2.1 Failure Detection
The issue of failure detection has been extensively studied in the literature [27], [28]. NetRec uses "I am alive" messages to monitor the liveness of the network switches and links. Packet CRC checksums can also be used to monitor the link state [5]. For failure detection in the network switches, different built-in test and diagnostic techniques can be utilized. Also, code diversification and duplication techniques can be used for failure detection [29]. In this paper, we assume that all node and link failures are permanent and the failed nodes are fail-silent, i.e., the failures are not Byzantine [30].

2.2 Failures during the Operation of NetRec
The question of how to handle failures during the operation of a distributed reconfiguration algorithm is complex and depends on many factors. It is worth noting that, in any high-speed network, the MTBF is typically higher than the time required to complete the NetRec reconfiguration procedures. Also, a failure-affected region is typically small compared to the whole high-speed network, so the probability that a second failure will occur in a failure-affected region before the reconfiguration is finished is very small. Before we proceed to describe NetRec in detail, we make the following assumptions:
Assumption 2.1. After a failure is detected, no additional failures will occur in the failure-affected region until NetRec finishes reconfiguring the region.

Assumption 2.2. The network is not partitioned as a result of the failures.

The second assumption states that, regardless of the failure pattern, the remaining nodes are still physically connected to each other. It is worth pointing out that NetRec can react to additional failures that may occur during its operation. Indeed, as specified below, the nodes participating in NetRec are well known to each other and the algorithm is based on exchanging messages in a strict order between these nodes, so any additional node or link failure in the affected region will result in some of these messages not



being delivered to their destinations. Such failures can be detected by the FINs by using timeouts on the expected messages. If additional failures are detected in a failure-affected region, then the nodes can take corrective actions; for example, NetRec might be restarted after a network stabilization period. This will ensure that all nodes which participate in a NetRec reconfiguration are correct and that the paths between them exist.
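As an illustration of the "I am alive" monitoring of Section 2.1 and the timeout-based detection of missing reconfiguration messages discussed above, the sketch below shows one possible (hypothetical) neighbor monitor; the timeout value and the class interface are assumptions of ours, not part of NetRec.

import time

class NeighborMonitor:
    """Reports a neighbor as failed if no 'I am alive' message has arrived within the timeout."""

    def __init__(self, neighbor_ids, timeout_s=3.0):
        self.timeout_s = timeout_s
        self.last_seen = {n: time.monotonic() for n in neighbor_ids}

    def on_alive(self, neighbor_id):
        # Called whenever an 'I am alive' message is received from a neighbor.
        self.last_seen[neighbor_id] = time.monotonic()

    def failed_neighbors(self):
        # Neighbors silent for longer than the timeout are treated as permanent,
        # fail-silent failures; detecting one triggers NetRec at this node.
        now = time.monotonic()
        return [n for n, t in self.last_seen.items() if now - t > self.timeout_s]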

2.3 Phase 0: Restoration Leader Election
All FIN_F's participate in the reconfiguration procedure, which is controlled by a restoration leader for failure F (RL_F) elected among them. NetRec does not possess global knowledge about the network, so the FIN_F with the highest ID becomes RL_F. All other FIN_F's assume the role of passive FIN_F's (pFIN_F's). The leader election is executed without exchanging messages because each FIN_F knows the IDs of all nodes that are two hops away through each of its links. For example, in Fig. 1, node 4 is RL for failure 3 (RL_3) and nodes 1 and 2 are pFIN_3's.

Link failures. Normally, a distinction between a node and a link failure cannot be made immediately by the nodes on each end of a faulty link. In this case, each of them will assume that the other one has failed and will execute the restoration leader election procedure without knowing that there is a link failure. Since this is a link failure, the other supposed FINs will not enter NetRec and, sooner or later, one of the two nodes will elect itself as RL by assuming that all previous possible RLs are faulty. This process is described in more detail in Phase 1.

2.4 Phase 1: Restoration Tree Construction
The next step of NetRec is to build a restoration tree for failure F (RT_F), which is rooted at RL_F and spans all pFIN_F's. For this purpose, RL_F broadcasts a Restoration Tree Message (RTM_F) which discovers the paths to all pFIN_F's (Step 1a of the pseudocode). All nodes that receive the RTM_F will stamp it with (add to it) their ID and broadcast it on their active links, except the one on which it arrived. A node may receive a copy of the RTM_F from several of its neighbors; such a node will process only the first one and drain the duplicates, which guarantees that, for a given failure, each node is visited only once. Because of the possibility of simultaneous node failures, the elected RL may be faulty, so if, after a given timeout, no RTM arrives, all nonfaulty pFINs will go back to the start of Phase 0 to elect a new RL.

Upon receiving a copy of the RTM_F, each pFIN_F will record the path traversed by the RTM_F as the new restoration path to the leader and will send this information back to RL_F through an RTM Response (RTMR_F) (Step 1b of the pseudocode). Since some of the pFIN_F's might be faulty, the RL will wait for RTMR_F's until a given timeout expires and will continue the reconfiguration only with the nodes that have replied. It is important to note that, in the next steps, all reconfiguration messages will be forwarded by explicit-path routing following these restoration paths. In general, the restoration tree structure depends on the network conditions and it may contain nodes that are not FINs; such nodes are referred to as affected nodes for failure F

Fig. 2. Two possible RTs for failure of 7.

(FAN_F's). For example, two possible restoration trees for failure 7 are presented in Fig. 2.

Link failures. The RTM and RTMR messages are also used to distinguish between node and link failures. If a FIN receives an RTM from a node which is assumed to be faulty, then there is a link failure and the FIN must update its reconfiguration information to reflect this. If both nodes on the ends of the faulty link have decided that they are restoration leaders, then they will receive RTMs from each other and the one with the higher ID will become the RL for the faulty link, while the other node will assume the pFIN state. After this, NetRec can be applied to reconfigure around the faulty link.

At the end of Phase 1, the RL sends a Build Restoration Tree (BRT) message to the pFINs (Step 1c of the pseudocode). Any node that forwards a BRT marks the arrival and departure links of the message as belonging to the restoration tree for the failure. Additionally, if the node is not a pFIN for the failure, it will become a FAN. Upon receiving the BRT messages, the pFINs reply to the RL with a BRT Response (BRTR) message to confirm that the RT has been successfully built.

It is worth noting that, in the presence of multiple failures, Phase 1 is executed independently, in parallel, for all RTs. In such a case, two or more restoration trees may have common nodes. We refer to this situation as intersecting restoration trees or intersecting restoration leaders. If a node belongs to RT_i and RT_j, then we refer to it as a joint node between the restoration trees (JN_ij). For example, in Fig. 1, nodes 1 and 2 are joint for RT_3 and RT_7. If two restoration trees have no joint nodes, then we refer to them as disjoint restoration trees or disjoint failures.

The BRTR messages are also used for RT intersection detection and notification. Assume that K is a joint node for RT_i and RT_j and it first processes BRTR_i; thus, RL_i will not be notified about the intersection. When K processes BRTR_j, it detects that there are intersecting RTs and marks BRTR_j with information about RL_i, thus notifying RL_j that there are intersecting RTs. If K is joint for more than two RTs, then each next BRTR will be used to notify the corresponding RL about all previously detected intersections. This procedure guarantees that at least one of the intersecting RLs will be notified about the intersection.
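The sketch below illustrates Phases 0 and 1 at a single node: the message-free leader election and the RTM handling (path stamping and duplicate draining). It is an illustrative rendering under our own naming, not the pseudocode from the Appendix.

def elect_restoration_leader(my_id, fin_ids):
    # Phase 0: every FIN independently picks the highest ID in the FIN set; no messages are exchanged.
    leader = max(fin_ids | {my_id})
    return leader, leader == my_id             # (RL for this failure, am I the leader?)

seen_rtm = set()                               # failures for which this node already processed an RTM

def on_rtm(my_id, failure_id, path, arrival_port, active_ports):
    """Phase 1, Step 1a at an intermediate node: stamp the RTM with our ID and
    forward it on all active links except the arrival one; drain duplicates.
    Returns (port, stamped_path) pairs to transmit."""
    if failure_id in seen_rtm:
        return []                              # duplicate copy of this RTM: drain it
    seen_rtm.add(failure_id)
    stamped_path = path + [my_id]              # record the path traversed so far
    return [(p, stamped_path) for p in active_ports if p != arrival_port]

# A pFIN that receives its first copy records stamped_path as its restoration path
# to the RL and returns it in an RTMR sent back along that path (Step 1b).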

2.5 Phase 2: Multiple Failures Synchronization
NetRec can reconfigure the network in the presence of simultaneous failures by ensuring that, at any given time, only RLs with nonintersecting RTs can proceed to Phase 3,


in which the routing information is updated. There are several different cases of multiple failures which can be handled by NetRec, the easiest being that of consecutive failures. According to Assumption 2.1, no additional failures will occur in the restoration tree while NetRec is operating. Thus, any consecutive failures are temporally disjoint and NetRec can be applied consecutively for each of them.

In addition to consecutive failures, NetRec can handle any combination of simultaneous node and link failures. If there are simultaneous disjoint failures, NetRec can be applied in parallel for each of them without synchronization between the respective RLs. Another case is simultaneous failures with the same restoration leader, in which the RL can execute Phase 1 of NetRec in parallel for all failures. In this way, a super RT, which spans the pFINs for all failures, is created. After that, the regular NetRec Phases 2 and 3 can be applied. In all other cases, there are simultaneous failures with intersecting restoration trees and their RLs have to establish such an order of priorities among them that will result in a sequence of safe reconfigurations around single failures or simultaneous disjoint failures. Based on this order, the intersecting RLs will wait in Phase 2 until all higher priority RLs have completed Phase 3. Depending on the intersection notifications that were delivered in Phase 1, the order among the RLs is established in Phase 2 as follows (see the pseudocode for Phase 2 in the Appendix, which can be found on the Computer Society Digital Library at http://computer.org/tc/archives.htm):

1. All RLs that have not been notified about intersections will proceed to Phase 3 without delay.
2. All RLs that have been notified about intersections have to synchronize with other RLs and will not enter Phase 3 until their priorities for all intersections have been established, which is done as follows:
   a. The notified RLs will send Synchronization Request Messages (SRMs) to all RLs about which they have been notified.
   b. When an RL receives an SRM, depending on its state, it will react as follows:
      - If it has started (and may have finished) Phase 3, then, after completing Phase 3, it will reply with an SRM Response (SRMR) to the sender. This ensures that all RLs that have sent SRMs to this RL will wait before proceeding to Phase 3.
      - If it has been notified about this intersection, then it has also sent an SRM to the corresponding RL. Thus, both RLs know about each other, so the one with the higher ID will automatically take priority and execute Phase 3 first.
      - If it has been notified about other intersections, but not about the one from the arriving SRM, and it has a higher ID than the sender, then it will have higher priority. Otherwise, it will reply with an SRM to the sender, this way giving priority to the sender.
   c. Any RL which has been notified about intersections and has received responses to all synchronization requests that it has sent can proceed to Phase 3. As mentioned above, upon completing Phase 3, such an RL will send Synchronization Responses to all leaders that have sent synchronization requests to it in Phase 2.

The SRMs and the SRMRs use the joint nodes as gateways between the intersecting RLs, i.e., if RL_i has to send a message to RL_j, then RL_i will send the message through an explicit path to the joint node, which will forward the message through an explicit path to RL_j.
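The decision an RL makes when an SRM arrives (rules a-c above) can be summarized as in the sketch below; the state fields and the send_srm callback are illustrative names of ours, not part of the paper's pseudocode.

from dataclasses import dataclass, field

@dataclass
class RLState:
    in_phase3: bool = False                            # Phase 3 started (or finished)
    notified_about: set = field(default_factory=set)   # RLs this leader has been notified about (SRMs already sent)
    owe_srmr: set = field(default_factory=set)         # RLs to answer with an SRMR after our Phase 3 completes
    waiting_for: set = field(default_factory=set)      # RLs whose SRMR we must receive before entering Phase 3

def on_srm(my_id, sender_id, state, send_srm):
    if state.in_phase3:
        # Already updating routes: the SRMR is sent only after Phase 3 completes.
        state.owe_srmr.add(sender_id)
    elif sender_id in state.notified_about:
        # Both RLs know about this intersection: the higher ID takes priority automatically.
        if my_id > sender_id:
            state.owe_srmr.add(sender_id)
        else:
            state.waiting_for.add(sender_id)
    elif my_id > sender_id:
        # Notified about other intersections only, but our ID is higher: keep priority.
        state.owe_srmr.add(sender_id)
    else:
        # Otherwise reply with an SRM, giving priority to the sender.
        send_srm(sender_id)
        state.waiting_for.add(sender_id)

# An RL proceeds to Phase 3 once waiting_for is empty; after Phase 3 it sends
# SRMRs to every leader recorded in owe_srmr.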

2.6 Phase 3: Routing Information Update
For each failure F, the paths from RT_F are used to replace the broken paths between the FIN_F's. This is done through a series of reconfiguration messages that are exchanged between the RL and the pFINs. First, the RL sends a Routing Update Request (RUR) message to each pFIN_F (Step 3a of the pseudocode). Upon receiving its RUR message, each pFIN_F replies to the RL with a Routing Update Message (RUM) which contains the list of destinations reachable by the pFIN through links that are not part of the RT. Each FAN on the RUM's path will update, if necessary, its routing table and will add to the first RUM which it forwards the list of destinations reachable by it through links that are not part of the RT. When the RL receives all RUMs, it uses the contained information to update its routing table, after which it generates a RUM Response (RUMR) message to each pFIN_F, which contains all destinations reachable by the RL, except those reachable through the link leading to the pFIN_F (Step 3b of the pseudocode). When a FAN receives an RUMR, it updates its routing table, if necessary, and updates the RUMR with the list of destinations that are reachable by the FAN through links that are part of the RT, except the arrival and departure links of the RUMR. When a pFIN receives its RUMR message, it updates its routing table and informs the RL that it is ready by sending an Update Complete Message (UCM). The reconfiguration completes when the RL receives UCMs from all pFINs. Finally, the RL sends Resume Operation Messages (ROMs) to all pFINs, which resume the normal operation of the pFINs and the FANs.

It is worth noting that, during Phase 3, the RT nodes will block all nonreconfiguration messages. However, the rest of the network continues normal message exchange. After NetRec updates the routing tables of all RT nodes, they resume normal message routing. Thus, NetRec does not suspend the operation of the entire network during the reconfiguration.
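At each RT node, the update itself amounts to pointing the destinations announced in a RUM or RUMR at the restoration-tree port on which that message arrived. The sketch below (our own simplification, echoing the single-failure example of Section 3.1) shows the idea; the concrete table entries are illustrative.

def apply_routing_update(routing_table, announced_destinations, arrival_port):
    # Redirect every announced destination onto the RT port the update arrived on,
    # replacing entries that previously pointed toward the failure.
    for dest in announced_destinations:
        routing_table[dest] = arrival_port
    return routing_table

# RL 7 in Section 3.1: destinations C0, C1, C2, and C4 were originally reached through
# port 1 (toward the failed node) and are remapped to port 2, over which the routing
# updates arrive; the C6 entry is invented filler to show an unaffected entry.
rl_table = {"C0": 1, "C1": 1, "C2": 1, "C4": 1, "C6": 3}
apply_routing_update(rl_table, ["C0", "C1", "C2", "C4"], arrival_port=2)
# rl_table is now {"C0": 2, "C1": 2, "C2": 2, "C4": 2, "C6": 3}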

3 EXAMPLES

3.1 Single Node Failure
An example of a network with a single node failure is shown in Fig. 3. The faulty node is 3 and the links that are lost are denoted by dashed lines. The figure displays details of the network only in the vicinity of the failure. The rest of the network is represented by the destination sets C_N. The


Fig. 3. Reconfiguration example.

original routing tables of the nonfaulty nodes are given in Table 1.

Phase 0. The first step of NetRec, after the failure is detected, is to elect a restoration leader among the FINs. For example, node 0 detects the failure on link 2. The nodes that are two hops away from 0 via that link are 7, 4, and 2, so 0 elects 7 as restoration leader. In a similar way, 7 is elected as RL by all FINs. The rest of them (4, 2, and 0) become pFINs.

Phase 1. Phase 1 aims to reestablish the connectivity between the RL and the pFINs by building a restoration tree. As specified, the RL sends an RTM on all of its functioning links. Each RTM message is updated by the network nodes, so it contains the path it has traversed. The path of the first RTM that reaches a given pFIN is used as the restoration path between the pFIN and the RL. In this example, the content of the RTMs arriving at each pFIN is:

After receiving the RTM, the pFIN sends a response (RTMR) to the RL, which follows the established restoration paths. The established restoration tree is confirmed by all nodes by exchanging BRT and BRTR messages between the RL and the pFINs, which concludes Phase 1. Each node that forwards a BRT becomes a FAN; in this example, these are nodes 6, 5, and 1.

Phase 2. Since there is only one failure in the network, no intersection notifications will be generated and Phase 2 will be skipped by the RL.

Phase 3. In Phase 3, the RL sends Routing Update Requests (RURs) to all pFINs. After receiving the RUR message, each pFIN issues an RUM for the RL, which contains all destinations that the pFIN can reach via links that are not part of the RT. Each node that forwards an RUM updates its routing table with the information carried by the RUM. Additionally, the first RUM received by any FAN will be updated with information about the destinations that the FAN can reach through its links that are not part of the RT. The content of the RUMs that will arrive at node 7 is:


The RL updates its routing table to reflect the new paths; in this example, all RUMs arrive at the RL on port 2, so the RL will associate destinations C0, C1, C2, and C4 with port 2 (originally they were assigned to port 1). At this point, the routing tables of the nodes will have the content shown in Table 2.

The RL then issues a RUM Response to each of the pFINs, which contains the list of destinations reachable by the RL through all ports except the one leading to the destination pFIN. Each node that forwards an RUMR updates its routing table with the information from the RUMR and adds to the RUMR the destinations that it can reach on all links except the ones on which the RUMR arrives and departs. The following RUMRs will arrive at the pFINs:

TABLE 1 Original Routing Tables


TABLE 2 Partially Updated Routing Tables after Step 3a of NetRec

TABLE 3 Final Routing Tables

When a pFIN receives its RUMR, it updates its routing table. The reconfiguration is completed by exchanging UCM and ROM messages between the RL and the pFINs, after which normal operation is resumed. The final content of the routing tables is presented in Table 3.

3.2 Simultaneous Node Failures
The operation of NetRec for simultaneous node failures is illustrated for the irregular topology shown in Fig. 1. The emphasis in this example is on the intersection detection in Phase 1 and the synchronization between the leaders in Phase 2 of NetRec. If we assume that, for each message, it takes the same time to traverse each link in the topology, then the following synchronization procedures will take place:

Phases 0 and 1. For each of the failures, the restoration tree will be constructed in parallel and will have the following structure:

- RT_8: node 9 is RL; the RT consists of the path (9-10-5).
- RT_7: node 6 is RL; the paths of the RT are (6-1) and (6-1-2-5).
- RT_3: node 4 is RL; the paths forming the RT are (4-2) and (4-2-1).

Node 5 is joint for RT_8 and RT_7, and nodes 2 and 1 are joint for RT_7 and RT_3. Following the above-mentioned assumption, the BRTRs will arrive at each of the joint nodes in the following order: at node 1, BRTR_7 before BRTR_3; at node 2, BRTR_3 before BRTR_7; and at node 5, BRTR_8 before BRTR_7. Thus, node 1 will notify node 4 about the intersection, but will not notify node 6. Similarly, node 2 will notify 6, but not 4, and node 5 will notify 6, but not 9.

Phase 2. In this phase, each of the RLs will proceed as follows:

- Node 9 has not been notified about intersections, so it will immediately enter Phase 3.


- Node 6 has been notified about intersections with 4 and 9, so it will send SRMs to node 4 (through joint node 2) and to node 9 (through joint node 5) and will stop to wait for responses from them.
- Node 4 has been notified about the intersection with node 6, so 4 will send an SRM to 6 through joint node 1 and will stop to wait for a response from it.

After a finite time, the RLs will receive these SRMs and will proceed as follows:

- When node 9 receives the SRM from node 6, it is already executing Phase 3 of NetRec, so it will complete Phase 3 and then send an SRMR to 6.
- Node 6 is in Phase 2, waiting for a response from node 9. Meanwhile, it will receive the SRM from node 4. Since 6 has already sent an SRM to 4, it will automatically take higher priority than 4, but has to wait for a response from 9 before proceeding to Phase 3.
- Node 4 is waiting in Phase 2 because it has sent an SRM to node 6. Thus, 4 will receive an SRM from 6 and will know that it has to wait for 6 to finish before proceeding to Phase 3.

Phase 3. Based on the priorities established above, Phase 3 will be executed by the RLs in the following order:

- Node 9 will reconfigure RT_8, then will send an SRMR to node 6 through joint node 5 and will exit NetRec.
- Node 6 has priority over node 4, so, when the SRMR from node 9 arrives, 6 will proceed to Phase 3, after which it will send an SRMR to 4.
- Finally, after node 4 receives the SRMR from node 6, it will execute Phase 3, after which the execution of NetRec completes.

In the general case, the assumption that it takes the same time for each message to traverse the network links does not hold, but it is not required for the operation of NetRec.

4 PROPERTIES OF NETREC

4.1 Complexity of NetRec
In this section, we examine the complexity of NetRec in terms of the number of messages exchanged to complete the algorithm. Let L be the number of links in the network, N be the number of simultaneous failures, n_F be the number of FINs for failure F, and P_F be the longest path between RL_F and the pFIN_F's.

Theorem 4.1. The complexity of NetRec for N simultaneous failures is O(N(L + n_max·P_max + N·P_max)), where P_max and n_max are the maximal of all P_F and n_F, respectively.

Proof. For all RTs, Phase 0 is executed without exchanging messages, so it has complexity O(0). Phase 1 is executed independently, in parallel for all RTs, by exchanging O(N(L + n_max·P_max)) messages. In the worst case, in Phase 2, all leaders will intersect with each other and will send SRM messages to each other, which is bounded by O(N^2·P_max). When all RTs intersect with each other, Phase 3 is executed independently for each RT, which results in O(N·n_max·P_max) messages. In addition, after each RT is reconfigured, SRMR messages are sent to the remaining leaders, which has complexity O(N^2·P_max). Thus, the complexity of Phase 3 is O(N·n_max·P_max + N^2·P_max). As a result, the total message complexity of NetRec for simultaneous failures is O(N(L + n_max·P_max + N·P_max)). □

Corollary 4.1. The complexity of NetRec for a single failure (i.e., N = 1) is O(L + n·P_max). Indeed, by substituting N = 1 and reducing the formula from Theorem 4.1, we obtain that the message complexity of NetRec for a single failure is O(L + n·P_max).
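As a purely hypothetical illustration of the bound, the helper below evaluates the dominant term N(L + n_max·P_max + N·P_max) of Theorem 4.1 for assumed parameter values; the numbers are ours and are not measurements from the paper.

def netrec_message_bound(num_failures, num_links, n_max, p_max):
    # Dominant term of Theorem 4.1: O(N (L + n_max * P_max + N * P_max)).
    N, L = num_failures, num_links
    return N * (L + n_max * p_max + N * p_max)

# Assumed values loosely inspired by the three-failure scenario of Fig. 1:
print(netrec_message_bound(num_failures=3, num_links=20, n_max=3, p_max=4))  # 132
# Single failure (Corollary 4.1 reduces to O(L + n * P_max)):
print(netrec_message_bound(num_failures=1, num_links=20, n_max=3, p_max=4))  # 36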

4.2 Termination
The following reliable message delivery properties are used for proving NetRec's termination:

Definition 4.1. If a nonbroadcast message is sent from source S to destination D, then it will be received once and only once by D.

Definition 4.2. Each message sent by explicit-path routing following a restoration path will be reliably delivered to its destination.

Definition 4.3. If two messages are sent with explicit-path routing from source S to destination D, then they will be received by D in the order in which they were sent by S.

Lemma 4.1. For a given faulty node F, all FINs will elect the same RL.

Proof. We prove by contradiction. Suppose that two FINs elect different RLs. Since the node with the highest ID among the FINs is elected as RL, these two FINs must have used different FIN sets. However, all FINs are two hops from each other through F and, by definition, each FIN knows its own ID and the IDs of all nodes that are two hops away from it. Thus, the FIN sets determined by the FINs cannot be different, which contradicts the supposition. □

Lemma 4.2. For a given faulty node F, all FINs will successfully complete Phase 1 of NetRec and will proceed to Phase 2 in finite time.

Proof. According to Lemma 4.1, all nonfaulty FINs will elect the same RL. If it is not faulty, then the RTM is guaranteed to be broadcast in Step 1a. Assumptions 2.2 and 2.1 ensure that the RTM will be reliably delivered to all nonfaulty pFINs. Thus, the pFINs will not wait forever to receive the RTM. If the elected RL is faulty, then it will be excluded from the FIN set and Phase 0 will be repeated. Each such iteration results either in the election of a nonfaulty RL or in the removal of one ID from the FIN set. Since the cardinality n_F of the FIN set is limited by the node degree of the faulty node F, a nonfaulty RL will be elected after a maximum of n_F iterations. Thus, all FINs will proceed to Step 1b in finite time.



According to Definitions 4.1 and 4.2, the RTMRs sent by the nonfaulty pFINs in Step 1b will be reliably delivered to the RL. If there are faulty pFINs, the RL's timeout will expire and it will exit the RTMR wait loop and remove the faulty pFINs from the pFIN set. Thus, all FINs will proceed to Step 1c. Steps 1a and 1b guarantee that the pFINs and the RL are not faulty; therefore, in Step 1c, no node will wait for a message which was not sent. According to Definitions 4.1 and 4.2, all generated messages will be reliably delivered. The algorithm for Step 1c for all FINs has no loops. Thus, all FINs will successfully complete Step 1c. □

Lemma 4.3. In the presence of multiple intersecting RTs, none of the intersecting RLs will remain forever in Phase 2.

Proof. The goal of Phase 2 is to ensure that, at any given time, only RLs with nonintersecting RTs will be executing Phase 3, in which the routing information is updated. In the cases of consecutive failures and simultaneous disjoint failures, this is always true. In these cases, Phase 2 will be skipped and the RLs will proceed to Phase 3 independently of each other. If there are simultaneous failures with intersecting restoration trees, then their RLs must establish such an order, which results in a sequence of temporally disjoint reconfigurations around single failures. For each two intersecting RTs, there is at least one joint node which detects the intersection. This guarantees that at least one of the RLs in each intersection will be notified about it. The temporal order is established by the intersecting RLs based on their node IDs: nodes with higher IDs have higher priority. All lower priority RLs will wait in Phase 2 until all higher priority RLs have completed Phase 3. Following the algorithm, after completing Phase 3, each RL notifies all lower-priority RLs, which allows the next leader in the temporal order to execute Phase 3. Thus, all leaders that were waiting in Phase 2 will eventually receive the required synchronization messages that allow them to proceed to Phase 3. □

Lemma 4.4. For a given failure, all FINs and FANs will successfully complete Phase 3 and will exit NetRec.

Proof. Since all faulty FINs have been eliminated after the execution of Phase 1, the required Phase 3 messages will be sent. Also, according to Definitions 4.1 and 4.2, all messages will be reliably delivered to their destinations. Therefore, none of the FINs will wait to receive a message indefinitely. As well, the RL-pFIN relation is based on a strict request-response model, so there are no message race conditions on the paths between them. Thus, all FINs and FANs will proceed through Phase 3 of NetRec until they reach the end of the algorithm. □

Theorem 4.2. All nodes will successfully complete NetRec in the presence of multiple failures, i.e., NetRec will terminate.

Proof. Based on Lemmas 4.1-4.4, it can be concluded that the RLs for each RT will proceed with all phases of NetRec and will generate the required reconfiguration messages, which will be reliably delivered to the pFINs. As a result, the pFINs and FANs in each RT will also proceed through all phases of NetRec until the algorithm terminates. □

4.3 Liveness
In this section, it is proven that NetRec will result in a reconfigured network.

Theorem 4.3. The execution of NetRec guarantees the reachability of all connected nodes.

Proof. As a result of any failure, the paths through the faulty node or link are broken, i.e., they contain faulty segments formed by the links between the FINs. By Assumption 2.2, the network is not partitioned, so nonfaulty physical paths exist between all connected nodes. The RTM broadcast in Phase 1 will reach all nodes in the network, including all nonfaulty FINs, i.e., NetRec finds alternative paths between the FINs. Moreover, in Phase 3, the routing tables of all affected nodes are modified to reflect the newly established paths and each faulty segment is replaced with a restoration path. Theorem 4.2 proves that NetRec will terminate for any combination of single and multiple failures by executing a sequence of safe reconfigurations around single failures or simultaneous disjoint failures. Thus, for each failure, NetRec builds a restoration tree, replaces the faulty segments with restoration paths, and reconfigures the routing tables of all nodes as necessary, which guarantees the reachability of all connected nodes. □

4.4 Safety
The objective of this section is to define and prove the safety property of the NetRec algorithm, i.e., avoidance of infinite loops and cyclic dependencies.

Theorem 4.4. NetRec does not create infinite loops or cyclic dependencies.

Proof. Cyclic dependencies between the RLs for multiple failures will not be created in Phase 1 because the RTs are built independently, in parallel for each failure. Since the relations between the RL and the pFINs are based on a request-response model, there are no cyclic dependencies between them. All loops in this phase will terminate either after receiving the expected messages or after a timeout expires, so they will not be infinite. Additionally, the execution of Phase 1 guarantees that only nonfaulty nodes will participate in the reconfiguration. Lemma 4.3 proves that no restoration leader will be blocked forever in Phase 2. As well, cyclic dependencies between the RLs cannot arise because they are resolved by always giving priority to the nodes with higher ID or to nodes that are already in Phase 3. In the presence of multiple failures, the RLs will enter Phase 3 in the priority order which was established in Phase 2, i.e., at any time, only RLs with disjoint restoration trees are permitted to concurrently execute Phase 3. Therefore, cyclic dependencies cannot be formed between the RLs. The RL-pFIN relations are based on a strict request-response model, so there are no cyclic dependencies between them. Since all possible faulty FINs have been isolated from the RT in Phase 1 and all reconfiguration messages are reliably delivered, all loops in Phase 3 will terminate after the corresponding messages are received.


Fig. 4. Distant and NetRec.
Fig. 5. Irregular topology.

Therefore, NetRec does not create infinite loops or cyclic dependencies. □

5 NETREC VALIDATION

For validation purposes, NetRec was integrated into Distant, a distributed testbed for verification of routing and reconfiguration algorithms, as shown in Fig. 4.

Description of the testbed. Distant is a distributed network testbed that can be used to validate routing and reconfiguration algorithms in arbitrary network topologies. Distant is implemented on a cluster of Compaq ProLiant servers with dual Intel Pentium III 1GHz processors, running Microsoft Windows 2000, that are connected through gigabit Compaq ServerNet II switches. In addition, the cluster nodes are linked through a 10Mb Ethernet network running TCP/IP. MPI 1.2 software, installed in the cluster, is used as the communication interface for Distant. Based on MPI, Distant provides the following features:

- Software routers that implement distributed hop-by-hop message routing, based on routing tables in each software router. The software routers also handle the failure detection by means of periodic "I'm alive" messages and reconfiguration in the presence of failures.
- Virtual point-to-point topologies of software routers. Direct message exchange is allowed only between software routers that are linked in the topology.

Distant can work with a generic or application-driven traffic pattern. In the first case, a generic random generation application is attached to each software router, which loads the network with randomly generated messages with random destinations. In the second case, a variety of real-life distributed applications (e.g., sorting, data and image compression, etc.) can be executed in Distant. Additionally, Distant can operate in several different modes: on a single cluster node, distributed on several cluster nodes with communication over ServerNet/VIA, and distributed on several cluster nodes with communication over Ethernet/TCP/IP.

NetRec was exhaustively validated in Distant for single, multiple consecutive, and multiple simultaneous node and link failures, with both generic and application-driven traffic, in a variety of regular and irregular virtual topologies. Validation was performed with all possible failure positions and, in all cases, NetRec correctly updated the routing tables of all failure-affected software routers.
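The paper does not publish Distant's source code; the sketch below is an assumption-laden illustration of how an arbitrary virtual point-to-point topology of software routers could be declared with the MPI 1.2 graph-topology facility that Distant relies on, here through mpi4py (Comm.Create_graph). The ring topology and script name are invented for the example.

# Run with, e.g.:  mpiexec -n 4 python virtual_topology.py   (assumes mpi4py is installed)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# A small ring as the virtual topology: router i is linked to (i-1) mod size and (i+1) mod size.
adjacency = [[(i - 1) % size, (i + 1) % size] for i in range(size)]

index, edges, total = [], [], 0   # MPI graph format: cumulative degrees, then flattened neighbor lists
for nbrs in adjacency:
    total += len(nbrs)
    index.append(total)
    edges.extend(nbrs)

graph = comm.Create_graph(index, edges, reorder=False)

# A software router may exchange messages only with its virtual-topology neighbors:
neighbors = graph.Get_neighbors(rank)
print(f"software router {rank}: virtual links to {neighbors}")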

5.1 Validation with Generic Traffic
The goal is to validate NetRec with a generic traffic pattern in the presence of simultaneous failures of node 3 and link 4-7, as shown in Fig. 5. As a result of the validation, complete execution traces of the message routing before the failures and after the reconfiguration have been collected. A fragment of such a trace, showing the reconfiguration procedures for this case, is given below:
...
Message sent from 104 to 107.
Message arrives at 107 from 104, path 107-7-4-104.
...
Message sent from 100 to 104.
Message arrives at 104 from 100, path 100-0-1-4-104.
...
Link between 4 and 7 fails!
Node 3 fails!
4 detects failure of 7!
0 detects failure of 3!
7 detects failure of 4!
5 detects failure of 3!
6 detects failure of 3!
4 sinks RTM from 7 for failure 4. Link failure (4-7), pFIN is updating info!
5 sinks RTM from 6 for failure 3.
0 sinks RTM from 6 for failure 3.
RL 7 sinks RTMR from 4 for failure 4. Link failure (7-4), RL is updating info!
RL 6 sinks RTMR from 5 for failure 3.
RL 6 sinks RTMR from 0 for failure 3.
4 sinks BRT from 7.
0 sinks BRT from 6.
5 sinks BRT from 6.
RL 7 sinks BRTR from 4.
RL 7 not notified, skips Phase 2, enters Phase 3.
RL 6 sinks BRTR from 0.
RL 6 sinks BRTR from 5, intersection with RL 7.
RL 6 sends SRM to 7.
RL 6 enters Phase 2.


4 sinked RUR from 7.
RL 7 sinks SRM from 6.
RL 7 sinks RUM from 4.
4 sinks RUMR from 7.
RL 7 sinks UCM from 4.
RL 7 completes reconfiguration for link failure (7-4).
RL 7 sends SRMR to 6.
FAN 5 completes reconfiguration for failure with RL 7.
RL 6 sinks SRMR from 7.
RL 6 exits Phase 2, starts Phase 3.
4 sinks ROM from 7.
pFIN 4 completes reconfiguration for failure 4.
4 is back online!
5 sinked RUR from 6.
0 sinked RUR from 6.
RL 6 sinks RUM from 0.
RL 6 sinks RUM from 5.
0 sinks RUMR from 6.
5 sinks RUMR from 6.
RL 6 sinks UCM from 5.
RL 6 sinks UCM from 0.
RL 6 completes reconfiguration for failure 3.
6 is back online!
FAN 7 completes reconfiguration for failure with RL 6.
7 is back online!
FAN 2 completes reconfiguration for failure with RL 6.
2 is back online!
0 sinked ROM from 6.
pFIN 0 completes reconfiguration for failure 3.
0 is back online!
5 sinked ROM from 6.
pFIN 5 completes reconfiguration for failure 3.
5 is back online!
...
Message sent from 104 to 107.
Message arrives at 107 from 104, path 107-7-5-4-104.
...
Message sent from 100 to 104.
Message arrives at 104 from 100, path 100-0-1-4-104.
...

The fragment shows the message generations, message arrivals, and the route taken by each message. In addition, it shows the sequence of NetRec messages applied to reconfigure the network. For example, from the trace, we can observe that, before the failure, the traffic from node 107 to node 104 was traversing path 107-7-4-104, but, after NetRec was executed, the messages are redirected through path 107-7-5-4-104. Thus, the broken paths that were going through the faulty link have been discovered and replaced with alternative paths. It is worth noting that the paths that are not affected by the failure are not modified by NetRec, i.e., the traffic from node 100 to node 104 passes through path 100-0-1-4-104 both before and after the reconfiguration.

5.2 Validation in Application Mode
To evaluate the effect of multiple failures on the performance of an application running in the topology shown in Fig. 5, we used a distributed quad-tree matrix compression algorithm [31]. It involves mapping the matrix data onto a quad-tree (QT) by recursively dividing the matrix into four equal submatrices. For our tests, we used a quad-tree composed of five application processes, with the following initial mapping of the application processes on the virtual point-to-point topology in Fig. 5: 107 was the QT root and 100-103 were the QT leaves, while 104-106 were initially spare. Validation results for simultaneous failures of nodes 1 and 6 are presented. A consequence of the failure of node 1 is the loss of application process 101. In all cases, the reconfiguration was successful and the loss of application process 101 was detected, after which its task was reassigned by the root to spare process 104.

The application performance before and after the failures is measured by compressing the same set of 2,000 matrices, ranging from incompressible (compressibility ratio 1) to 100 percent compressible (compressibility ratio 0) and measuring the execution time. Based on the experimental data, shown in Figs. 6, 7, 8, and 9, the following observations can be made for each of the presented cases:

Execution on two cluster nodes over ServerNet II. The virtual-to-physical node mapping is as follows: Software routers 0-3 and the corresponding application processes 100-103 were assigned to cluster node (CN) 1, while software routers 4-7 and processes 104-107 were assigned to CN 2. ServerNet II was used for communication between the cluster nodes. It can be seen that the total execution time is unaffected by the compressibility ratio of the data, as shown in Fig. 6. Thus, it can be concluded that the prevailing component in the execution time is the computation time. This conclusion is also supported by the fact that, after the failures, the compression times remain in the same range, even though the path between the application root 107 and the replacement process 104 is only one hop and is not over ServerNet II (104 and 107 are assigned to the same cluster node), compared to the path to the lost process 101, which was two hops and traversed the network (101 and 107 are in different cluster nodes).

Execution on two cluster nodes over Ethernet. The virtual-to-physical node mapping is the same as in the previous experiment, but Ethernet was used for communication between the cluster nodes. The execution time before the failures varies significantly, depending on the compressibility ratio, and is almost double for incompressible data. Based on this, it can be concluded that the communication overhead is much higher than the computation time because large amounts of information traverse the Ethernet network twice: from the root to the leaves and back. It is interesting to observe that, after the failures, the application performance is improved, which is because the path to the replacement process is shorter and does not traverse an Ethernet link: the original path was 107-7-4-1-101 and the link between 104 and 101 is over Ethernet, while the replacement path is 107-7-4-104 and there are no internode (Ethernet) links. Additionally, even though the paths between the application root 107 and the rest of the


Fig. 6. Execution on two nodes (ServerNet): (a) no failures, (b) after failures of 1 and 6.

Fig. 7. Execution on two nodes (Ethernet): (a) no failures, (b) after failures of 1 and 6.

Fig. 8. Execution on three nodes (ServerNet): (a) no failures, (b) after failures of 1 and 6.

application processes have become longer in terms of hops, they still traverse the Ethernet network only once. For example, the path between 107 and 102 before the failures was 107-7-6-2-102, where the link 6-2 is over Ethernet, and the path after the failures is 107-7-5-3-0-2-102, but, again, only one link (5-3) is internode.

Execution on three cluster nodes over ServerNet II. The virtual-to-physical mapping is: software routers 0 and 1 with their application processes on CN 1, 2-4 on CN 2, and 5-7 on CN 3. As in the previous ServerNet II case, the total execution time depends mainly on the computation time,

which results in compression times that do not depend on the compressibility ratio and the failures.

Execution on three cluster nodes over Ethernet. The application mapping is as in the previous case. The same conclusions can be made for the dependence of the compression time on the data compressibility as in the first Ethernet case above: the communication overhead is the main component of the execution time. However, in this case, the application performance decreases after the failures. The explanation is that the new paths between 107 and processes 100 and 102 traverse the Ethernet network several times, as opposed to only once before the failures, which


Fig. 9. Execution on three nodes (Ethernet): (a) no failures, (b) after failures of 1 and 6.

incurs significant additional communication overhead. For example, in the original path 107-7-6-2-102, only link 6-2 is internode, while, in the replacement path 107-7-5-3-0-2-102, links 5-3, 3-0, and 0-2 are over Ethernet. Additionally, even though the path to the replacement process 104 is shorter in terms of hops, it still traverses one Ethernet link (link 7-4 is internode) because of the virtual-to-physical mapping.

It is worth noting that the mapping of the application processes onto the virtual network topology and of the virtual topology onto the physical cluster nodes has a significant impact on the application performance. However, our goal was to validate NetRec, so the optimality of the mapping and task reassignment is not the focus of this paper.

6 CONCLUSION

This paper presents NetRec, a novel algorithm for reconfiguring an arbitrary network when multiple permanent node and link failures occur. NetRec is applicable to all types of point-to-point computer networks, including wormhole-based System Area Networks. The algorithm relies only on local information about the network. As a result of applying NetRec, the faulty nodes and links are isolated from the network in an application-transparent way. The complexity analysis of NetRec, in terms of the number of messages exchanged while reconfiguring the network, is presented. The correctness, termination, liveness, and safety properties of the algorithm are also proven. NetRec has been implemented and validated in the distributed network testbed Distant, which is based on the MPI 1.2 features for building arbitrary virtual topologies. The validation was accomplished in a generic mode and with a real-life application. The overhead introduced during reconfiguration of the network is experimentally measured for a quad-tree data compression algorithm for different compressibility ratios and sizes of the matrices. Based on the presented theoretical and experimental results, one can conclude that NetRec is an efficient reconfiguration algorithm for arbitrary topologies. It can be the basis for implementing seamless task execution in computer clusters with arbitrary topologies.

ACKNOWLEDGMENTS

This work was supported by the US National Science Foundation under grant CCR-0004515. The authors would like to thank the anonymous reviewers and the associate editor, whose detailed and constructive comments significantly contributed to the improvement of the presentation of this paper.

REFERENCES
[1] D. Garcia and W. Watson, "ServerNet II," Proc. Parallel Computer Routing and Comm. Workshop, pp. 119-136, June 1997.
[2] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su, "Myrinet: A Gigabit-per-Second Local Area Network," IEEE Micro, vol. 15, no. 1, pp. 29-36, Feb. 1995.
[3] M. Schroeder, A. Birrell, M. Burrows, H. Murray, R. Needham, and T. Rodeheffer, "Autonet: A High-Speed, Self-Configuring Local Area Network Using Point-to-Point Links," IEEE J. Selected Areas in Comm., vol. 9, no. 8, pp. 1318-1335, Oct. 1991.
[4] D. Oppenheimer, A. Brown, J. Beck, D. Hettena, J. Kurode, N. Treuhaft, D.A. Patterson, and K. Yelick, "ROC-1: Hardware Support for Recovery-Oriented Computing," IEEE Trans. Computers, special issue on fault-tolerant embedded systems, D. Avresky, B.W. Johnson, and F. Lombardi, eds., vol. 51, no. 2, pp. 100-107, Feb. 2002.
[5] R. Horst, "Tnet: A Reliable System Area Network," IEEE Micro, vol. 15, no. 1, pp. 37-45, Feb. 1995.
[6] W. Baker, R. Horst, D. Sonnier, and W. Watson, "A Flexible ServerNet-Based Fault-Tolerant Architecture," Proc. 25th Int'l Symp. Fault-Tolerant Computing, pp. 2-11, June 1995.
[7] J. Duato, R. Casado, F. Quiles, and J. Sanchez, "Dynamic Reconfiguration in High Speed Local Area Networks," Dependable Network Computing, D. Avresky, ed., Kluwer Academic, 2000.
[8] C. Fang and T. Szymanski, "An Analysis of Deflection Routing in Multi-Dimensional Regular Mesh Networks," Proc. IEEE INFOCOM '91, Apr. 1991.
[9] G.D. Pifarre, L. Gravano, S.A. Felperin, and J.L.C. Sanz, "Fully Adaptive Minimal Deadlock-Free Packet Routing in Hypercubes, Meshes and Other Networks: Algorithms and Simulations," IEEE Trans. Parallel and Distributed Systems, vol. 5, no. 3, pp. 247-263, Mar. 1994.
[10] P.E. Berman, L. Gravano, G.D. Pifarre, and J.L.C. Sanz, "Adaptive Deadlock- and Livelock-Free Routing with All Minimal Paths in Torus Networks," Proc. Fourth ACM Symp. Parallel Algorithms and Architectures, June 1992.
[11] P.T. Gaughan and S. Yalamanchili, "Adaptive Routing Protocols for Hypercube Interconnection Networks," Computer, vol. 26, no. 5, pp. 12-23, May 1993.
[12] D. Avresky, J. Acosta, V. Shurbanov, and Z. McAffrey, "Adaptive Minimal-Path Routing in 2-Dimensional Torus ServerNet SAN," Dependable Network Computing, D. Avresky, ed., Kluwer Academic, 2000.


[13] D. Avresky et al., "Embedding and Reconfiguration of Spanning Trees in Faulty Hypercubes," IEEE Trans. Parallel and Distributed Systems, vol. 10, no. 3, pp. 211-222, Mar. 1999.
[14] D. Avresky and C. Cunningham, "Single Source Fault-Tolerant Broadcasting for Two-Dimensional Meshes without Virtual Channels," Microprocessors and Microsystems, vol. 21, pp. 175-182, 1997.
[15] D. Avresky, C. Cunningham, and H. Ravichandran, "Fault-Tolerant Routing for Wormhole-Routed Two-Dimensional Meshes," Int'l J. Computer Systems Science & Eng., vol. 14, no. 6, Nov. 1999.
[16] C. Cunningham and D. Avresky, "Fault-Tolerant Adaptive Routing for Two-Dimensional Meshes," Proc. IEEE First Int'l Symp. High Performance Computer Architecture, pp. 122-131, Jan. 1995.
[17] W. Qiao and L.M. Ni, "Adaptive Routing in Irregular Networks Using Cut-Through Switches," Proc. 1996 Int'l Conf. Parallel Processing, Aug. 1996.
[18] S. Konstantinidou and L. Snyder, "The Chaos Router: A Practical Application of Randomization in Network Routing," Proc. Second Ann. Symp. Parallel Algorithms and Architectures (SPAA 1990), pp. 21-30, 1990.
[19] X. Lin, P.K. McKinley, and L.M. Ni, "The Message Flow Model for Routing in Wormhole-Routed Networks," Proc. 1993 Int'l Conf. Parallel Processing, Aug. 1993.
[20] W.J. Dally and C.L. Seitz, "Deadlock-Free Message Routing in Multi-Processor Interconnection Networks," IEEE Trans. Computers, vol. 36, no. 5, pp. 547-553, May 1987.
[21] D.H. Linder and J.C. Harden, "An Adaptive Deadlock and Fault Tolerant Wormhole Routing Strategy for K-Ary N-Cubes," IEEE Trans. Computers, vol. 40, no. 1, pp. 2-12, Jan. 1991.
[22] F. Silla and J. Duato, "On the Use of Virtual Channels in Networks of Workstations with Irregular Topology," IEEE Trans. Parallel and Distributed Systems, vol. 11, no. 8, pp. 813-828, Aug. 2000.
[23] F. Silla and J. Duato, "High-Performance Routing in Networks of Workstations with Irregular Topology," IEEE Trans. Parallel and Distributed Systems, vol. 11, no. 7, pp. 699-719, July 2000.
[24] R. Casado, A. Bermudez, J. Duato, F.J. Quiles, and J.L. Sanchez, "A Protocol for Deadlock-Free Dynamic Reconfiguration in High-Speed Local Area Networks," IEEE Trans. Parallel and Distributed Systems, special issue on dependable network computing, D. Avresky, J. Bruck, and D. Culler, eds., vol. 12, no. 2, pp. 115-132, Feb. 2001.
[25] T.M. Pinkston, R. Pang, and J. Duato, "Deadlock-Free Dynamic Reconfiguration Schemes for Increased Network Dependability," IEEE Trans. Parallel and Distributed Systems, vol. 14, no. 8, pp. 780-794, Aug. 2003.
[26] D. Avresky, N. Natchev, and V. Shurbanov, "Dynamic Reconfiguration in High-Speed Computer Networks," Proc. IEEE Symp. Cluster Computing, Oct. 2001.
[27] D. Dolev, R. Friedman, I. Keidar, and D. Malkhi, "Failure Detectors in Omission Failure Environments," Proc. 16th Symp. Principles of Distributed Computing (PODC), 1997.
[28] T. Chandra and S. Toueg, "Unreliable Failure Detectors for Reliable Distributed Systems," J. ACM, vol. 43, no. 1, pp. 225-267, Mar. 1996.
[29] N. Oh, S. Mitra, and E. McCluskey, "ED4I: Error Detection by Diverse Data and Duplicated Instructions," IEEE Trans. Computers, special issue on fault-tolerant embedded systems, D. Avresky, B.W. Johnson, and F. Lombardi, eds., vol. 51, no. 2, pp. 180-199, Feb. 2002.
[30] N. Lynch, Distributed Algorithms. Morgan Kaufmann, 1996.
[31] H. Samet, Design and Analysis of Spatial Data Structures, pp. 2-40. Addison-Wesley, 1990.

Dimiter Avresky's current research interests include network computing, high-speed networks and routing, trustworthy network computing, performance analysis and modeling, fault-tolerant computing, embedded fault-tolerant systems, and verification of protocols. He has published more than 90 papers in refereed journals and conferences. He was a guest coeditor for the IEEE Transactions on Parallel and Distributed Systems special issue on dependable network computing (February 2001), the IEEE Transactions on Computers special issue on fault-tolerant embedded systems (February 2002), the IEEE Micro special issues on embedded fault-tolerant systems (September/October 2001 and September/October 1998), and the Journal of Supercomputing special issue on embedded fault-tolerant systems (May 2000). He is a founder and served as a program chair for three IEEE International Symposiums on Network Computing and Applications (NCA01-NCA03). He founded and served as a program chair for nine IEEE Workshops on Fault-Tolerant Parallel and Distributed Systems (FTPDS01-FTPDS09), held in conjunction with IEEE IPDPS, and for three IEEE Workshops on Embedded Fault-Tolerant Systems (EFTS01-EFTS03). He has also served as a program committee member for numerous IEEE conferences and workshops. He edited and coauthored four books: Dependable Network Computing (Kluwer Academic, 2000), Fault-Tolerant Parallel and Distributed Systems (Kluwer Academic, 1997), Fault-Tolerant Parallel and Distributed Systems (IEEE CS Press, 1995), and Hardware and Software Fault Tolerance in Parallel Computing Systems (Simon & Schuster International, 1992). He has been a consultant to several companies, including Bell Labs, Tandem, Compaq, and Hewlett-Packard. He is a senior member of the IEEE and a member of the IEEE Computer Society.

Natcho Natchev received the MSc degree in electronics engineering from the Technical University of Sofia, Bulgaria, in 1993. He is currently working toward the PhD degree in computer engineering at Northeastern University, Boston. He has developed several software solutions for simulation, verification, and performance analysis of high-speed switched networks. His areas of interest include reliable computing, distributed computing, and network performance analysis.

