
Routing in Multi-hop Packet Switching Networks:

Gbps Challenge

Cesur Baransel Wlodek Dobosiewicz Pawel Gburzynski

The authors are with the Department of Computing Science, University of Alberta, Edmonton, Alberta, Canada T6G 2H1. E-mail: cesur,dobo,pawel@cs.ualberta.ca.

Abstract

The paper is a survey of networking solutions that have been proposed for high-speed packet-switched applications. Using these solutions as examples, we identify the specific problems resulting from very high transmission rates and explain how these problems influence the design of high-speed networks and protocols. We conclude that the solutions based on deflection routing are the most promising ones and we suggest a number of directions for their evolution.

1 Introduction

Not so long ago, computer networks with high transmission rates (e.g., several Mb/s) were naturally confined to local domains. Although such (and higher) transmission rates were available in telephony on long distances, they were used on a point-to-point basis. Concepts of highly-connected fast networks spanning geographical areas larger than the acreage typically covered by a single institution are relatively new and, besides the emerging ATM technology, there are no standard commercially available solutions that can be recommended for such projects.

Most of the legacy from local-area networking is not easily adaptable to larger-scale networks and/or networks with transmission rates substantially higher than several Mb/s. Consider FDDI as an example. The network operates at 100 Mb/s and is intended to provide campus-area service. The formula expressing the maximum effective throughput of FDDI can be roughly written as follows:

\[ T = \frac{\mathit{THT}}{\mathit{THT} + L} \]

where THT is the maximum token-holding time during one rotation of the token, and L is the propagation time across the ring. One obvious property of T is that it doesn't grow as more stations are added to the network. Consequently, the more stations the network supports, the smaller the fraction of T that is allocated to one station. In fact, FDDI assumes that at most one pair of stations can communicate at any given moment. Besides, due to the bandwidth allocation policy of FDDI, more stations mean larger access delays. If the network is expanded geographically (even without increasing the number of stations), L will grow, reducing the effective throughput in a linear fashion. The same phenomenon will occur if somebody tries to extrapolate the FDDI concept onto transmission rates substantially higher than 100 Mb/s, e.g., into the Gb/s range. Thus, we have to conclude that the network doesn't scale up very well: its principle of operation has inherent limitations which restrict its applicability to a relatively narrow range of transmission rates, geographical areas, and populations of users.

All unidimensional networks, e.g., buses, rings, and stars, are bound to suffer from poor scalability to the increasing number of users. A network whose all transmission resources (or a fixed large fraction of these resources) must be reserved for every single transfer cannot cater to a large population of users. Most painfully, it cannot take advantage of the localized communities of interest which naturally occur in any large population (footnote 1). Besides, the need to negotiate medium access across the entire network results in a throughput deterioration when the network diameter (footnote 2) is large. Networks that avoid this problem, e.g., Metaring [CiO93], suffer from poor fairness and/or starvation potential. Various protocol add-ons aimed at alleviating these problems are either partially successful (e.g., they exhibit slow responsiveness to dynamic unfairness patterns) or they tend to sacrifice throughput to achieve their objective.

Footnote 1: Attempts to accommodate communities of interest in some bus networks, e.g., DQDB [PPP92], have been only partially successful.

Footnote 2: Expressed in the normalized way, as the number of bits separating two most distant stations in the network [Kle75].

The moral from the above observations is that the future of high-speed networking, at least beyond the
local-area scale, lies with meshed networks. A two-dimensional (planar) mesh network with N stations is potentially able to achieve a global throughput proportional to $\sqrt{N}$. By exploring other dimensions, this figure can be asymptotically brought as close to N as required. Mesh architectures also tend to be organized according to the geographical distribution of the interconnected nodes; thus, they may naturally take advantage of the communities of interest to further improve their performance.

The existence of multiple paths between nodes complicates the communication protocols by introducing packet switching, the very issue that the unidimensional networks were meant to avoid. General packet-switching and flow-control techniques traditionally employed in slow long-haul networks are usually inapplicable to gigabit networking. To take full advantage of the high transmission rate of its channels, a gigabit packet-switching network cannot spend too much time on making sophisticated routing decisions. Also, it should avoid buffering transient packets at intermediate nodes for extensive periods of time. These postulates stem not only from the natural need to avoid unnecessary delays, which are magnified by the high transmission rates, but also from the need to make these delays predictable. Modern networks are expected to handle traffic patterns of various kinds, and delay-sensitive or delay-variation-sensitive patterns will constitute a substantial part of their load.

In this paper, we conduct a survey of packet-switching protocols applicable to meshed networks operating in the Gb/s range. By a packet-switching protocol we mean the network-specific portion of the third OSI layer (i.e., the network layer) of the protocol stack. The full set of packet-switching rules determines how the network organizes reliable packet delivery between a pair of communicating switches, possibly located in distant geographical regions of the communication subnet (footnote 3). One part of a packet-switching protocol (according to our definition) is the routing scheme, i.e., the set of rules that assign incoming packets to output links. In general, we can talk about the following three components of the communication subnet which are relevant from our point of view:

- the routing protocol;

- the congestion-control mechanisms that can be effectively incorporated into the routing protocol;

- the network topology.

Footnote 3: According to the OSI terminology.

Clearly, these components are not independent; they are closely related to each other and together offer a single functionality. Reflecting this relationship, our paper is organized as follows. We start by discussing routing protocols and congestion-control mechanisms that are employed in contemporary packet-switched networks, not necessarily in networks operating at very high transmission rates. Then, after providing some basic definitions related to the topology component, we investigate the challenges posed by the Gb/s transmission rates. Section 7 is devoted to case studies which cover a number of contemporary design proposals. Finally, in section 8, we present conclusions and give suggestions for further research.

2 Routing protocols

In a multi-hop packet-switching network, the function of the routing algorithm is to guide the packets through the communication subnet to their proper destinations. Different taxonomies of routing algorithms are possible, e.g., static vs. adaptive or centralized vs. distributed. Here we prefer to put them into two groups, namely, table-based routing and self routing (footnote 4). The first class covers most of the traditional approaches which have been applied to many slow networks, including shortest path routing and optimal routing. Aside from other problems, such as convergence delays proportional to the network diameter and susceptibility to oscillations, these algorithms are computationally expensive and require a substantial amount of bookkeeping and periodic transmission of status information among the nodes. In the case of self routing, the routing decision is made solely on the basis of some information extracted from the packet's header, which is typically the destination address. Most of the multiprocessor interconnection networks use this scheme (e.g., shuffle-exchange networks, hypercubes, data manipulator networks, Benes networks, and Clos networks). Another well-known example that can be included in this category is flooding. This classification groups the routing algorithms according to the complexity of the routing decisions and, consequently, to the speed at which packet switching can be carried out.

Footnote 4: Also known as header routing.

Another important factor affecting the switching speed is the organization of buffers for storing transient packets. When there are no buffers at all, purely photonic switching becomes feasible, and E/O and O/E (electronic-to-optic and optic-to-electronic) conversions can be avoided. However, the photonic
switching technology is still in its infancy and for the time being is not a practical alternative to its electronic counterpart; see [JaM93].

2.1 Table-based routing

In this category of routing schemes, upon a packet arrival at an intermediate node, the node consults a table to select the outgoing link on which the packet is to be forwarded. Although the location of the routing tables, the way they are maintained, and the information contained within them may differ from one implementation to another, some common characteristics are shared by all solutions that fit into this class:

1. In most applications, the routing table contains an entry for every destination, indicating the output link appropriate for a packet addressed to that destination. Therefore, the table size increases with the network size and can be large for a network with many nodes.

2. For practical reasons (e.g., to cope with congestion and to bypass faulty links or nodes), entries in the routing tables need to be updated. Therefore, some network capacity must be allocated to the extra traffic that disseminates status information, reducing the capacity available to the users.

The problem of large routing tables in networks consisting of a huge number of stations can be alleviated by introducing domains, each domain representing a cluster of closely-located stations which appear (almost) equally distant from a sufficiently remote location. From the point of view of such a location, it may be appropriate to route all packets addressed to the domain in the same way, effectively treating the domain as if it were a single station. This approach requires a hierarchical structure of the destination address. Otherwise, lookup tables are needed to identify the domains, which may reduce or even nullify the potential savings.

2.1.1 Shortest-path routing

The basic premise of a routing scheme from this class is to have at every switch a unique mapping of the destinations to the output links. Given a destination address extracted from the header of an incoming packet, the switch selects the output link that offers the "shortest path" to the destination. The notion of length may be static, i.e., it may reflect the propagation distance, the number of hops (intermediate switches), and the nominal link capacities, but it may also account for the perceived congestion level of the links. In the latter case, the shortest paths are determined dynamically and the routing algorithm can adapt itself to variable traffic conditions.

Regardless of the criteria determining the path length used in selecting the output link suitable for a given destination, all shortest-path schemes are essentially based on global topological knowledge of the network. This knowledge is represented by the list of all nodes and their interconnections (the network graph) with a cost assigned to every link [Sch87]. One can see both centralized (as in TYMNET) and distributed (as in ARPANET) techniques of representing this knowledge; however, every node must have access to its locally and momentarily relevant portion to be able to make routing decisions. TYMNET uses virtual circuits, and the shortest path calculations are performed by a special node called the supervisor. The supervisor also decides upon the path to be used by a virtual circuit. The intermediate nodes are notified about the path by a needle packet that travels from the source to the destination, threading the virtual circuit along its way, with the data packets trailing behind. The shortest-path calculation algorithm used by the supervisor is a modified version of Floyd's algorithm [Sch87].

In contrast, ARPANET employs a distributed approach in which every node maintains its own database and carries out the shortest path calculations taking itself as the source. The original algorithm, based on the Bellman-Ford method, was implemented in 1969 (footnote 5). It has been modified twice since then, in 1979 and 1987, due to problems caused by oscillations. The latest modification was warranted by the increased traffic load which once again led to severe oscillations. The latest algorithm is still prone to oscillations, but not nearly as much as the first one [BeG92]. The details of these algorithms can be found in [CLR90].

Footnote 5: In that version, the nodes exchanged their estimated shortest distances to every destination every 625 msec.

The common drawback of all shortest-path algorithms is the use of only one path per source-destination pair and their poor adaptability to abrupt traffic shifts, which is further limited by their inherent susceptibility to oscillations [BeG92].

2.1.2 Optimal routing

Optimal routing is based on the theory of optimal multi-commodity flows. Assume that $Z_{ij}(r)$ is a function that gives the cost of transmitting data at rate r (which may be viewed as the percentage of the link utilization) through link ij. Now, the routing problem
can be viewed as an optimization problem: the routing decisions should minimize the cost of resolving the offered load (footnote 6). The most commonly used cost functions are related to link capacities and the amount of traffic carried by each link, which is viewed as a flow.

Footnote 6: Provided that the cost function is sufficiently differentiable, it can be expanded as a Taylor series. If the first derivatives exist, then at local minima the Jacobian gradient vector has all elements zero. If the second derivatives exist, the Hessian is positive definite at the minimum. For convex functions the local minima are also global. The gradient methods are based on the Taylor series expansion. Optimization methods which use only the Jacobian gradient vector are termed first-order methods. If the optimization method utilizes second derivatives as well, it is called a second-order method. The steepest descent method uses the Jacobian gradient to determine a suitable direction of movement and is the fundamental first-order method. All in all, the appropriate choice of the cost function greatly simplifies the optimization process. For details, see [AdD74, Tah82, PSU88].

The basic goal, i.e., optimal routing, is not always attainable solely by optimizing the average levels of link traffic. Theoretically, there exist more effective alternatives (e.g., ones that take queue lengths into consideration as well), but they are impractical due to the overhead and large delays involved in the exchange of the queue length information among the nodes [BeG87].

For example, optimal routing in the CODEX network is based on the following cost function:

\[ Z_{ij}(F_{ij}) = \frac{F_{ij}}{C_{ij} - F_{ij}} + d_{ij} F_{ij} \tag{1} \]

where $C_{ij}$ is the link capacity, $F_{ij}$ is the data rate of the link, and $d_{ij}$ is the processing and propagation delay. CODEX uses virtual circuits for user traffic and datagrams for its own system messages. Every node monitors some parameters of its adjacent links and periodically broadcasts them to all other nodes. The above formula applies to the case when all links are of the same priority. For multiple priorities and other details see [BeG87] and the references in that book.

2.2 Self routing

In this class we put all routing techniques that either avoid routing tables completely or use static, possibly incomplete, routing tables which are seldom (or never) updated during the normal operation of the network. With self routing, a switch accepting an incoming packet is able to determine its fate locally without consulting the network's database in its centralized or distributed form. The price paid for the simplicity of the routing algorithm is its suboptimal character. The gain is in the low cost of routing, the simplicity of the switch, and the absence of administrative traffic in the network.

2.2.1 Networks with regular topologies

If the network forms a regular grid with a simple repetitive structure, then every switch may be able to de facto "know" the configuration of the entire network without resorting to a data structure describing the individual locations of all stations. Networks with highly regular topologies commonly occur as interconnection backplanes for multiprocessor systems (footnote 7). Interconnection networks can be constructed from a single stage of switches or from multiple stages of switches. In a single-stage network, packets may have to pass through the switches several times before reaching their destinations. Therefore, single-stage networks are sometimes called recirculating networks, and the subsequent passes of the same packet through the stage of switches are called recirculations. The number of recirculations depends on the connectivity. Generally, the higher the connectivity, the smaller the number of recirculations. In a multi-stage network, typically one pass through the multiple stages of switches is sufficient to deliver a packet to its destination [SiH88]. A survey of switching techniques in high-speed interconnection networks can be found in [OSM90, Hui90].

Footnote 7: Many internal designs for ATM switches are based on the interconnection paradigm, i.e., the switch is treated as a network of specialized processors.

On a larger geographical scale, the regularity in the network topology is taken advantage of in the Manhattan-street network, which is a single-stage solution (footnote 8). Although originally proposed to cover metropolitan areas, Manhattan-street networks can also be used as interconnection networks.

Footnote 8: In a congested Manhattan-street network, a packet may visit the same switch several times; however, multiple visits at the same switch are never necessary for a successful packet delivery.

Interconnection networks have been studied extensively in the literature, mostly owing to their applications in distributed computing. Several books are available on this subject, e.g., [Bae80, HwB84, Sto87, Sie90], as well as survey papers [Fen81, ReG87, YaA87, SiH88]. Manhattan-street networks were introduced by Maxemchuk [Max85] and since then they have been extensively studied by Maxemchuk [Max87, Max89, Max90] and other authors [BoC87, ChA90]. In section 7, we discuss hypercubes, shuffle networks, and Manhattan-street networks.
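As an illustration of how a switch in a regular topology can pick an output link from the destination address alone, the following minimal sketch applies dimension-order (e-cube) routing to a binary hypercube. This particular rule is not prescribed by the text above; it is one well-known self-routing scheme, chosen here as an example. The function names `next_hop` and `route` are hypothetical, and nodes are assumed to be numbered so that neighbors differ in exactly one address bit.

```python
from typing import List, Optional


def next_hop(current: int, destination: int) -> Optional[int]:
    """Return the neighbor to forward to, or None if the packet has arrived.

    Assumption: binary hypercube, neighbors differ in exactly one address bit.
    No routing table is consulted; only the two addresses are used.
    """
    diff = current ^ destination      # bits in which the addresses differ
    if diff == 0:
        return None                   # already at the destination
    lowest = diff & -diff             # isolate the lowest-order differing bit
    return current ^ lowest           # cross that dimension


def route(source: int, destination: int) -> List[int]:
    """Follow next_hop() until the destination is reached; return all visited nodes."""
    path = [source]
    node = source
    while node != destination:
        node = next_hop(node, destination)
        path.append(node)
    return path


if __name__ == "__main__":
    # 3-dimensional hypercube: route from node 000 to node 101
    print([format(n, "03b") for n in route(0b000, 0b101)])
```

In this sketch the number of hops equals the number of differing address bits, so in the absence of faults and congestion the packet follows a shortest path while the switch keeps no per-destination state.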

2.2.2 Flooding networks

Another simplistic approach to routing is flooding which, in its purest sense, means that an incoming packet is forwarded on every outgoing link except the one it arrived on [BeG87]. The outstanding qualities of flooding can be summarized as follows:

1. The approach is highly robust in case of link failures. As long as the network graph is not disconnected, packets always make it to their destinations. If the network is richly connected, flooding makes excellent use of alternative routes.

2. Error recovery at the destination is simplified by the availability of extra copies of the same packet.

3. No routing tables (or other data structures representing the network configuration) are required. Thus, network modifications can be made on a live network.

4. The scheme is suitable for all topologies, possibly very irregular ones. In consequence, the network is easily expandable.

5. Flooding automatically chooses the shortest path (since it chooses every possible path in parallel).

6. It is simple to implement and introduces less processing overhead than any other routing scheme.

The most important weakness of flooding is that packets may loop and, as a result, unlimited numbers of copies of a single packet can crop up in the network. Therefore, some countermeasures to choke this process are necessary for the approach to be useful. In general, flooding is considered to be more useful in broadcasting than in one-to-one communication. Even in networks based on sophisticated routing methods, flooding is occasionally used as a simple broadcasting technique, e.g., to disseminate various components of the network database among individual stations. ARPANET uses flooding to broadcast periodic status information to the nodes. In section 7, we will discuss some flooding-based designs.

3 Congestion control

Congestion is the network state where, because of mismanagement (e.g., improper access and routing), excessive requests, or faults, the demand for resources exceeds their availability [KAS91]. When this happens, the queues at bottleneck nodes grow indefinitely and eventually exceed the available buffer space. Consequently, some packets will have to be discarded and later retransmitted, thereby wasting communication resources and feeding the congestion. It is thus necessary to prevent excess traffic from entering the network.

A special (local) case of congestion is when, due to a speed disparity and/or temporary unavailability of resources, one receiver cannot accept the incoming flow. Should this happen, the sender must be made aware of the situation as soon as possible and either adjust its speed or abstain from further transmissions which are bound to be rejected.

Congestion control is a dynamic problem and cannot be solved with static mechanisms alone. It is also a difficult problem to solve due to the following requirements that must be fulfilled by a good solution [Jai90]:

1. The scheme must have a low overhead and should not offer new traffic to the network during congestion.

2. The scheme must be fair so that during the congestion the available resources are allocated fairly (footnote 9).

3. The method must be responsive. Due to the highly dynamic nature of the network, the resource availability profile changes very rapidly. The congestion-control procedure should be agile enough so that the demand curve can follow the capacity curve very closely.

4. The procedure must be robust so that it can function effectively under unfavorable conditions (e.g., poor availability of network resources at the time of congestion).

5. The scheme must be socially optimal. That is, it should optimize the performance of the entire network, as opposed to considering each user in isolation.

Footnote 9: It should be pointed out that although several formal and precise definitions of fairness can be found in the literature, none in particular is widely inferred from the general term.

No single classification of congestion-control techniques will satisfy everybody, as the criteria that can be used for classification are often orthogonal and reflect the point of view of the researcher. One can naturally consider the following attributes of a congestion-control scheme:

1. Preventive vs. reactive character of the method. Preventive schemes try to avoid congestion and reactive ones try to do something about it once it occurs.

2. The OSI layer in which the scheme is implemented. For example, schemes operating in the transport layer involve the end-points of a data path and are global in nature, whereas data-link schemes take care of local congestion, e.g., resulting from incompatible transmission rates of two immediate neighbors.

3. Feedback-based vs. feedback-free character of the scheme. Generally, the attractiveness of feedback-based techniques decreases with the increasing transmission rate of the network and/or its geographical size. Feedback-free schemes are usually rate-based, i.e., they try to allocate some portion of the network's global rate to every data path and confine each sender to this portion.

4. Guaranteed-delivery schemes vs. schemes that admit packet loss during congestion.

3.1 Window-based schemes

Consider the class of window-based schemes [Jai90] which require the recipient to adjust the window size (footnote 10) of the sender by sending feedback signals. According to the above classification, window-based control techniques are reactive in nature and based on feedback. However, they can be implemented in any of the three relevant OSI layers (footnote 11), and can be loss-less (if the amount of reserved buffer space accounts for the maximum feedback delay) or can occasionally lose packets and force their retransmission. Window-based schemes have been used in a number of slow networks: ARPANET, TYMNET, SNA, and CODEX, to name a few. In networks that are not very fast, this approach is particularly popular and effective for preventing local congestion (in the sense mentioned above), i.e., for matching the source speed to the processing speed of the destination in a point-to-point scenario (e.g., in the data-link layer).

Footnote 10: The number of packets that can be outstanding in the network at a time.

Footnote 11: I.e., data-link, network, and transport.

The applicability of window-based flow-control schemes to high-speed networks is addressed in a number of references [Kle92, BCS90]. The problems associated with this approach mostly stem from the large normalized propagation delays across the network, which render the feedback information obsolete and useless. Window-based schemes are also very slow to adapt to changing load patterns and they are only effective for congestion scenarios that last for several round-trip delays.

3.2 Acceptance-level schemes

In high-speed networks, flow-control mechanisms have to be more preventive than reactive because, to react, the involved parties must exchange status information across the large normalized diameter of the network. Most of the contemporary designs or proposals (footnote 12) prevent congestion by exercising flow control at two levels. First, at the circuit-setup level (call-acceptance level, according to the ATM terminology) it is checked whether the new data stream can be accommodated within the network considering its present load. According to our classification, such schemes operate in the transport layer and their primary role is to prevent excess traffic from entering the network, thus avoiding long-term congestion. The user is requested to submit indicators specifying the extent and quality of service it demands from the network. The typical examples are the declaration of the peak rate, the minimum throughput demanded in case of congestion, or some parameters describing the burstiness of the traffic (e.g., the peak rate, the average rate, and the maximum burst size [Tur92]). Regardless of the specific details of the individual designs, the bottom line is the necessity for the user to have a "contract" with the network before being able to proceed with the transmission. After the call/session/flow has been accepted by the network (using ATM terminology, we will say that the virtual path has been established), the responsibility of monitoring the user's adherence to the declared parameters is carried out at the packet (cell) level. Owing to the statistical and essentially unpredictable nature of the data flow, this task is anything but trivial. The bulk of all research in ATM networks is currently devoted to admission control and bandwidth allocation.

Footnote 12: Including those aimed at ATM networks.

3.3 Grades of service

Given that the proportion of multimedia traffic in contemporary networks is already significant and is going to increase substantially in the near future, it makes sense to consider the concept of "service grade" in the context of congestion-control policies. A multimedia application may be willing to accept a reduced bandwidth for its connection (and
operate at a lower quality of service) rather than re- entry point to the network. Note that its virtue is not
ceive no service at all|based on the original quality in guaranteeing a lossless connection at the negotiated
speci cation. For example, if there is no bandwidth rate, but rather in a simple means of enforcing the
at the moment to accept a videophone call, the cus- negotiated rate and indicating (and eliminating) the
tomer may downgrade the request to a regular voice packets that violate user's contract with the network.
connection. But even within the domain of video traf- According to our classi cation, leaky bucket is a
c (e.g., tele-conferencing), one can think of several preventive technique, operates in the transport layer
grades of service|lower grades o ering lower quality (buckets are allocated on a per-connection basis), is
of the picture. The grade-of-service approach applies not based on feedback, and admits packet loss. Vari-
both at the connection-setup level and at the level of ous forms of the leaky bucket scheme have been pro-
allocating bandwidth for individual packets (cells) re- posed in the literature. One variation of this tech-
layed by a switch. A call can be admitted at a high nique [BCS90] deals with two categories of packets
quality of service and later downgraded, if the switch which are marked by the source as green or red. Green
cannot deliver the high-quality service due to conges- packets are transmitted at the rate negotiated during
tion. The user's contract with the network becomes the call setup. Red packets represent the rate in ex-
now exible, within the limits of user's willingness to cess of the contract and are handled di erently at the
put up with service deterioration. The algorithm for intermediate nodes, according to the availability of re-
allocating bandwidth to multiple data streams handled by the switch at any given moment must take into account the possibility of reducing the effective bandwidth assigned to flexible connections. This complicates the optimization problem to be solved by the switch [Den92]. Note that a reduction in the quality of service implies a reduction in the cost of the connection, both in terms of network resources and the charge to the user's account. Since, besides offering a reasonable service to its customers, the network should also maximize its revenue, the optimization problem must be parameterized by both cost components. The issue of flexible bandwidth allocation along the lines suggested above was investigated to some extent in [KrS87] and [RaD90]. In [Den92], the problem is analyzed in depth and several allocation policies are proposed and compared.

3.4 Rate-based schemes

Current trends in preventive flow control seem to be towards rate-based mechanisms [Jai90]. A well-known rate-based control technique is the leaky-bucket scheme [Tur86]. It is a mechanism for policing the negotiated transmission rate, which is translated into the size of a virtual bucket allocated to the session. This bucket is filled by incoming packets, which may arrive spontaneously, and is emptied (leaks) at the negotiated constant rate.^13 Packets arriving when the bucket is full are discarded. Leaky bucket is basically a packet policing scheme which exercises control at an

sources. In particular, they can be dropped in case of congestion. The basic idea is to convey more important or loss-sensitive data using green packets.

Another rate-based control technique, dubbed virtual clock, was proposed as part of the Flow Network design [Zha91]. The network extends guaranteed service to its users, but the users are expected to specify their bandwidth requirements. A bandwidth specification consists of two components: the average rate at which packets will be submitted to the network and the time interval over which the measured average should match the declared value. The technique attempts to model time-division multiplexing (TDM) of the network resources among the multiple data paths (dubbed flows in the design), but, unlike the traditional TDM schemes, it accounts for the variability of packet arrivals within each path. Therefore, the multiplexing is carried out based on virtual time, in which the arriving packets are assumed to be equally spaced. The scheme guarantees loss-less packet delivery as long as the observed average data rate of a flow measured over the declared interval does not exceed the value specified when the connection was set up. A flow violating its contract receives poor service and its packets are either placed at the end of the service queues or get dropped. Functionally, the method resembles leaky-bucket policing (and it fits into the same category in our classification), but is more flexible. Besides the average rate, the user is able to specify the burstiness of the traffic.^14 The call is admitted or rejected based on these two specifications combined.

^13 In some variants of this policy, a credit may be allowed and, as long as the credit is not exceeded, the bucket may be emptied at the arrival rate of the incoming packets. This way, the packets need not be delayed if they occasionally arrive a little too fast.

^14 If the measurement interval is long, the data stream can be suspected to exhibit a substantial degree of burstiness. Conversely, short intervals indicate good predictability and a steady rate of the incoming traffic.
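The leaky-bucket policing described in section 3.4 can be illustrated with a minimal sketch. It assumes the simplest variant (no credit, as opposed to the one in footnote 13), measures the bucket in packets, and drains it lazily at each arrival; the names LeakyBucket, offer, and police are hypothetical and do not come from [Tur86].

```python
from dataclasses import dataclass


@dataclass
class LeakyBucket:
    """Illustrative policer; units and names are assumptions of this sketch."""
    capacity: float         # bucket size, in packets
    leak_rate: float        # negotiated drain rate, in packets per second
    level: float = 0.0      # current bucket content
    last_time: float = 0.0  # time of the previous arrival

    def offer(self, now: float, size: float = 1.0) -> bool:
        """Return True if a packet arriving at time `now` is accepted."""
        # Drain the bucket for the time elapsed since the previous arrival.
        elapsed = now - self.last_time
        self.level = max(0.0, self.level - elapsed * self.leak_rate)
        self.last_time = now
        # A packet that does not fit into the bucket is discarded.
        if self.level + size > self.capacity:
            return False
        self.level += size
        return True


def police(arrivals, capacity, leak_rate):
    """Apply the policer to a sorted list of arrival times."""
    bucket = LeakyBucket(capacity, leak_rate)
    return [bucket.offer(t) for t in arrivals]
```

With a bucket of three packets and a leak rate of one packet per second, a burst of four simultaneous packets loses its last packet, while a source that keeps to the negotiated rate loses none.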

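The idea of multiplexing flows in virtual time, with the packets of each flow assumed to be equally spaced, can be sketched as follows. This is a simplified illustration rather than the Flow Network design of [Zha91]: it ignores the measurement interval and packet dropping, and the class and method names are hypothetical. Each packet is stamped with the virtual time at which it would have arrived had its flow kept exactly to the declared rate, and packets are served in the order of their stamps.

```python
import heapq


class VirtualClockScheduler:
    """Simplified virtual-clock multiplexer (illustrative sketch only)."""

    def __init__(self, rates):
        # rates: flow id -> declared average rate, in packets per second
        self.tick = {flow: 1.0 / rate for flow, rate in rates.items()}
        self.clock = {flow: 0.0 for flow in rates}
        self.queue = []  # heap of (stamp, arrival order, flow, packet)
        self.seq = 0

    def arrive(self, flow, now, packet):
        """Stamp the packet with its flow's virtual clock and enqueue it."""
        # The clock never lags behind real time, so an idle flow cannot
        # accumulate credit; each packet advances it by one spacing.
        stamp = max(now, self.clock[flow]) + self.tick[flow]
        self.clock[flow] = stamp
        heapq.heappush(self.queue, (stamp, self.seq, flow, packet))
        self.seq += 1
        return stamp

    def next_packet(self):
        """Return (flow, packet) with the smallest stamp, or None if idle."""
        if not self.queue:
            return None
        _, _, flow, packet = heapq.heappop(self.queue)
        return flow, packet
```

A flow that submits packets faster than its declared rate receives ever later stamps, so its excess packets drift towards the end of the service order, which mirrors the poor service given to flows that violate their contracts.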
3.5 Local schemes

Congestion-control schemes operating locally (in the data-link layer) tend to be simpler and less expensive in terms of negotiation delays than connection-based techniques. In very high-speed networks, these delays are more pronounced and practically restrict the applicability of the solutions operating in the transport layer to connection-oriented sessions of a non-trivial duration. Datagram traffic is often handled differently. For example, in the Flow Network, a portion of network resources is set aside and reserved for datagrams. The network offers the best-effort service to datagram traffic, within the limitation of the pre-allocated resource pool.

Many simple switching techniques (e.g., self routing without intermediate buffering) accompanied by simple routing rules (e.g., as in Manhattan-street or shuffle-exchange networks) offer natural means of avoiding local congestion and guarantee reasonable packet delivery on the global scale. These methods belong to the routing protocols (congestion is avoided as a byproduct of the routing rules) and will be discussed in section 7 where we present a number of case studies. To hint at some other solution operating in the data-link/network layer, let us mention the design introduced in [KAS91]. With the proposed solution, every switch is equipped with a neural arbiter which learns to make optimal routing decisions by adjusting the parameters of a set of fuzzy rules that determine the suitability of an output link for an incoming packet. Although the scope of the exercise discussed in [KAS91] seems rather insignificant (a 4x4 Manhattan-street network), the solution suggests an interesting approach to handling congestion in a large-scale high-speed network. Notably, the neural arbiter bases its decisions on local information: the contents of the packet header and the status of outgoing links. Another solution with a similar flavor, aimed at the integration of ATM call admission and link capacity control, can be found in [Hir91].

The issue of speed disparity between the communicating peers receives a special flavor in the context of inter-networking. Connecting two networks with different transmission rates and different characterization of the traffic pattern at their gateways typically requires a non-trivial congestion resolution scheme. In [WoS89], the bottleneck situation at a gateway that connects a lower-speed LAN to a high-speed MAN is discussed and a flow-control scheme applicable to such a scenario is proposed. A study of packet loss in high-speed networks interconnecting conventional local-area networks can be found in [GMW92].

4 Buffering policies: deflection vs. store-and-forward

It is possible for more than one incoming packet to opt for the same outgoing link at the same time. In such a case, the node has to decide what to do with the packet (or packets) that cannot be immediately relayed on their preferred links. Such a packet can be buffered until the link becomes available or it can be relayed on another (available) link along a sub-optimal path to the destination. In the latter case, we say that the packet has been deflected. Deflection routing is suitable for networks with limited or non-existent buffer space at the nodes. Generally, buffering transient packets at intermediate switches (relays) has a number of disadvantages which are amplified when the network operates at a very high transmission rate. The arguments in favor of eliminating buffers or at least reducing their size to a minimum can be summarized as follows:

1. Networks with large (practically infinite) memory switches are as susceptible to congestion as networks with low-memory switches. In the former case, the queuing delays can get so long that by the time the packets come out of the switch, most of them may have been already retransmitted by the higher layers due to timeouts. In fact, too much memory is more harmful than too little memory, since the packets or their retransmissions have to be dropped after they have consumed precious network resources [Jai90].

2. Due to the nature of real-time applications, which are characterized by stringent delay requirements, long buffers should be avoided, at least for this particular group of users.

3. Elimination of buffers can speed up switching significantly so that the process can follow the link speed as closely as possible. Particularly, all electronic components can be removed from the switch and replaced with their optic equivalents that can operate at the link speed by avoiding E/O and O/E conversions. Buffer-based contention resolution without resorting to E/O conversions seems to be very difficult at the current level of technology and requires cumbersome optical delay lines [JaM93].

It is obvious that deflection causes some packets to traverse longer paths. Furthermore, unless some countermeasures are taken, it is possible for a packet to travel indefinitely. In general, it can be said that

if the probability of deflection at every intermediate node is equal to 1/2 or greater, deflection routing is not a good alternative to buffer-based contention resolution schemes. The problem can be presented as the so-called Gambler's Ruin problem. Suppose that at a given instant a packet is at half-way distance to its destination and still has d/2 hops to cover. In other words, it has some money (d/2, the distance it has already covered) and it needs as much more to finish the game (to reach its destination) as opposed to going bankrupt (being pushed d hops away from its destination again, equivalent to going back to the position where it started). At every node it rolls a die (competes with other packets for a particular link) and loses with a certain probability p_d. If it wins (and is not deflected), it gets closer to the destination by one hop. Otherwise, the remaining distance increases by the penalty of deflection. In general, the deflection penalty is at least 2. For example, in ShuffleNet, an undeflected packet can gain only one step in the right direction while losing k (the binary logarithm of the number of nodes) in case of deflection. The probability of deflection (p_d) can be decreased in the following ways:

1. Keeping the offered load below the saturation threshold of the network, at a level that guarantees few contentions.

2. Choosing a network topology that offers multiple shortest paths or at least many paths that are only marginally worse than the shortest path.

3. Giving priority to the packets that are closer to their destinations (reducing the relative impact of the deflection penalty).

4. Giving priority to the packets that have been previously deflected. This approach attempts to balance the number of deflections suffered by a single packet in a congested network.

5. Giving priority to the packets that have spent the longest time in the network. This solution is similar to the previous one, but it also accounts for the propagation distance traveled by a packet.

6. Buffering the packets that are to be deflected for a short amount of time (e.g., until the next round) to give them another chance, as opposed to diverting them from their course immediately.

7. Discarding the packets that have exceeded a certain hop count limit. Such packets stand a good chance of being obsolete and they unnecessarily compete with other packets, possibly causing them to deflect.

Undeniably, deflection routing can cause some deterioration in the performance of the network compared to its store-and-forward version with the same geometry. In [AcS92] a comparative analysis is presented for ShuffleNet with p = 2.^15 The study shows that the maximum throughput achievable by the network operating under deflection routing can be substantially lower than that achieved by its store-and-forward counterpart, with the difference becoming bigger for a larger number of nodes. Nonetheless, even large deflection-based networks with several thousand nodes are able to achieve no less than 25% of the throughput attainable by their store-and-forward analogues with unlimited buffers.

The most painful disadvantage of deflection, as opposed to store-and-forward routing, seems to be the inherent unpredictability of delays of individual packets belonging to the same higher-level message. Although in many solutions based on the store-and-forward approach (including ATM networks), the delays of individual packets may also vary substantially, at least these solutions are capable of preserving the ordering of packets upon their arrival at the destination. Most people believe that the inherent inability of deflection-based routing schemes to deliver packets "in order" renders them unsuitable for "serious" applications in connection-oriented environments. We will return to this issue in section 8.

5 Topology

Topological properties of a network can be examined separately from the routing and congestion-control mechanisms and can provide clues regarding the suitability of different choices in the design of the routing scheme. They are also directly related to the maximum throughput achievable by the network and to its resistance to faults.

According to their topologies, networks can be grouped into two broad categories: point-to-point networks and broadcast networks.^16 A broadcast network employs a single channel accessed by all nodes. A node willing to transmit a packet must follow some rules to

^15 p is the order of the network graph, i.e., p = 2 means that there are two incoming and two outgoing links per each node.

^16 For efficiency reasons, the topology of a very large network may be organized into several hierarchical layers, e.g., as in telephone networks. Consequently, hybrid topologies may form. We are not dealing here with such cases.

reserve the channel. Every single transmission propagates to all stations in the network and, in particular, it reaches the intended recipient. For the reasons mentioned in the introduction, broadcast networks are not good candidates for large high-speed networks, unless the vast majority of the traffic is indeed of a broadcast nature. But even then, the methods of accessing and reserving broadcast channels tend to incur overheads proportional to the propagation length of the channel. Consequently, the impact of the access overhead will grow with the increasing transmission rate of the network and/or its geographical diameter.

In a point-to-point network, a single channel connects one pair of nodes and is typically used for transfers in one direction.^17 Consequently, there is no need to arbitrate channel access. Due to the high (quadratic) cost of providing a direct link from every node to all other nodes, in most networks of a non-trivial size a node is connected directly to a subset of nodes. Connections are made in such a way that between any given pair of nodes there is at least one path in each direction. By a path we mean a collection of node-to-node links connecting a given source to a given destination. In this structure, a packet needs to be relayed from node to node to reach its ultimate recipient. If a given node has more than one outgoing link, it must make routing decisions, i.e., for every incoming packet it must decide on which outgoing link the packet will be relayed.

The number of nodes that perform routing tasks along a given path is called the hop count of the path. Note that, according to our definition, the hop count of a path can be different from the number of nodes that the path passes through. This happens if the path includes repeaters, i.e., nodes with a fixed assignment of the input links to the output links. Generally, the situation may not be so simple, as a node appearing as a repeater for one packet may appear as a routing switch for another.^18 We will say that a given node contributes to the hop count of a packet if the node has actually made a routing decision determining the fate of the packet, i.e., the packet could have followed another path forking at the node in question.

For obvious reasons of efficiency, a routing node typically tries to relay packets along paths of minimum delay (i.e., the shortest paths) leading to their destinations. The determination of the shortest path is related to the costs assigned to the links. In most cases (see section 3), the cost of a link is determined by its nominal capacity, its length, and its current congestion level. If we assume that all links are of the same capacity and the same (or comparable) length, and if we ignore the congestion component of the link cost, then the shortest path is also the path with the minimum hop count.

In some cases, when the network topology is highly irregular, its properties can be fully described only by specifying the complete network graph together with the parameters of its edges. Generally, one can get some idea as to the expected behavior of a network by looking at the following global parameters:

1. The network size (denoted by N): defined as the number of nodes in the network.

2. The network diameter (denoted by D), defined as follows:

$$D = \max_{1 \le i, j \le N} \{\delta_{ij}\} \qquad (2)$$

where $\delta_{ij}$ stands for the minimum number of hops separating nodes $S_i$ and $S_j$. According to the above formula, D gives the maximum of all shortest path lengths over all pairs of nodes. One interesting characterization of this parameter is its dependence on the network size N (e.g., logarithmic vs. linear).

3. The average hop count (h): the average length of a shortest path taken over all pairs of nodes. This parameter is calculated as follows:

$$h = \frac{1}{N} \sum_{i=1}^{N} \frac{\sum_{j=1}^{N} \delta_{ij}}{N - 1} \qquad (3)$$

where $1 \le i, j \le N$ and $i \ne j$.

4. The degree of connectivity: gives the number of incoming and outgoing links connected to one node. If all nodes have the same degree of connectivity^19 we say that the network topology is regular. Such a network is also referred to as p-connected, where p is equal to the in/out degree of a node. For some interconnection patterns the degree of connectivity has to be increased as the network size grows.

One more simple numerical parameter of a network, which isn't strictly topological in nature, although it

^17 At least this is the way the channel appears to the data-link layer.

^18 For example, this may happen in a network in which a virtual topology is embedded into a physical topology. A single physical node may emulate components of several logical paths with different properties.

^19 Note that if every node in the network is interfaced to the same number of links, the number of incoming links per node must be the same as the number of outgoing links.

depends primarily on the topology, is the deflection penalty. Defined with respect to the routing algorithm, this parameter gives the least upper bound on the number of hops that a single deflection adds to the packet's path on its way to the destination. Sometimes the deflection penalty can be defined without explicit reference to the routing scheme as the girth (the length of the shortest cycle, if any) of the network graph. This is the case when any packet can be potentially relayed on any outgoing link of a node (i.e., the routing algorithm potentially explores all paths in the network graph).

The topological properties of a network are directly related to its maximum achievable throughput (U). If all links are of the same capacity and length, meaning that a packet traverses one link within one unit of time, then h gives the sojourn time per packet, i.e., the mean amount of time a packet must spend in the network before reaching its destination. In other words, h gives the average cost per packet transmission in terms of time and/or the number of links that must be visited by the packet. If the following conditions are fulfilled:

1. the network is symmetric,

2. all links are of the same capacity,

3. the amount of buffer space at a node is infinite (i.e., packets are never lost and they always travel along the shortest paths),

4. the traffic is uniformly distributed,

5. the packet generation rate is the same for all nodes,

then the relationship between h and U can be expressed as follows:

$$U = \frac{\text{Total Number of Links}}{h} \qquad (4)$$

Although some of the prerequisites to formula 4 are not very realistic, the formula can be used to estimate the maximum throughput of various realistic networks. Generally, the larger the network, the more the small irregularities in the network topology tend to cancel out statistically. One can argue that the assumption about the uniform distribution of traffic does not hold in a large network, due to the presence of local communities of interest. In such a case, some simple biased distributions can be considered [Dan91, GbM93] which only slightly complicate the above formula.

Useful as it is, formula 4 has its limitations. Although it says that networks with lower values of h should achieve a better throughput than networks with longer shortest paths, it has been reported that some semi-random or random topologies with lower hop counts yield significantly poorer throughput compared to regular networks with higher values of h [Ro92a, Ro92b]. This is not surprising since the fairness and regularity of the topology, as well as the variance of h perceived by different nodes, have their own merits which should be carefully considered prior to a meaningful evaluation.

6 Gbps networks and new challenges

In this section, we will investigate the fundamental issues that render the Gbps networks different from their slower counterparts and discuss their implications on the design of the network topology, routing schemes, and congestion-control mechanisms. We choose to group these issues into two categories:

(1) Processing Bottleneck: In a packet switching network, the time required to make a routing decision cannot be longer than a single packet's transmission time. Otherwise, the system becomes unstable since the ratio ρ = λ/μ of the arrival rate λ to the service rate μ at a node becomes greater than 1 and queue lengths grow indefinitely. Assuming the 53-byte cell length^20 of ATM networks and the transmission rate of 1 Gbps, a node has at most 424 ns to complete the following tasks for every packet arrival:

1. Selecting the appropriate output link on which the packet should be relayed. This is accomplished either by using the routing information contained in the packet header or by consulting a table which is usually indexed by the destination address.

2. Resolving a possible contention in such a way that the network performance (according to the routing criteria) is optimized.

A deflection-routing scheme can be synchronous, when multiple packets (slots) arrive at a switch at the same time and their fate is determined globally in a single compound routing decision [BoC90, Max87], or asynchronous, when packets are treated individually and relayed on whatever outgoing links are available at the time of their arrival [GbM93]. With the synchronous approach, the routing decisions can be more

^20 Which is considered small for data applications.

intelligent because more options are available when they are being made. However, the incoming slots must be aligned prior to the decision which complicates the switch design and may require backpressure mechanisms if the network is large. A natural objective of a synchronous routing scheme is to fulfill the demands of as many packets as possible, e.g., by minimizing the overall deflection penalty. For a switch with non-trivial connectivity, the best decision can only be arrived at by solving a difficult optimization problem. Although, due to the discrete and limited set of parameters, there are always ways of solving this problem in a fast way (by resorting to lookup tables), the amount of storage required to represent the needed data as well as the access techniques required to retrieve those data on-line may pose serious implementation problems.

In a node with the connectivity degree of p, p incoming packets can be assigned to p outgoing links in p! ways. With a store-and-forward approach the situation is even worse: multiple incoming packets can be directed to the same outgoing link, which gives $p^p$ possible assignments. The following table illustrates the scope of the problem space for a few typical connectivity degrees:

p     Deflection      S&F
2     2               4
4     24              256
8     40320           16777216
16    2.09 × 10^13    1.84 × 10^19

Multiple priority levels may further contribute to the cost. Similar problems are encountered in telephone networks and multiprocessor interconnection networks. Most of the proposed solutions are heuristic in nature.

In this regard, structures with low connectivity degrees have better chances for realistic optimal implementations. Additionally, regular topologies with many alternative shortest paths between every pair of nodes may reduce the size of the routing problem by grouping large classes of solutions into clusters with the same rank. As pointed out in [GbM93], small departures from the optimality of routing decisions may have no visible impact on network performance, yet they may significantly reduce the complexity of the routing problem. Moreover, by selecting one of equally-ranked alternatives in a nondeterministic way, the routing scheme reduces the likelihood of a livelock [Max90].

Alternatively, one can opt for asynchronous routing. With this approach, when a packet arrives at a switch, it is relayed on the most suitable from the currently available outgoing links. This idea works poorly in networks with low connectivity. To see the problem imagine that a packet arrives at an idle 2-connected switch. Suppose that the packet prefers no specific output link so the switch is free to select one of the two links at random. Now, while the packet is retransmitted another packet arrives and prefers the busy link. Clearly, the new packet must be deflected, but it could have been relayed according to its preference, had the decision regarding the first packet been delayed until the second packet was available. One can easily imagine how this scenario could evolve into a situation in which packets arriving at the switch continuously on both incoming links are indefinitely deflected because they keep arriving "out of phase" and every new packet finds its preferred link busy. However, when the connectivity degree is high (e.g., 8 or 16), asynchronous routing may be a viable alternative to the synchronous approach. Note that besides reducing enormously the complexity of routing decisions, asynchronous routing schemes eliminate the need for slot alignment and they can handle packets of variable length. In [GbM93] the reader will find the performance study of a prototype gigabit network based on asynchronous deflection.

(2) Dominating Propagation Delays: The propagation latency of channels resulting from the finite speed of electro-magnetic signals in the media is amplified by high transmission rates. In [Kle92], this relationship is described by the parameter a which is defined as follows:

$$a = \frac{\text{Propagation Delay}}{\text{Packet Transmission Time}} \qquad (5)$$

The following table lists the values of a for a few typical channel types (the fiber link is assumed to connect two points across the continental USA), for the packet length of 1000 bits:

Network       Capacity (Mbps)    Prop. delay (microsec.)    Ratio a
Local Net     10.00              5                          0.05
WAN           0.05               20000                      1.00
Satellite     0.05               250000                     12.50
Fiber link    1000.00            15000                      15000.00

The value of a tells how many packets can be pumped into one end of the link before the first bit

of the first packet appears at the other end. The large value of a for high-speed channels is the source of numerous problems encountered not only in routing and flow control, but also at the application level.

6.1 Impacts on topological design

Due to high installation and maintenance costs, it is reasonable to expect that real-life MANs and WANs will be deployed with some optimization criteria in mind. Taking into account that the propagation time on the links is the major contributor to packet delays, it is desirable to minimize the total cable length needed to connect the network. The relevant geometric problem of connecting n given points in the plane with a set of shortest possible lines is known as the Euclidean minimum spanning tree problem [Sed88]. One way to solve this problem is to build a complete graph with n(n − 1)/2 undirected edges which are weighted according to the distance^21 between the vertices. Next one can apply a minimum spanning tree (MST) construction algorithm to select the subset of edges of minimum total weight that connects the graph. All links in the network can then be laid along these edges.

Sometimes it is possible to improve upon the result produced by the above method by introducing additional (artificial) vertices to the graph, which do not correspond to any nodes in the network [Law76]. The extra vertices are referred to as the Steiner points and it can be proved that for any n points to be spanned there exists a minimum-length Steiner tree which contains no more than (n − 2) Steiner points. It is also conjectured that the cost of the minimum spanning tree (based on the original collection of nodes) is no more than $2/\sqrt{3}$ times the cost of the minimum-length Euclidean Steiner tree [SLL81]. However, the problem of building the minimum Steiner tree for a given collection of points is NP-complete [GaJ79]. It has been expressed in many forms and a catalog of its different formulations can be found in [GoM93]. The most efficient known heuristic [Win87] which seems to produce good solutions in a reasonable time is given in [SLL81].

The optimal physical configuration of a network in terms of its total cable span may be incompatible with the requirements of the routing algorithm, which may be based on the assumption that the network topology is highly regular (e.g., as in ShuffleNet). In order to retain the advantages of regular interconnection networks, embedding a virtual topology into a given physical topology can be considered. One way of achieving this is by resorting to wavelength division multiplexing (WDM). WDM carves the bandwidth offered by the optical fiber into a number of smaller chunks and assigns them on a wavelength basis to the nodes. A wavelength can be assigned to a transmitter-receiver pair (producing a virtual point-to-point link) or it can be shared by a group of transmitters and receivers. The latter approach is particularly useful for describing virtual topologies that can formally be defined only for certain "round" numbers of nodes (e.g., ShuffleNet or a hypercube). By dedicating a wavelength to a specific subnet (e.g., a plane in the hypercube), one can reserve the virtual links needed in the complete structure without connecting them to the actual nodes which may be absent. This way incomplete configurations can be built and later expanded in an incremental way. The advocates of this approach claim that changes in the virtual topology can be carried out dynamically to follow the observed changes in the traffic pattern. Thus, the network can adapt its topology to the traffic conditions in a way that maximizes its performance. Details of WDM can be found in [VWD91, MoG90, BFM90, LaA91, ZhA90, HlK88, Aca87, AKH87]. These studies raise the following two problems:

1. The embeddings of the hypercube and ShuffleNet are discussed in [VWD91] and [BFM90], respectively. However, the efficient bit-controlled routing algorithms developed for these networks are solely concerned with minimizing the number of hops and ignore the lengths of the links. In other words, the dominating propagation delays are excluded from the scope of the routing criteria. Consequently, the minimization of the hop count does not imply the minimization of the delay. A remedy is suggested in [VWD91] which consists in assigning the wavelengths in a fashion that minimizes the physical lengths of the virtual connections. However, the authors point out that the related subproblems of network configuration, namely: (1) the mapping of the nodes in the physical topology to the nodes in the virtual topology, (2) the mapping of the edges in the physical topology to the edges in the virtual topology, (3) the allocation of the wavelengths to edges, are all expected to be NP-hard as the search space grows at least as fast as N!. Consequently, effective heuristics are in demand.

^21 The metric in which this distance is expressed is not necessarily Euclidean: the distance between two points $(x_1, y_1)$ and $(x_2, y_2)$ can be expressed either in Euclidean metric $L_2$ as $\sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$ or in rectilinear metric $L_1$ as $|x_1 - x_2| + |y_1 - y_2|$. Rectilinear metric has been suggested as a more realistic alternative for urban environments [BFM90].

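The spanning-tree construction outlined in section 6.1 can be sketched in a few lines. The sketch below assumes Prim's algorithm on the complete graph (the text does not prescribe a particular MST algorithm) and accepts either of the two metrics from footnote 21; all function names are hypothetical. It does not attempt to add Steiner points.

```python
import math


def euclidean(a, b):
    """L2 distance between two points given as (x, y) pairs."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def rectilinear(a, b):
    """L1 distance between two points given as (x, y) pairs."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def minimum_spanning_tree(points, metric=euclidean):
    """Prim's algorithm on the complete graph over `points`.

    Returns (edges, total_length), where edges are pairs of point indices.
    """
    n = len(points)
    if n == 0:
        return [], 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known link from the tree to each node
    parent = [-1] * n
    best[0] = 0.0
    edges, total = [], 0.0
    for _ in range(n):
        # Pick the node outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((parent[u], u))
            total += best[u]
        # Update the attachment costs of the remaining nodes.
        for v in range(n):
            if not in_tree[v]:
                d = metric(points[u], points[v])
                if d < best[v]:
                    best[v] = d
                    parent[v] = u
    return edges, total
```

For n points the tree always has n − 1 links; the choice of metric changes only the edge weights, so the same routine covers both the Euclidean and the rectilinear (urban) cases.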
2. The other problem is related to the cost of the routing procedure. In [LaA91], the total network throughput of ShuffleNet is improved by modifying the topology such that the largest flow on any link is minimized. The throughput improvement ranges from 6.8% for the quasi-uniform traffic matrix to 24.3% for the ring-type traffic. The employed flow deviation method requires an iterative dynamic re-calculation of the shortest paths. However, it is not clear how a switch designed for ShuffleNet's connectivity pattern is supposed to function on the modified topology. It seems that the gain in flexibility is paid by re-introducing large routing tables into the network and increasing the cost of the routing algorithm.

The dynamic reconfiguration of virtual topologies based on the changes in the observed traffic pattern is too tempting to be rejected because of the difficulties mentioned above. It seems that the issue should be addressed in the context of the routing schemes that were originally designed for static topologies with physical rather than virtual regularities. Perhaps a good solution can be arrived at by designing special routing schemes which would inherently assume the dynamic nature of the network topology.

making switch from the destination. The coordination of the status information among the nodes is achieved using a rather complex collection of algorithms that work more or less independently and yet support each other by exchanging information or services. This approach presents serious drawbacks for gigabit networks. In [Sch80] six performance measures are proposed that can be used to compare routing algorithms operating in a decentralized fashion. The first and the most important measure is the speed of response:

    This measure is obviously extremely important in dynamic environments since the speed of response must be faster than the rate of change of network topology. Otherwise convergence will not occur and the routing algorithm will be useless.

Regarding flow based methods, in [BeG87] it is stated that:

    Implicit in flow models is the assumption that the statistics of the traffic entering the network do not change over time. This is a reasonable hypothesis when these statistics change very slowly relative to the average time required to empty the queues in the network and when link flows are measured experimentally using time averages.

6.2 Impacts on routing protocol design
Most routing protocols that were proposed for large In other words, any relevant changes that a ect
and slow networks are based on variants of shortest- the trac patterns in the network should be closely
path or least-cost algorithms [BeG87]. From a theoret- followed by the corresponding updates in the routing
ical standpoint, the routing problem can be viewed as information available at individual nodes. The time
an optimization problem. Optimal routing is shown interval between the changes must allow for the up-
to be formally equivalent to optimal ow control date information to propagate to the involved nodes,
[BeG87]. Consequently, the optimality criteria and al- for the nodes to calculate a new optimum state, and
gorithms developed for the latter are applicable to the for the network (speci cally its users) to bene t from
former. Given a speci c cost function, optimal rout- the reaction of the routing algorithm to the require-
ing can be achieved|at least in principle. One can ments of the new state.
always argue, however, whether the optimality crite- In a gigabit network, owing to large a, the gap men-
ria well re ect network goals, i.e., whether the optimal tioned above may be one or two orders of magnitude
behavior of the network according to those criteria ap- larger than the actual duration of a typical transition
pears also optimal (in the informal sense) to the users. from one trac pattern to another. For example, to
The cost function typically associates with each link transmit a 1 MB le across a LAN (see the table at
a certain value which is adjusted dynamically accord- the beginning of section 6) one can negotiate all the
ing to the varying load of the link. In the simplest resources needed for this transfer in advance, which
case, the function returns one of two values indicating operation will take a tiny fraction of the time needed
whether the link is available at the moment or used to to transmit the rst packet of the le. On the other
relay a packet. In a global optimization scheme, the hand, when the le is to be sent through a 1 Gbps
suitability of a link to relay a packet addressed to a network spanning the area of continental USA, the
given destination is a ected by the status of the re- amount of time required to announce the transfer to
maining components of the packet's path, i.e., by the all the nodes that may be involved in it is almost twice
congestion level at the nodes separating the decision- as large as the amount of time needed to pump the le

into the network by the source. Of course, we have to add to this amount the time needed for the feedback information to propagate across the network as well as the time required by the switches to recalculate their routing parameters.

In the light of the above observation, a reasonable set of optimality criteria applicable to a gigabit network must account for the relatively short duration of many transient traffic episodes. Such episodes should not force distant nodes to re-negotiate their routing criteria. Instead, those nodes should base their decisions on local criteria, possibly employing a policy of verifying these criteria against long-term patterns observed over extended periods of time.

6.3 Impacts on congestion control

In [BCS90] the unsuitability of the conventional mechanisms based on end-to-end or hop-by-hop windowing schemes for controlling congestion within high-speed networks is pointed out as follows:

    1. Window-based mechanisms typically rely on end-to-end exchange of control messages in order to regulate traffic flow. The control messages (sometimes with additional congestion information added by the intermediate nodes) are used as feedback by the source node to regulate its traffic. In high-speed networks, the propagation delays across the network typically dominate. Thus the feedback is usually outdated and any action the source takes is too late to resolve buffer overflows and avoid congestion. This argues for mechanisms that do not heavily rely on network feedback.

    2. It is also important that the congestion-control mechanisms operate at the speed of the communication link. For this reason, especially in the case of hop-by-hop window-based mechanisms, computationally-intensive control schemes are less desirable than simple schemes that can be easily implemented in high-speed hardware.

    3. The nature of the traffic also affects the design of the congestion control. While data traffic can usually be slowed down in order to cope with network congestion, it is likely that the real-time nature of the traffic will require some level of bandwidth guarantee. Real-time traffic (e.g., voice, video) has an intrinsic rate determined by the external factors that are outside the control of the network. Typically this rate can be estimated by the network prior to the establishment of the connection. The ability to slow down such sources is usually very limited. However the packet arrival process is stochastic implying that there is no guarantee that over short periods the resource will keep to the specified average rate. In addition, the initial estimate of the rate may be incorrect.

The subject is also discussed in [Jai90] where the author draws attention to the basic principles of control theory. It is pointed out that no scheme can possibly solve a congestion problem that lasts shorter than the feedback delay needed to notify the offending parties and perceive their reaction. The author suggests a multi-level congestion control architecture for handling short-term and long-term congestion scenarios at different levels of the hierarchy.

7 Case studies

In this section, we discuss a few proposals for a high-speed network design that have appeared in the literature. Their common characteristics are the self-routing capability and the ability to function with a limited buffer space at individual nodes. The latter also implies the compatibility with routing schemes based on deflection.

7.1 Hypercube

A complete hypercube consists of $N = 2^n$ nodes which are numbered by $n$-bit binary numbers, from 0 to $2^n - 1$, and interconnected in such a way that there is a link between two nodes if and only if the binary representations of their addresses differ on exactly one bit. Therefore, the nodal connectivity degree is $n$. The network diameter is also $n$, since this is the maximum number of binary positions on which two node addresses may differ. The distance between a pair of nodes is equal to the Hamming distance between their addresses expressed in binary format. The average hop distance is equal to half the diameter ($h = \frac{N \log_2 N}{2(N - 1)} \approx \frac{n}{2}$). The penalty of deflection is 2 since all connections are bidirectional and a hop that

doesn't reduce the Hamming distance to the destination by one, increases it by one. Other topological properties of the hypercube are discussed in [SaS88].

A hypercube is a richly connected structure. Although it is non-optimal in terms of diameter, it can deliver optimal performance when the traffic is uniformly distributed, even with very simple routing mechanisms. Routing in a hypercube can be performed as follows. At every intermediate node xor is performed on the address of the current node and the destination address. The locations of 1's in the bit pattern representing the result indicate the preferred routes. Assume that the output links of every node $s$ are numbered in such a way that link number $i$ leads to the node whose binary address differs from the address of $s$ on position $i$. The basic routing algorithm for the hypercube is given in [Kat88] as follows (⊕ represents the xor operation):

    Compute: relativeaddr = currentaddr ⊕ destinationaddr.
    Starting with the most significant bit of relativeaddr:
        let i be the bit number of the first 1 in relativeaddr.
    Forward the packet on link i.

Numerous performance studies of hypercubes indicate that the structure is very suitable for deflection routing [Szy90, GrH92, Haj91, GbM93]. Between two nodes, node($i$) and node($j$) with the Hamming distance of $H(i, j) < n$, there are $H(i, j)$ node-disjoint paths of length $H(i, j)$. Furthermore, $n$ different node-disjoint paths whose lengths are less than or equal to $(H(i, j) + 2)$ are available [SaS88]. Consequently, the average number of deflections suffered by a packet under uniform load is at most $O(\log n)$, regardless of the traffic intensity [GrH92]. There are $n$ unidirectional links in the network per node and a packet must traverse $n/2$ links on the average, limiting the maximum achievable throughput to 2 per node.$^{22}$ In [GrH92], it is demonstrated that even with the added penalty of deflection, a maximum throughput very close to 2 can be sustained.

$^{22}$ Meaning 2 packets per one-packet time slot or two bits per one-bit slot.

Since the number of nodes in a hypercube must be a power of 2, there are large gaps in the sizes of the system that can be built with this architecture. One solution to this problem can be found within WDM-based structures, as suggested in section 6.1. Another possibility is to use incomplete hypercubes which are defined for an arbitrary number of nodes [Kat88]. In an incomplete hypercube, node connectivity rules remain the same as before (i.e., link($i$) connects two nodes whose addresses differ at the $i$-th bit position and nowhere else), but some of the links are missing. The basic routing algorithm has to be modified as follows (the modification is the added condition that link $i$ exists):

    Compute: relativeaddr = currentaddr ⊕ destinationaddr.
    Starting with the most significant bit of relativeaddr:
        let i be the bit number of the first 1 in relativeaddr,
        where link i exists for the current node.
    Forward the packet on link i.

For an incomplete hypercube with $N$ nodes, this algorithm offers the worst-case path length of $\lceil \log_2 N \rceil$.

The main disadvantage of hypercubes is that their nodal connectivity degree increases (logarithmically) with the network size [Muk92]. Consequently, hypercube networks cannot be scaled up without reorganizing the node structure.

7.2 Shuffle-like (minimum diameter) networks

As we pointed out in section 5, the network diameter ($D$) has a paramount impact on the maximum throughput achievable by the network. Thus, it is natural to consider designs that minimize $D$ for a given number of nodes and degree of nodal connectivity. It is known that the minimum diameter $D$ and the maximum connectivity degree $p$ of a directed graph are related to each other in the following way:

$$N \le 1 + p + p^2 + \cdots + p^D = \frac{p^{D+1} - 1}{p - 1} \qquad (6)$$

If the equality holds, then it is said that the Moore bound is achieved and the graph is called a Moore graph. Note that for a directed Moore graph the lower bound on diameter is:

$$D_{min} = \lceil \log_p (N(p - 1) + 1) \rceil - 1 \qquad (7)$$

and the average hop distance $h$ is bounded from below by [HlK88]:

$$h_{min} = \frac{p - p^{D+1} + N D (p - 1)^2 + D(p - 1)}{(N - 1)(p - 1)^2} \qquad (8)$$

It is also known that there are no directed Moore graphs for nontrivial values of $D$ and $p$ [BrT80].

One of the best known families of graphs that come close to the Moore bound is the family of de Bruijn graphs [Bru46]. A directed de Bruijn graph with connectivity $p$ and diameter $D$ has $N = p^D$ nodes. A general class of graphs has been proposed by Imase and Itoh [ImI83, ISO85, HoP88] which contains de Bruijn graphs as a subclass. The design procedure outlined in [ImI83] produces graphs with the upper bound $D \le \lceil \log_p N \rceil$ for arbitrary values of $N$ and $p$. Another approach to constructing networks with low mean inter-nodal distances, based on simulated annealing, can be found in [Ro92b].
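Relations (6) to (8) are easy to check numerically. The following Python sketch is only an illustration: the function names are ours, it assumes $p \ge 2$, and it assumes that $D$ in (8) is the minimum diameter given by (7). The minimum diameter is found by integer search rather than by evaluating the logarithm in (7), which avoids floating-point rounding.

```python
def moore_bound(p: int, D: int) -> int:
    """Right-hand side of (6): 1 + p + p^2 + ... + p^D."""
    return sum(p ** j for j in range(D + 1))


def min_diameter(N: int, p: int) -> int:
    """Smallest D satisfying (6) for N nodes of degree p; equals (7)."""
    D = 0
    while moore_bound(p, D) < N:
        D += 1
    return D


def h_min(N: int, p: int) -> float:
    """Lower bound (8) on the mean hop distance (assumes p >= 2, N >= 2)."""
    D = min_diameter(N, p)
    numerator = p - p ** (D + 1) + N * D * (p - 1) ** 2 + D * (p - 1)
    return numerator / ((N - 1) * (p - 1) ** 2)


def de_bruijn_size(p: int, D: int) -> int:
    """Number of nodes of a directed de Bruijn graph: N = p^D."""
    return p ** D


if __name__ == "__main__":
    print(moore_bound(2, 3))     # 15 nodes at most for p = 2, D = 3
    print(min_diameter(8, 2))    # 3
    print(h_min(8, 2))           # 13/7, about 1.857
    print(de_bruijn_size(2, 3))  # 8
```

For $N = 8$ and $p = 2$ the bound on $h$ is $13/7$, so no 8-node network of degree 2 can have a mean hop distance below about 1.86.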

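Returning briefly to section 7.1, the routing rule quoted from [Kat88] can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions, not the implementation of [Kat88]: addresses are plain integers, link $i$ flips bit $i$, and the optional `existing_links` argument (a hypothetical name) lists the bit positions for which the current node actually has a link, which models the incomplete hypercube.

```python
from typing import List, Optional, Set


def next_link(current: int, destination: int, n: int,
              existing_links: Optional[Set[int]] = None) -> Optional[int]:
    """Return the preferred outgoing link, or None if no link qualifies.

    Scans relativeaddr = current XOR destination from the most significant
    bit and picks the first 1 whose link exists at the current node.
    """
    relative = current ^ destination
    for i in range(n - 1, -1, -1):
        bit_set = (relative >> i) & 1
        if bit_set and (existing_links is None or i in existing_links):
            return i
    return None


def route(source: int, destination: int, n: int) -> List[int]:
    """Path followed in a complete n-dimensional hypercube, no deflections."""
    path = [source]
    node = source
    while node != destination:
        i = next_link(node, destination, n)
        node ^= 1 << i          # traversing link i flips bit i of the address
        path.append(node)
    return path
```

For example, `route(0b000, 0b101, 3)` visits nodes 0, 4 and 5, so the number of hops equals the Hamming distance between source and destination.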
Shuffle-like networks provide mean inter-nodal distances approaching the Moore limit [Ro92b]. However, these topologies are not defined for networks of arbitrary size. Figures 1-3 depict three shuffle-like networks with $p = 2$.

Figure 1: The topology of an 8-node de Bruijn Network. Hop distances shown with the figure:

    From \ To   0  1  2  3  4  5  6  7   Mean
    0           -  1  2  2  3  3  3  3   17/7 = 2.43
    1           3  -  1  1  2  2  2  2   13/7 = 1.86
    2           2  2  -  2  1  1  3  3   14/7 = 2.00
    3           3  3  3  -  2  2  1  1   15/7 = 2.14
    4           1  1  2  2  -  3  3  3   15/7 = 2.14
    5           3  3  1  1  2  -  2  2   14/7 = 2.00
    6           2  2  2  2  1  1  -  3   13/7 = 1.86
    7           3  3  3  3  2  2  1  -   17/7 = 2.43
    Avg. = 2.13

Shuffle-like networks based on de Bruijn graphs are defined when $N$ is a power of the connectivity degree $p$. Figure 1 shows a special case of $p = 2$ and $N = 2^3$ in which node($i$) is connected to node($2i \bmod N$) and node($(2i + 1) \bmod N$). In this network, even if the offered traffic is fully symmetric, the link loads are unbalanced [Muk92]. This is due to the existence of self-loops$^{23}$ which carry no traffic. The maximum deflection penalty in a de Bruijn network is $\log_p N + 1$ since a packet can travel back to the point where it was deflected in at most $\log_p N$ steps. Communication networks based on de Bruijn graphs are discussed in [SaP89, EsH85, SiR91].

$^{23}$ Links $0 \to 0$ and $7 \to 7$ in the figure. In general, there are $p$ of them in a $p$-connected network.

Figure 2: The topology of an 8-node shuffle network. Hop distances shown with the figure:

    From \ To   0  1  2  3  4  5  6  7   Mean
    0           -  1  2  2  3  3  2  1   14/7 = 2.00
    1           3  -  1  1  2  2  2  2   13/7 = 1.86
    2           2  2  -  2  1  1  3  3   14/7 = 2.00
    3           2  3  3  -  2  2  1  1   14/7 = 2.00
    4           1  1  2  2  -  3  3  2   14/7 = 2.00
    5           3  3  1  1  2  -  2  2   14/7 = 2.00
    6           2  2  2  2  1  1  -  3   13/7 = 1.86
    7           1  2  3  3  2  2  1  -   14/7 = 2.00
    Avg = 1.96

The configuration shown in figure 2 is discussed in [Max89] and referred to as a shuffle-exchange network (SXN). The self-loops are eliminated by connecting nodes 0 and $N - 1$ to each other. Although this network has the lowest $h$ among the three variations, no efficient routing algorithm that could explore these new connections is available and therefore they are only used in case of deflections or link failures.

Figure 3: The topology of an 8-node ShuffleNet. (a) Hop distances shown with the figure:

    From \ To   0  1  2  3  4  5  6  7   Mean
    0           -  2  2  2  1  1  3  3   14/7 = 2.00
    1           2  -  2  2  3  3  1  1   14/7 = 2.00
    2           2  2  -  2  1  1  3  3   14/7 = 2.00
    3           2  2  2  -  3  3  1  1   14/7 = 2.00
    4           1  1  3  3  -  2  2  2   14/7 = 2.00
    5           3  3  1  1  2  -  2  2   14/7 = 2.00
    6           1  1  3  3  2  2  -  2   14/7 = 2.00
    7           3  3  1  1  2  2  2  -   14/7 = 2.00
    Avg = 2.00

(b) The same network redrawn in columns: nodes 0-3, nodes 4-7, and nodes 0-3 repeated.

Another variation on the shuffle theme was presented in [AKH87] under the name of ShuffleNet and discussed further in [Aca87, HlK88]. In ShuffleNet, $N = k p^k$ nodes are arranged in $k$ columns of $p^k$ nodes. Each node is addressed with its row and column coordinate pair $(r, c)$. Columns are ordered left to right from 0 to $(k - 1)$ and rows are numbered top to bottom from 0 to $(p^k - 1)$. The row coordinates are represented in $p$-ary notation, ($r = r_{k-1} r_{k-2} \cdots r_0$). Accordingly, the $p$ nodes that any given node $(r, c)$ is connected to are identified as follows:

    $((c + 1) \bmod k,\ r_{k-2} r_{k-3} \cdots r_0 0)$
    $((c + 1) \bmod k,\ r_{k-2} r_{k-3} \cdots r_0 1)$
    ...
    $((c + 1) \bmod k,\ r_{k-2} r_{k-3} \cdots r_0 (p - 1))$

Figure 3(a), redrawn in figure 3(b), shows a special case of $p = k = 2$. Note that this connection pattern has no self-loops. Furthermore, the variation in $h_i$ across the nodes is zero since every node has the same number of nodes that lie within a given shortest path distance from it (i.e., there are two nodes reachable in one hop, three nodes reachable in two hops and two nodes reachable in three hops from any node). The edge effects are eliminated by placing the nodes into two groups (shown as columns in figure 3(b)).

The diameter of the network is $D = 2k - 1$ with the deflection penalty of $k$. The average number of hops

is given in [HlK88] as:

$$h = \frac{k p^k (p - 1)(3k - 1) - 2k(p^k - 1)}{2(p - 1)(k p^k - 1)} \qquad (9)$$

For the special case of $p = 2$, the above equation takes the following form [Aca87]:

$$h = \frac{1}{k 2^k} \left[ \sum_{j=1}^{k} j 2^j + \sum_{j=1}^{k-1} (k + j)(2^k - 2^j) \right] \qquad (10)$$

which simplifies to:

$$h = \frac{1}{2^k} \left[ 3(k - 1) 2^{k-1} + 2 \right] \qquad (11)$$

giving the maximum throughput achievable under uniform load as:

$$U = \frac{k 2^{2k+1}}{3(k - 1) 2^{k-1} + 2} \qquad (12)$$

Consequently, the throughput available per user is:

$$U_i = \frac{2^{k+1}}{3(k - 1) 2^{k-1} + 2} \approx \frac{4}{3} \cdot \frac{1}{k - 1}, \qquad i = 1, \ldots, k 2^k \qquad (13)$$

Note that the value of $h$ increases with increasing $k$, impeding the growth rate of the maximum achievable throughput. The 2-column configuration offers the highest throughput rate [HlK88] and, for $k > 2$, no routing algorithm is known that yields a balanced use of links, even for a perfectly balanced load [EiM88]. On the other hand, the increased number of columns increases the number of multiple shortest paths; the destination nodes that are $k$ to $(2k - 1)$ hops away from a given source can be reached via more than one shortest path.$^{24}$ The following table [EiM88] gives the average hop counts and throughput rates for different values of $k$:

    k    N      h        U_i     U
    2    8      2.0      1.0     8.0
    3    24     3.25     0.617   14.8
    4    64     4.625    0.433   27.7
    5    160    6.06     0.33    52.8
    6    384    7.53     0.265   101.9
    7    896    9.09     0.222   198.8
    8    2048   10.51    0.19    389.8
    9    4608   12.001   0.167   767.7

$^{24}$ This property of ShuffleNet is investigated in [Aya89] with the help of signal flow graphs and the results are compared to those obtained for Manhattan-street networks.

Many routing algorithms are applicable to shuffle-like networks [EsH85, SiH88, Sie90, Hui90]. Below we discuss two such algorithms proposed for ShuffleNet.

Under uniform traffic conditions, ShuffleNet performs very well with a simple routing algorithm which uses a single path for every source-destination pair. Given an intermediate node $(\hat{r}, \hat{c})$ and a transient packet addressed to destination $(r^d, c^d)$, the packet is relayed along the output link leading to the following node [KaS91]:

$$\mathrm{node}\left((\hat{c} + 1) \bmod k,\ \hat{r}_{k-2}, \ldots, \hat{r}_0\, r^d_{X-1}\right),$$

where $X$ denotes the number of columns between the current node and the destination and is given by:

$$X = \begin{cases} (k + c^d - \hat{c}) \bmod k & \text{if } c^d \ne \hat{c} \\ k & \text{if } c^d = \hat{c} \end{cases}$$

In [EiM88] it is shown that the allowable throughput per node with the fixed routing algorithm and realistic (nonuniform) traffic patterns is reduced by a factor between 0.3 and 0.5 with respect to that predicted for a uniform load. To reduce this throughput deterioration, an adaptive routing scheme has been proposed in [KaS91]. Following the notation used in the description of the fixed routing algorithm, let $D$ denote the number of columns between the source $(r^s, c^s)$ and the destination $(r^d, c^d)$:

$$D = \begin{cases} (k + c^d - c^s) \bmod k & \text{if } c^d \ne c^s \\ k & \text{if } c^d = c^s \end{cases}$$

If a packet cannot reach its destination in $k$ hops, then the minimum-hop routing path length is $D + k$, regardless of what routing decisions are made in the first $D$ hops. Therefore, if the source and destination are more than $k$ hops apart, the packet can be routed arbitrarily for the first $D$ hops until it reaches column $c^d$ of its destination. Then a single path of length $k$ leads to $(r^d, c^d)$. The routing algorithm requires marking each packet at the source according to the distance to its destination either as M-type (which stands for multiple minimum-hop paths) or as S-type (single minimum-hop path). Clearly, M-type packets require more than $k$ hops to reach their destinations. At each intermediate node the remaining distance is calculated for the packet and its type field is updated if necessary. There are two ways for a packet's type to change. First, a type S packet can be deflected to a longer path, increasing the remaining distance by $k$ hops and changing its type to M. Second, a type M packet can reach the column of its destination and become type S. The test for type can be performed as follows:

    type S if $r^s_{k-1-D} = r^d_{k-1},\ r^s_{k-2-D} = r^d_{k-2},\ \ldots,$ and $r^s_0 = r^d_D$
    type M otherwise.

The routing scheme also employs buffers and deflects each packet at most once. Type M packets are always placed in the shortest queue at a given node. Type S packets are normally routed according to the fixed algorithm. But if the buffer of the preferred outgoing link is full beyond a certain threshold, the packet is placed in the shortest queue with a special flag recorded in its header. A deflected packet is always placed in the appropriate buffer regardless of the threshold. If the buffer is full, then the packet is dropped. Throughput results for networks with sizes comparable to those discussed above are not available.

The throughput versus delay characteristics of de Bruijn networks are compared with those of ShuffleNets in [SiR91]. For a given diameter, a de Bruijn network can support more nodes than its ShuffleNet counterpart, as shown in the following table (for $p = 2$):

    k     N                   D (ShuffleNet)   D (de Bruijn)
    2     2^3 = 8             3                3
    4     2^6 = 64            7                6
    6     6 x 2^6 = 384       11               Not applicable
    8     2^11 = 2048         15               11
    16    2^20 = 1048576      31               20

The increase in the number of stations comes at the expense of some non-uniformity in edge loading under uniform traffic. To illustrate the differences between ShuffleNet and de Bruijn networks, we performed the following calculations on the sample networks shown in figures 1 and 3. First, we listed all shortest paths for every source-destination pair in the network, indicating also the intermediate nodes, e.g., the shortest path from node(0) to node(7) is $(0 \to 1 \to 3 \to 7)$ in the de Bruijn network, whereas ShuffleNet offers two possibilities: $(0 \to 4 \to 1 \to 7)$ and $(0 \to 5 \to 3 \to 7)$. Then we deleted the end nodes from the paths, i.e., nodes 0 and 7 in the example. Our observations can be summarized in the following two points:

1. There are no multiple shortest paths in the de Bruijn network whereas ShuffleNet offers two shortest paths from every node to two destinations out of seven.

2. 56 different shortest paths exist in the de Bruijn network. Due to the availability of multiple shortest paths, this number is 72 for ShuffleNet. For each node, the number of times that the node acted as a relay was determined. The results are listed in the following table:

    Node   de Bruijn   ShuffleNet
    0      0           11
    1      11          11
    2      9           11
    3      11          11
    4      11          11
    5      9           11
    6      11          11
    7      0           11

Note that in the de Bruijn network, node(0) and node(7) never act as relays; therefore, their output links are never used to convey any non-local traffic and are always available to them. The loads of other nodes are also unbalanced [SiR91].

7.3 2-dimensional toroidal networks

A 2-dimensional toroidal network is a rectangular mesh with orthogonal wrap-around connections (see figure 4). The reasons why such a structure is an interesting topology for a communication network are stated in [Rob88] as follows:

1. Addressing and routing is straightforward.

2. The topology is isotropic, i.e., every node has the same set of connections and perceives locally the same topology of the network. Consequently, all nodes can execute exactly the same protocol.

3. The wrap-around connections decrease path lengths and eliminate the edge effects.

4. For a metropolitan network, the topology easily covers a rectangular grid of streets and avenues, i.e., the topology makes sense geographically [Max85].

The torus that provides the topological surface of the network is a manifold, i.e., a finite two-dimensional space without a boundary. Consequently, the network has no boundary and it looks the same from every node. Not every network has this property; the edge effects in de Bruijn networks were discussed in section 7.2.

Figure 4 shows two toroidal mesh networks with the nodal connectivity degree of two. The networks differ

in the orientation of the links. Let us assume for simplicity that the network grid is a square with $n$ rows and $n$ columns (although in general it can be a rectangle). Thus, the total number of nodes $N$ is equal to $n^2$. Every node can be identified by a pair of coordinates $(r, c)$ representing its row and column numbers, respectively. Assume that the rows and columns are numbered from zero up, starting from the left top corner of the grid. The connection rules for the highway-transfer network (HTN) are given as follows:

$$(r, c) \text{ is connected to } \begin{cases} ((r + 1) \bmod n,\ c) \\ (r,\ (c + 1) \bmod n) \end{cases} \qquad (14)$$

Figure 4: Two toroidal meshes: the highway-transfer network (HTN) and the Manhattan-street network (MSN).

Similarly, in a unidirectional Manhattan-street network a node $(r, c)$ is connected to the following nodes:

$$\begin{array}{ll} ((r + 1) \bmod n,\ c) & \text{if } c \text{ is even} \\ ((r - 1) \bmod n,\ c) & \text{if } c \text{ is odd} \\ (r,\ (c + 1) \bmod n) & \text{if } r \text{ is even} \\ (r,\ (c - 1) \bmod n) & \text{if } r \text{ is odd} \end{array} \qquad (15)$$

Bidirectional Manhattan-street networks (BMSN) are also considered; in such a network, every link goes in both directions and every node has four incoming and four outgoing links.

Below we list the interesting numerical properties of square highway-transfer and Manhattan-street networks.

1. $D$: For HTN it is $2(n - 1)$. For MSN, it is $n$ when $n/2$ is odd and $n + 1$ when $n/2$ is even [ChA90]. For BMSN it is $(n - 1)$ when $n$ is odd and $n$ when $n$ is even.

2. $h$: For MSNs of different sizes different formulas are given in [ChA90]. The most succinct form is obtained when $n$ is divisible by 4:

$$h = \frac{\frac{N}{2}(n + 2) - 4}{N - 1} = \frac{n}{2} \cdot \frac{N}{N - 1} + \frac{N - 4}{N - 1} \qquad (16)$$

For BMSN we have ([BoC87, BoC90]):

$$h = \begin{cases} \dfrac{n^3}{2(N - 1)} & \text{when } n \text{ is even} \\ \dfrac{n}{2} & \text{when } n \text{ is odd} \end{cases} \qquad (17)$$

For HTN, the following formula can be derived:

$$h = \frac{\sum_{i=1}^{n-1} i(i + 1) + \sum_{i=1}^{n-1} i(2n - i - 1)}{N - 1} \qquad (18)$$

which simplifies to $h = N/(n + 1) \approx n$.

3. Penalty of deflection: For HTN, it is $n$. For MSN, it is 4, which is constant and independent of the network size. For BMSN, it is 2.

4. Degree of connectivity: It is 2 for MSN and HTN, and 4 for BMSN.

The preference of outgoing links for a packet to be routed can be determined locally by comparing the destination coordinates to the coordinates of the routing node. In many cases, the packet gets closer to the destination by moving along the column as well as the row and the ordering of these steps is not important, e.g., as in the hypercube. Consequently, multiple shortest paths are available and there are many cases when a packet equally prefers two outgoing links. This property makes the networks highly suitable for deflection routing.

7.3.1 Manhattan-street network

The Manhattan-street network (MSN) is a regular two-connected network (figure 4) designed for the metropolitan-area environment [Max85, Max87]. A complete MSN grid must be rectangular and the number of columns and rows must be even. Therefore, the network is not defined for an arbitrary number of nodes. However, a fractional addressing scheme has been proposed that allows an arbitrary number of pairs of rows to be added at any position in the network, as well as a procedure to add one node at a time [Max87]. The latter feature affects the regularity of the network and makes it not always possible to relay packets along the shortest paths to their destinations, at least as long as the routing decisions are based on the local information available at the routing node. In this section, we consider regular networks with even numbers of rows and columns.

The virtue of local routing in a regular MSN is in ranking the outgoing links by comparing the destination address of an incoming packet (the row/column coordinates) with the address of the node making the

routing decision. In [Max87] a precise set of local rules is given which guarantees that the preferred routes determined this way explore all shortest paths to the destination. The rules operate on transformed addresses which are derived from the actual addresses by assuming that the destination is located at the center of the network$^{25}$ and setting its row and column coordinates to zero. Then the relative address of the current node is calculated. Let $m$ be the number of rows and $n$ be the number of columns ($N = mn$). The transformed address $(r, c)$ of a node with actual address $(r_{fr}, c_{fr})$ with respect to the destination node with actual address $(r_{to}, c_{to})$ is:

$$r = \frac{m}{2} - \left\{ \left[ \frac{m}{2} - D_c (r_{fr} - r_{to}) \right] \bmod m \right\}$$

$$c = \frac{n}{2} - \left\{ \left[ \frac{n}{2} - D_r (c_{fr} - c_{to}) \right] \bmod n \right\}$$

where $D_c$ and $D_r$ are either $-1$ or $+1$, depending on the directions of the links. Since the destination lies at the center, the current node has to be located in one of the following sections of the transformed grid:

$$(r, c) \text{ is in } \begin{cases} Q_1 & \text{if } r > 0 \text{ and } c > 0 \\ Q_2 & \text{if } r > 0 \text{ and } c \le 0 \\ Q_3 & \text{if } r \le 0 \text{ and } c \le 0 \\ Q_4 & \text{if } r \le 0 \text{ and } c > 0 \end{cases}$$

It is shown in [Max87] that a fixed simple set of preferences can be associated with every section.

$^{25}$ It cannot be located exactly at the center, but in one of the four central positions.

Manhattan-street networks tend to perform best when the number of rows is the same as the number of columns [MyZ90]. When $N = n^2$, the throughput is limited to $2n^2/(n/2) = 4n$, since $h = n/2$ and there are $2n^2$ links in the network. The performance of MSN is investigated in [GrG86, Max89] and compared to that of SXN. In [Max89], it is shown that the network performs very well under (synchronous) deflection routing. Even without buffering, 55-70% of the maximum theoretically possible throughput (achievable with infinite buffers) can be obtained under uniform load, while the addition of a small one-slot buffer per node increases this figure to 80-90%. A simple flow-control mechanism at the local source preventing the node from transmitting its own packet when both input links are busy is sufficient to guarantee that no packet loss will occur within the network.$^{26}$

$^{26}$ Assuming no hardware malfunctions.

Manhattan-street networks are often mentioned together with shuffle-exchange networks because both concepts are based on simple regular topologies and local routing rules. However, although there exist ways of relaxing the regularity requirement for MSN (and retaining most of the advantageous properties of the network), there are no simple means of doing the same with SXN. Furthermore, one should mention the following general differences in the behavior of MSN and SXN operating under deflection routing.

MSNs respond well to uniform saturation patterns. An oversaturated network (in which every node is constantly ready to transmit its own packet) is able to maintain essentially the same throughput as the maximum achieved below the saturation threshold.$^{27}$ In contrast, oversaturated SXNs tend to choke up with traffic, which results in a severe throughput deterioration. The effect becomes more pronounced as the number of nodes increases, because the deflection penalty in SXN grows with the network size. Consequently, flow-control mechanisms are needed for SXN, but not for MSN, at least as long as the load pattern is not excessively biased.

$^{27}$ The degradation does not exceed 3%.

As a counterbalance to the above disadvantage, for the same number of nodes, the SXN topology offers smaller $h$ than MSN. Besides, $h$ expressed as a function of the number of nodes grows faster in MSN than in SXN. This implies that with perfect routing SXN will tend to achieve a higher maximum throughput per node than MSN. Indeed, this happens when buffers are present (i.e., routing decisions are close to optimal), but without buffers, due to a lower deflection penalty, MSN achieves a higher maximum throughput than SXN.

7.3.2 Highway transfer network

The routing scheme of HTN is suitable for regular rectangular meshes (as shown in figure 4) as well as arbitrary (possibly irregular) topologies [KuY90]. A highway is defined as a collection of adjacent links with each link belonging to only one highway. The flow of packets on a highway is unidirectional. A highway may be loop-shaped (i.e., look like a ring) or consist of an open-ended link (looking like a bus). The routing mechanism favors highway connections over cross-highway connections (figures 5 and 6) and can process packets going along the highways with less overhead.

The network operates in a slotted mode. A slot consists of a header followed by the payload part. The first bit of the header indicates whether the slot is full or empty. If the slot is full, the destination address is included in the header. Each node maintains a set of

routing tables, one table associated with every incoming link. The routing tables are indexed by the destination addresses extracted from slot headers. Each entry contains an outgoing link identifier and a highway indicator. The highway indicator is a binary flag which tells whether the incoming link (on which the slot has arrived) and the outgoing link (on which the slot is to be relayed) belong to the same highway.

Figure 5: Examples of highway and cross-highway connections in htn. (Legend: node, highway connections, cross-highway connections.)

Figure 6: Node configuration in htn. (For each link index 0, 1, ..., k the figure shows an incoming link IL, a shift register SR, a routing table RT, a multiplexer MUX, a buffer BUF, and an outgoing link OL.)

As illustrated in figure 6, each incoming link (il) is connected to a shift register (sr) whose purpose is to buffer the header of an incoming slot before the slot is relayed on one of the outgoing links. The routing tables (rt) assign the incoming slots to the output links (ol). When a slot is to be forwarded on a highway, it is fed directly into the outgoing link (ol). On the other hand, if the slot is to be deflected from one highway to another, its contents are transferred to the fifo buffer (BUF) associated with the outgoing link, and the current slot to be relayed on the highway of the incoming link is marked as empty. With this approach, a slot traveling along a highway is relayed immediately.[fn 28] The buffer associated with an outgoing link is examined whenever the slot to be relayed on the link turns out to be empty. In such a case, the first waiting slot is extracted from the buffer and relayed instead of the empty slot.

The basic assumption of the routing scheme is that packets traveling long-distance to their destinations are relayed in the "highway mode" and experience shorter delays than packets trying to get en route. Therefore, the key to the success of this approach is the proper placement of the highways. The designers stated that the technique was in its primitive stage and the optimal design rules were not clear at the time of writing. It is argued that with the proper organization of the highways this scheme can produce better results than store-and-forward and cut-through switching methods, as observed on a square torus network.

7.3.3 Triangularly arranged network

Another routing strategy that we will briefly discuss here is defined on the so-called Triangularly Arranged Connection Network (tac) and described in [MyZ90]. tac is a 3-connected toroidal mesh in which nodes are located on vertices of equilateral triangles and referenced by unique Cartesian pairs $(x, y)$. The number of nodes needs to be a multiple of 4 in order for the links to be oriented properly; therefore, the network cannot be defined for an arbitrary number of nodes. Similarly to msn, the next hop in the path to a given destination can be found solely by comparing the destination address with the address of the current node (figure 7).

Figure 7: A 4x4 tac network. (Node labels, top row to bottom row: (1,3) (3,3) (5,3) (7,3); (0,2) (2,2) (4,2) (6,2); (1,1) (3,1) (5,1) (7,1); (0,0) (2,0) (4,0) (6,0).)

Before going into the details of the routing algorithm, we should clarify one geometric property of the

[fn 28] With the slight delay needed to examine the contents of its header.
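The slot handling described for figure 6 can be sketched in Python as follows. This is a sketch under stated assumptions, not the design of [KuY90]: the class and attribute names are ours, `highway_out` (the outgoing link that continues the highway of each incoming link) is an assumed representation of the highway layout, every outgoing link is assumed to continue exactly one incoming highway, and delivery of slots to the local host is not modelled.

```python
from collections import deque

EMPTY = None  # marker for an empty slot


class HtnNode:
    """Hypothetical model of one htn node (names are illustrative)."""

    def __init__(self, highway_out, routing_tables):
        # highway_out[il] -> outgoing link continuing the highway of incoming link il
        # routing_tables[il][dest] -> (outgoing link, highway indicator)
        self.highway_out = highway_out
        self.routing_tables = routing_tables
        # one fifo buffer per outgoing link
        self.buffers = {ol: deque() for ol in highway_out.values()}

    def cycle(self, arriving):
        """arriving maps incoming link -> (dest, payload) or EMPTY.

        Returns a map outgoing link -> slot relayed in this cycle.
        """
        outgoing = {}
        for il, slot in arriving.items():
            hw_link = self.highway_out[il]
            if slot is EMPTY:
                outgoing[hw_link] = EMPTY
                continue
            dest, _payload = slot
            out_link, same_highway = self.routing_tables[il][dest]
            if same_highway:
                # highway mode: the slot is fed directly into the outgoing link
                outgoing[out_link] = slot
            else:
                # cross-highway: queue the contents, free the highway slot
                self.buffers[out_link].append(slot)
                outgoing[hw_link] = EMPTY
        # an empty slot on a link is replaced by the first waiting slot, if any
        for ol in list(outgoing):
            if outgoing[ol] is EMPTY and self.buffers[ol]:
                outgoing[ol] = self.buffers[ol].popleft()
        return outgoing


node = HtnNode(
    highway_out={0: 0, 1: 1},
    routing_tables={
        0: {"A": (0, True), "B": (1, False)},
        1: {"A": (0, False), "B": (1, True)},
    },
)
first = node.cycle({0: ("B", "p1"), 1: ("B", "p2")})
second = node.cycle({0: EMPTY, 1: EMPTY})
```

In this example the slot arriving on link 0 has to change highways, so it waits in the buffer of link 1 while the highway slot from link 1 passes through; it is released in the next cycle, when the slot on link 1 is empty.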

network. In figure 7, the triangle formed by nodes $(3, 1)$, $(5, 1)$ and $(4, 2)$ is supposed to be equilateral. To interpret the network geometrically, it is assumed that the horizontal coordinates of the nodes represent correctly Cartesian distances along the horizontal axis, but the vertical coordinates are scaled properly to retain the Euclidean proportions of equilateral triangles. This means in particular that the difference between the y-coordinates of nodes $(3, 1)$ and $(4, 2)$ is not 1 but $\sqrt{3}$. Consequently, in the following algorithm, y-coordinates of the nodes are multiplied by $\sqrt{3}$ when they are used to calculate a distance or a vector magnitude. The distance between two nodes ($d = \sqrt{(x_1 - x_2)^2 + (y_1 - y_2)^2}$) and the magnitude of a vector ($m = \sqrt{x^2 + y^2}$) can both be evaluated and interpreted in the standard Euclidean sense.

Due to the orientation of the triangular links, there appear to be eight different combinations of output link directions, which are denoted by three digits. The binary numbers are chosen in such a way that each bit represents a particular line, and a 1 in any position implies a rightward pointing arrow.[fn 29] The first bit position denotes the main diagonal, the second the off-diagonal and the third represents the horizontal line.

The routing algorithm is executed in three steps:

Step 1: Suppose that a packet is to be relayed by node $(5, 1)$ on its way to node $(0, 2)$. This routing problem is illustrated in figure 8.[fn 30] The relative coordinates of the destination are calculated as $(x_d - x_s, y_d - y_s) = (-5, 1)$. That reads: the x-coordinate of the destination is 5 units less than the x-coordinate of the current node (hence the minus sign) and the y-coordinate of the destination is 1 unit greater than that of the routing node. In other words, the minus sign of the x-coordinate of the destination shows that the destination lies 5 units to the left of the current node. By the same token, it is located 1 unit up.

Step 2: Owing to the torus structure, the destination can be reached via four different paths. These alternatives and the distances involved are:

1. Route 1: $(-5, 1)$, i.e., 5 units left and 1 unit up from the current coordinate, has the distance of $\sqrt{28}$.[fn 31]
2. Route 2: $(-5, -3)$, i.e., 5 units left and 3 units down from the current coordinate, has the distance of $\sqrt{52}$.
3. Route 3: $(3, 1)$, i.e., 3 units right and 1 unit up from the current coordinate, has the distance of $\sqrt{12}$.
4. Route 4: $(3, -3)$, i.e., 3 units right and 3 units down from the current coordinate, has the distance of $\sqrt{36}$.

Note that the distance calculations have nothing to do with the actual outgoing connections available at the routing node. The aim of this step is simply to determine the proper direction in which the packet should be forwarded. These routes and the distance involved in each case are depicted in figure 9. The minimum direct Euclidean distance is offered by route 3.

Figure 8: tac routing example: step 1.

Figure 9: tac routing example: step 2. (The four alternatives are labelled Route 1 to Route 4.)

Step 3: No physical connection exists at node $(5, 1)$ that could take the packet 3 units right and 1 unit up. Consequently, the algorithm tries to choose the link that can cover as much distance as possible in the indicated direction. The minimum distance vector calculated in the previous step enables the algorithm to choose between the three outgoing links available. The best choice is the direction along the main diagonal, leading the packet to node $(6, 0)$ (figure 10). A similar operation will be performed at every intermediate node until the packet arrives at its destination.

[fn 29] Note that there are no arrows that would point strictly up or down.
[fn 30] To account for the torus structure, the destination node is redrawn at the right of the figure.
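Steps 1 to 3 of the example can be sketched in Python as follows. This is an illustration only and not the algorithm as published in [MyZ90]: the torus periods (8 in x, 4 in y) are read off figure 7, the three link offsets at node (5, 1) are inferred from the node labels of figure 10, and `next_link` interprets "cover as much distance as possible in the indicated direction" as maximizing the projection of the link onto the minimum-distance vector. All function names are ours.

```python
SCALE = 3  # y-coordinates are multiplied by sqrt(3), so y^2 terms are multiplied by 3


def candidate_routes(src, dst, x_period=8, y_period=4):
    """Steps 1 and 2: relative coordinates and their wrap-around alternatives."""
    dx, dy = dst[0] - src[0], dst[1] - src[1]
    # going the other way around the torus in each dimension
    dx_wrap = dx - x_period if dx > 0 else dx + x_period
    dy_wrap = dy - y_period if dy > 0 else dy + y_period
    return [(dx, dy), (dx, dy_wrap), (dx_wrap, dy), (dx_wrap, dy_wrap)]


def squared_distance(vec):
    """Squared Euclidean length with the vertical axis scaled by sqrt(3)."""
    return vec[0] ** 2 + SCALE * vec[1] ** 2


def best_route(src, dst):
    """Step 2: the alternative with the minimum Euclidean distance."""
    return min(candidate_routes(src, dst), key=squared_distance)


def next_link(src, dst, link_offsets):
    """Step 3 (our interpretation): pick the link with the largest projection
    onto the minimum-distance vector. All tac links have the same length,
    so the dot product does not need to be normalized."""
    tx, ty = best_route(src, dst)

    def progress(offset):
        return offset[0] * tx + SCALE * offset[1] * ty

    return max(link_offsets, key=progress)


routes = candidate_routes((5, 1), (0, 2))
print([squared_distance(r) for r in routes])     # [28, 52, 12, 36]
print(best_route((5, 1), (0, 2)))                # (3, 1)
links_at_5_1 = [(-2, 0), (-1, -1), (1, -1)]      # assumed: to (3,1), (4,0), (6,0)
print(next_link((5, 1), (0, 2), links_at_5_1))   # (1, -1), i.e. towards node (6, 0)
```

The squared distances reproduce the values 28, 52, 12 and 36 listed for routes 1 to 4, and the selected offset leads to node (6, 0), in agreement with step 3.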
In a simulation experiment in which packets that have not reached their destination after 7 hops on

[fn 31] All units in the vertical direction have been multiplied by $\sqrt{3}$.

a 4x4 tac network were dropped, 90% of all packets were able to make it to the destinations while queue lengths remained acceptable (less than 1/8 of the packets had to wait in queues).

Figure 10: tac routing example: step 3.

Notably, the cost of implementing a tac network is higher than that of an msn with the same number of nodes: more links must be laid and the individual nodes are more complex and thus more expensive. The designers argue that the actual incremental cost may be small enough to be justified by the potential increase in performance. The table below compares the data path length in msn and tac networks with the same number of nodes.

Network   Number of links    Avg. Hops      Max. Hops
Size       msn     tac       msn    tac     msn   tac
4x4         32      48       2.93   2.18      5     4
8x8        128     192       5.02   3.76      9     6
12x12      288     432       7.02   5.32     13    10
16x16      512     768       9.02   6.89     17    12

These results are open to criticism. It seems to us that the better performance of tac networks is related to the increased connectivity rather than to the advantages of the triangular topology over the msn grid claimed in [MyZ90]. For example, we find the maximum hop count to be 6 on an 8x8 msn when we increase the connectivity of a node from 2 to 3. Moreover, the tac routing algorithm introduces considerably more processing overhead when compared to that of msn.

7.3.4 Bidirectional toroidal networks

The networks discussed in this section are based on rectangular toroidal grids (as in figure 4) with bidirectional connections. The total number of links is twice as much as in a unidirectional msn with a given number of nodes. The maximum throughput of this topology (assuming a square grid with $n \times n$ nodes) is limited by $4n^2/(n/2) = 8n$, under a uniform traffic pattern and with an unlimited buffer space available at the nodes.

A number of routing protocols have been suggested for this connection pattern. Differences between them lie in the way of resolving contentions. Although the number of outgoing links per node is increased only twofold with respect to the unidirectional network, the complexity of the routing problem (contention resolution) is multiplied by a much larger factor. In a unidirectional network, two incoming packets can be assigned to two outgoing links in two ways, whereas the corresponding number for the bidirectional case is $4! = 24$.

A bidirectional msn with local deflection routing offers a slightly shorter $h$ and a smaller deflection penalty than its unidirectional counterpart. Consequently, the maximum throughput achieved by the bidirectional variant of the network is more than twice as high as for the unidirectional version.[fn 32] Thus, one can say that it pays to multiply the number of links by two, as the gain is a more than twofold improvement in throughput. On the other hand, the routing operation becomes significantly more complicated.

hr$^4$-net [BoC87] employs a routing scheme which organizes the network into two different ring structures at two levels. Low-level rings (streets) are connected at each node by high-level rings (avenues). Each low-level ring (L-ring) is identified by its specific address which is used as routing information. Each packet travels on a high-level ring (H-ring) until it reaches a node which is located on the same L-ring as its destination. The routing decisions are precomputed and stored in a rom which is indexed according to the preference of the incoming packets ($3^4$ possible cases: empty, H-ring, L-ring) and link availability ($2^4$ possibilities). The information retrieved from the rom gives the allocation of the outgoing links to the incoming packets which maximizes the number of packets satisfied with their routes. In any case, the number of misdirected packets is not greater than two and its average[fn 33] is 0.75 when 4 incoming packets are always present. The routing mechanism does not attempt to differentiate between two horizontal and vertical directions. Consequently, $h = n^2/(n+1)$ and the maximum throughput is bounded by $U = 4(n + 1)$, which is of the same order as in a unidirectional msn. In [BoC87], the loss of efficiency is justified with the simplicity of implementation. A shortest-path routing scheme for hr$^4$-net can be found in [WoK90]. It is a combination of the ideas explored in msn (transformed ad-

[fn 32] When the network operates without buffers, the improvement is of order 2.5 to 3, depending on the total number of nodes.
[fn 33] Under uniform load.
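The throughput bounds quoted for msn, for the bidirectional toroidal grid, and for hr4-net all follow the same arithmetic: the number of links divided by the mean hop count $h$. The short Python sketch below merely restates that arithmetic for the three cases; the function names are ours, and the link count of $4n^2$ used for hr4-net is inferred from the stated values of $h$ and $U$ rather than quoted from [BoC87].

```python
from fractions import Fraction


def throughput_bound(links, mean_hops):
    """Upper bound on throughput: number of links / mean hop count."""
    return Fraction(links) / Fraction(mean_hops)


def msn_bound(n):
    # unidirectional msn: 2n^2 links, h = n/2
    return throughput_bound(2 * n * n, Fraction(n, 2))


def bidirectional_bound(n):
    # bidirectional toroidal grid: 4n^2 links, h = n/2
    return throughput_bound(4 * n * n, Fraction(n, 2))


def hr4_bound(n):
    # hr4-net: 4n^2 links (inferred), h = n^2/(n+1)
    return throughput_bound(4 * n * n, Fraction(n * n, n + 1))


for n in (4, 8, 16):
    print(n, msn_bound(n), bidirectional_bound(n), hr4_bound(n))
```

For every $n$ the three functions reduce to $4n$, $8n$ and $4(n+1)$, respectively, which makes it easy to see that the hr4-net bound stays close to that of the unidirectional msn despite the doubled number of links.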

dresses and grid sections with fixed preference rules) and the original hr$^4$-net (104 precomputed switching decisions).

Another class of routing schemes applicable to bidirectional toroidal networks has found its place in signet (Slotted Interconnected-Grid Network) [ToB90]. Routing in signet makes use of the concept referred to as the preference vector (pv). It is an ordered list which indicates the preferred links for a packet, given the quadrant that contains the packet's destination. Some routing algorithms look at the number of rows and columns separating the packet from the destination. Six different routing mechanisms are discussed in [ToB90]. The first classification is made according to the setting of pv, namely in conformance with orthogonal (O) or diagonal (D) routing. With orthogonal routing, the packet moves in the row direction first, and then, when it hits the column of its destination, it moves along that column. Note that the same approach is used in hr$^4$-net and htn. The second algorithm attempts to route packets along a stepwise diagonal from source to destination. Thus, if the number of columns separating the packet from its destination is larger than the number of rows, the packet will be moved along the row and vice versa. A variation of diagonal routing is also suggested in [BaP89] under the name of $Z^2$ (Zig-Zag Routing) policy.

After the preferences of all incoming packets have been determined, contention is resolved according to the priorities of the contending packets. These priorities can be assigned to the packets in three different ways. The S algorithm looks at the so-called secondary counter which gives the minimum of the number of rows and columns separating the packet from the destination. The smaller the value of the secondary counter, the higher the packet's priority. The D algorithm determines the number of hops the packet must still travel, assuming that it will move along the shortest path to its destination. Packets that are closer to their destinations are privileged over those that have a longer remaining distance to travel. Finally, the R algorithm assigns priorities at random. Each of the two routing mechanisms can be combined with any of the three priority schemes, which results in six routing protocols denoted OS, OR, OD, DS, DR, DD. Simulation studies show that under uniform traffic, and for a variety of network sizes, the OS algorithm achieves the best overall performance in terms of the average delay and maximum throughput.

In [BoC90], a locally-optimal routing algorithm is proposed for bmsn. Given a collection of packets arriving at a node within one cycle, the routing decision maximizes the reduction of the weighted cumulative distance of these packets to their destinations. With this approach, packets that are closer to their destinations have a higher priority in contention for their preferred outgoing links. The authors arrive at their routing criteria by noticing that when a packet must be deflected, it should be deflected towards the node from which it will prefer the maximum number of outgoing links. Consequently, deflections are forced towards the topological antipode of the destination because the antipode (if it exists) is characterized by the property that all its outgoing links are equally preferred by the packet. Generally, a node lying along the direction from the destination to its antipode offers the maximum number of preferred outgoing links from among all nodes to which the packet can possibly be deflected.

7.4 Flooding networks

In this section we discuss a few networking solutions based on flooding. The first three concepts, Flooding Sink, Arbnet+ and Noahnet, have been designed for small to medium size local-area networks and, for the reasons that will become clear, are not suitable for metropolitan or wide area environments. They should be considered as alternatives to Ethernet or other networks in which all stations are "flooded" with every single transmission. The fourth solution presented in this section is less limited in scope, but it suffers from an unbalanced use of links, especially under heterogeneous traffic patterns.

7.4.1 Flooding sink

As stated in section 2.2.2, the most severe problem with flooding is the presence of many unneeded replicas of the same packet. In Flooding Sink [HPU86], this problem is avoided at the switch level. Every node remembers the identifiers of the last-forwarded 255 packets. When a packet arrives at the node and its identifier occurs in the list, the packet is ignored.

A Flooding Sink node is interfaced to the same number of input and output links. It receives packets from other nodes in the network and from its host. Each packet has a header containing the source address, the destination address, and a serial number. Within a certain time frame (dependent on the propagation diameter of the network), the pair (source address, serial number) uniquely identifies a packet and is referred to as the packet identifier. Any packet arriving on one of the input links is considered old if

the node remembers its identifier. Such a packet is discarded and it doesn't propagate beyond the node. Otherwise, the packet is simply relayed on all outgoing links.

Technically, the above procedure involves buffering transient packets at the node. Every incoming link is connected to an input buffer. After a packet has been received completely, the node verifies its validity (examines its crc code) and then passes the packet's identifier to the eliminator which tries to match it against one of the remembered identifiers. The storage for identifiers is associative and this part of the routing operation takes little time. If the packet is old, it is simply ignored. Otherwise, the pointers to the start and the end of the packet in the buffer are stored for processing. The destination address of the packet is compared to the address of the current node and the result of this comparison is stored together with the pointers. This way the node will know whether it is supposed to receive the packet (and pass it to the host) or relay it on the outgoing links.

7.4.2 Arbnet+

Arbnet+ [Pun89, Pun90] is built of switches connected by bidirectional point-to-point links. User devices (hosts) are connected to the network via Network Interface Units (nius) that can also be used as multiplexers to support multiple user devices attached to the same switch. The switches are responsible for routing packets and the interswitch routing protocol is topologically indifferent.

The access protocol of Arbnet+ is very similar to the ieee 802.3 csma-cd medium access control layer. When an niu has a frame to transmit, it listens to the link that interfaces it to the switch. If no activity is detected in the link, the niu transmits the frame after an inter-frame delay and continues monitoring the link to detect a possible collision. If a collision occurs during transmission, the niu stops the transmission immediately and emits a short jamming signal. Then it enters a random back-off mode. Otherwise the transmission will continue until the end of the frame has been reached. The niu is then ready to transmit another frame or receive a frame from the network. The reception begins with the detection of an activity in the link. As soon as the destination address is recognized in the arriving frame, the niu compares it to its own address. If no match occurs, the niu emits a jamming signal into the link. Otherwise it passes the packet to the host.

The switch is in fact a simple transceiver. The arbitration of contention at the switch is based on the principle of first-come-first-served with blocking. The winning frame is repeated on all free output ports on the fly. In case of a collision, the switch stops the transmission and sends a Clear Link Signal (cls) on the link on which the collision has occurred. All elements of this protocol are performed by hardware.

A collision in Arbnet+ can be of one of two types: unintentional or intentional. An unintentional collision occurs when two frames travel on the same link in opposite directions. An intentional collision is forced by a switch that cannot accept a data frame for routing. In that case, the switch jams the incoming signal by sending a cls.

A switch can be in one of three states: idle, routing, or transmitting. It enters the routing state upon detecting an incoming signal at one of its links. Subsequently, it repeats the incoming frame on all its free links with practically no delay. A frame that arrives at the switch when it is already in the routing state, or when no free link is available, will not be repeated and a cls will be inserted into the link of arrival.

Arbnet+'s routing mechanism is based on a simple shortest-path tree search technique. With each transfer attempt, the protocol propagates the packet along a tree rooted at the source niu. A leaf collapses when a retransmission at that leaf is blocked, i.e., jammed by a cls. When a switch detects that all its retransmissions have been blocked, it sends a cls up the tree (the clear backward operation) to collapse the branch connecting it to the uplink neighbor. Note that after some time, only the path connecting the source to the destination will remain active[fn 34] and all the other branches of the network tree will have collapsed. When the source receives a cls from all its links, it will conclude that the packet didn't make it. Such a packet will have to be retransmitted after a randomized back-off delay to avoid repetitive contention patterns (lockout).

To prevent looping of frames within the network and to properly recognize the success/failure of its transfer attempt, the root should not exhaust its transmission before a possible cls indicating the collapse of the most distant leaf is given a chance to arrive at the switch. This implies a minimum frame size constraint for Arbnet+. The shortest frame should be longer than the longest possible round-trip delay between a pair of nius in the network. Shorter frames are reported to be eventually absorbed by the network due to the collisions at the expense of some degradation in performance. Arbnet+ offers lower delays and

[fn 34] Assuming that all the switches along it are available and the destination is willing to accept the packet.
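The duplicate elimination performed by a Flooding Sink node (section 7.4.1) can be sketched in Python as follows. The class and method names are ours; the sketch replaces the associative identifier store and the buffer pointers of [HPU86] with an ordered dictionary, and it assumes that the oldest identifier is forgotten first and that packets delivered to the host are remembered as well.

```python
from collections import OrderedDict


class FloodingSinkNode:
    """Hypothetical model of the eliminator of one Flooding Sink node."""

    def __init__(self, address, capacity=255):
        self.address = address
        self.capacity = capacity      # number of remembered identifiers
        self.seen = OrderedDict()     # identifiers in order of arrival

    def handle(self, source, serial, destination):
        """Return what the node does with a packet: discard, deliver or relay."""
        ident = (source, serial)      # the packet identifier
        if ident in self.seen:
            return "discard"          # old packet: it does not propagate further
        self.seen[ident] = True
        if len(self.seen) > self.capacity:
            self.seen.popitem(last=False)   # forget the oldest identifier
        if destination == self.address:
            return "deliver"          # pass the packet to the host
        return "relay"                # repeat it on all outgoing links


node = FloodingSinkNode("n1", capacity=2)
actions = [
    node.handle("a", 1, "z"),
    node.handle("a", 1, "z"),   # replica of the first packet
    node.handle("a", 2, "n1"),
    node.handle("b", 1, "z"),   # pushes identifier ("a", 1) out of memory
    node.handle("a", 1, "z"),   # no longer recognized as old
]
print(actions)
```

The last call shows why the identifier is unique only "within a certain time frame": once an identifier has been pushed out of the bounded memory, a late replica would be relayed again.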

better throughput than Ethernet (with the same number of stations), because it tends to isolate the amount of network resources needed for a packet transmission to the shortest path in the network graph. However, this procedure is not instantaneous and its duration is proportional to the propagation diameter of the network. Consequently, the improvement is only visible when packets are long and the network is neither very large nor fast.

7.4.3 Noahnet

Noahnet is a lan architecture implemented at the University of Delaware [FaP86]. The network operates on a randomly-connected graph topology, uses a flooding protocol to route packets, and is intended for high-bandwidth media.

The network handles three types of packets: data, status, and command. Data packets carry the actual information exchanged among hosts. Status packets are used for two purposes: as acknowledgements, i.e., to indicate whether a data packet has arrived at a node in good shape, and to indicate the flood status of a downstream node. The flood status can be: forwarding, blocked, or got to destination (gtd). At the time of writing, the only implemented command was stop flooding. All status messages are transmitted by a downstream node to its immediate upstream neighbor, whereas command messages are transmitted in the opposite direction.

A switch that receives a packet tries to relay it to all unoccupied adjacent nodes. The adjacent nodes repeat the same process until the packet either reaches its destination or cannot be forwarded any further. A forwarding switch receives flood status packets from all its downstream nodes and, based on these packets, sends one status packet to its immediate upstream neighbor. As in Arbnet+, the path of the data packet forms a tree rooted at the source node. To speed up the operation of isolating the path to the destination and releasing the resources not needed for the transfer, the network carries out this procedure from both ends, i.e., from the leaves to the root as well as from the root to the leaves. A status packet that says "blocked" is interpreted in the same way as a collision in Arbnet+. Such a packet is sent by a collapsing leaf and it propagates upward to collapse the branch leading from the leaf to the root. A "stop flooding" command can be sent down the link to release switches downward from the top of various branches.

7.4.4 Controlled flooding

Controlled Flooding, introduced in [LeR90] and investigated further in [ANR92], is a technique for reducing the total length of the path traveled by the multiple copies of every packet inserted into the network. The extent of flooding is limited by assigning a cost to every link traversal and allowing every packet a limited credit for traversing links. Links are treated as toll highways or bridges: every packet passing through a link must pay a toll. When a packet is launched at the source, it is assigned a numerical value which represents its credit or wealth. In order to traverse a link, the packet must possess a wealth which is at least equal to the cost of the link. The cost of a traversed link is deducted from the wealth of the packet. Therefore, at every intermediate node, the packet is repeated only on the links that it can afford. Finally, when its wealth is reduced below the cost of the cheapest link, the packet is discarded. Note that to route packets every node must only know the costs of its outgoing links. No routing tables are necessary.

The costs assigned to the links and the wealth assigned to the packets control the scope of flooding. Different patterns of allocating costs to links and wealth to packets result in different routing schemes. A heuristic algorithm for assigning costs to links is given in [LeR90]. The objective of that algorithm is to minimize the number of nodes that unnecessarily receive packets addressed to some other regions of the network. A later study [ANR92] claims that the proposed scheme yields an unbalanced resource usage and compares it to two other routing algorithms which choose routes along breadth-first search trees and shortest paths.

8 Conclusions

Gigabit transmission rates bring forward their specific issues influencing the design of the network hardware and software. The most drastic difference with respect to a slow network is the inflation of the network's apparent size. Even a geographically small network operating at a very high transmission rate looks large because the path between a pair of antipodal nodes can contain a substantial number of bits (and packets) at a time. This phenomenon alone is a strong indication that locality should be the predominant premise of a high-speed networking solution. Complicated routing schemes that with every decision try to comfort every remote node in the network are bound to fail because they cannot respond

in a reasonable time to the feedback received across the large apparent network diameter. Similarly, "obsessive" congestion-control techniques that try to account for global changes in the traffic offered to the network will not work because they will tend to react to events that happened long ago and are no longer relevant.

Switched networks built of meshes of nodes interconnected via point-to-point channels scale better than busses, rings, or stars, but in contrast to those simple unidimensional topologies, they require non-trivial routing algorithms and, arguably, congestion control. Among the large number of routing schemes proposed for the gigabit range, the competition is between point-to-point store-and-forward techniques and deflection routing. Flooding-based solutions do not seem to offer a viable alternative as their performance is severely impaired by the multiple copies of every packet propagating out of their "legitimate" way. All techniques intended to contain flooding are ineffective when the network appears large and the length of a typical packet is much shorter than the network's diameter.

The store-and-forward approach in which packets addressed from a given source to a given destination are always routed along the same path has the advantage of delivering packets "in order." In contrast, deflection routing is tainted with the non-trivial reassembly problem: packets may arrive out of order and the destination must rearrange them upon arrival into the original message. Many people perceive this disadvantage as a disqualifying flaw in the whole concept of deflection routing. We believe that deflection routing with its simplicity and self-adaptation to varying loads is too attractive to be rejected that easily.

One can argue that in many cases the need for the preservation of packet ordering at the destination is only apparent and could be relaxed without affecting the integrity of data transfer. Consider for example a file transmission across the Internet. Most people would view this operation as a typical connection-oriented scenario in which the preservation of packet ordering is absolutely critical. Note, however, that the operating systems of the hosts involved in the data transfer perceive the file as a random collection of pages that just appear ordered because the user (or the transport layer) wants to view them that way. If the transfer protocol could be aware that the file consists of individual fragments scattered over a disk, it could transfer these fragments independently (together with their relative locations in the file) and the destination could store them on its disk as they arrive. Even better, the source could actually "optimize" the transfer by selecting the pages in the order that would minimize the total time needed to read them from the disk.

If we look carefully at those communication scenarios that appear to require the preservation of packet ordering, we will see that most of them fit into the following categories:

1. Scenarios that in fact could be carried out with packets arriving in any order (like the file transfer case discussed above). They enforce packet ordering because higher protocol layers view them (unnecessarily) as stream-oriented sequential scenarios.

2. Scenarios involving relatively short transfers (e.g., a piece of text to appear on a screen). Messages of this sort can be safely reassembled in a small buffer space at the destination.

3. Long sustained continuous transfers that actually require packets to arrive in order (e.g., video, voice). Such scenarios typically admit a certain packet loss rate. Consequently, one can implement them with a limited reassembly buffer, dropping packets that arrive out of sequence while the buffer is full. Note that store-and-forward techniques generally do not guarantee packet delivery in this scenario either, due to the jitter (even though the packets arrive in order, some of them may be too late to be useful).

One can identify a number of avenues for future research aimed at making deflection-routing schemes more attractive. In particular, the impact of deflections (and the packets arriving out of order) on the network performance under synchronous traffic scenarios should be investigated carefully and classified depending on the topology, the routing scheme, and the amount of buffer space available at a node. The average performance of many deflection schemes tends to improve drastically when small buffers are present at nodes. Perhaps one can arrive this way at a compromise between store-and-forward and deflection routing which will retain the simplicity of the latter and some desirable properties of the former.

The present authors are aware of deflection schemes that limit the number of hops traveled by a packet on its way to the destination without losing packets or negotiating resources across the network.[fn 35] With such schemes, it may be possible to limit the size of a reassembly buffer at the destination and accommodate

[fn 35] Ongoing research.

28
continuous stream-type traffic of any duration without losing a single packet.

Another important issue is the mapping of virtual topologies onto real topologies. The backbones of future networks spanning metropolitan and larger areas will often consist of virtual links built on top of some generally available networking services, e.g., ATM. The right way of embedding a virtual network into a collection of such links should account for some specific properties of these links, e.g., the variability of delays and available bandwidth, the grade of bandwidth allocation, the cost of tearing down a link during idle periods and setting it up again.

Synchronous deflection schemes suffer from the problem of slot alignment: all slots expected to arrive at a switch during one routing cycle must appear to have arrived simultaneously. In a network consisting of a moderate number of nodes, this problem can be solved by using alignment buffers that absorb slight irregularities in the arrival rate of slots from different links. When the number of stations is large, this simple approach may be ineffective. Then one can consider using node-to-node backpressure mechanisms indicating those irregularities to the neighbors. Alternatively, one can think of deflection schemes that operate correctly (without losing slots) despite occasional alignment problems. Ultimately, it is possible to switch to an asynchronous routing scheme which may provide a reasonable service when the connectivity degree of the network is not too small.

It is not clear how much is gained by insisting on the regularity of the network topology. With a regular topology, routing seems simpler because it can be performed locally without resorting to a routing table. But it seems unlikely that realistic networks are going to be perfectly regular. Optimal routing algorithms based on local table-free rules tend to be tricky and complex. Moreover, when the topology is even slightly irregular they either break down or become more complex and suboptimal. Perhaps the cost of using fixed routing tables describing exactly the network topology (which can be highly irregular) will be paid off by the real simplicity of the routing rules and the flexibility of the network structure.
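The closing remark on fixed routing tables can be made concrete with a small sketch. It is only an illustration under stated assumptions, not a scheme taken from the paper: the function names (`distances_to`, `build_tables`, `route_slot`), the five-node irregular topology, and the tie-breaking order are all hypothetical. Each node is assumed to have as many outgoing as incoming links, so every packet arriving in a slot can be forwarded, and packets addressed to the node itself are assumed to have been absorbed before routing.

```python
from collections import deque


def distances_to(dest, links):
    """Hop count from every node to dest, found by BFS over reversed links."""
    reverse = {n: [] for n in links}
    for u, outs in links.items():
        for v in outs:
            reverse[v].append(u)
    dist = {dest: 0}
    queue = deque([dest])
    while queue:
        v = queue.popleft()
        for u in reverse[v]:
            if u not in dist:
                dist[u] = dist[v] + 1
                queue.append(u)
    return dist


def build_tables(links):
    """table[u][d] lists the neighbours of u that lie on a shortest path to d."""
    table = {u: {} for u in links}
    for d in links:
        dist = distances_to(d, links)
        for u, outs in links.items():
            if u == d or u not in dist:
                continue
            table[u][d] = [v for v in outs if dist.get(v) == dist[u] - 1]
    return table


def route_slot(node, destinations, links, table):
    """Pick one output link per packet for a single routing cycle.

    Packets are served in list order (an arbitrary, assumed priority rule).
    A packet whose preferred links are all taken is deflected to a free link.
    """
    free = list(links[node])
    if len(destinations) > len(free):
        raise ValueError("more packets than output links")
    assignment = [None] * len(destinations)
    # First pass: hand out preferred (shortest-path) links while they are free.
    for i, d in enumerate(destinations):
        for v in table[node].get(d, []):
            if v in free:
                free.remove(v)
                assignment[i] = v
                break
    # Second pass: deflect the losers onto whatever links remain.
    for i in range(len(destinations)):
        if assignment[i] is None:
            assignment[i] = free.pop(0)
    return assignment


# Hypothetical irregular topology: node degrees are 2, 2, 3, 2 and 1.
links = {
    "A": ["B", "C"],
    "B": ["A", "C"],
    "C": ["A", "B", "D"],
    "D": ["C", "E"],
    "E": ["D"],
}
table = build_tables(links)

# Two packets at C both prefer the link to D; the second one is deflected to A.
print(route_slot("C", ["E", "D"], links, table))  # -> ['D', 'A']
```

The routing rule itself stays trivial (a table lookup followed by a contention pass), and the irregularity of the topology is confined to the table construction, which is the trade-off the text suggests may be worth paying for.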