Conventional Ethernet protocols struggle to meet the scalability and performance requirements of data centers. Viable replacements have been proposed for data center Ethernet (DCE): link-layer multipathing (MP) is deployed to replace the spanning tree protocol (STP) and thus improve network throughput, and end-to-end link-layer congestion control (CC) is proposed to better guarantee loss-free frame delivery for Ethernet. However, little work has been done to incorporate MP and CC into a more comprehensive solution for DCE. In this paper, we propose a two-tier solution by integrating our dynamic load balancing multipathing (DLBMP) scheme with CC. Instead of using two separate parameters, i.e., path load and buffer level, to trigger MP and CC, our solution only needs to monitor the path load metric to manage MP and CC in an integrated way. Different from a standalone CC mechanism, which generates congestion notifications from the network core, our integrated CC makes use of link load information in access switches, which directly inform sources to control their traffic admission. To minimize overhead and accelerate updates, SDN techniques are employed in our implementation, which decouples routing intelligence from data transmission. Hence, data sources can react more rapidly to congestion, and the network can guarantee loss-free delivery. In addition, our MP scheme is further improved by introducing application-layer flow differentiation. With such a fine flow differentiation (FFD) mechanism, traffic can be more evenly distributed along multiple paths, resulting in better bandwidth utilization. Simulation results show that our combined solution can further improve network throughput with the FFD mechanism and guarantee loss-free delivery with the integrated CC.
Index Terms—Data center Ethernet, dynamic multipath, load balancing, rate control.
The purpose of maintaining such consistent routeID information among all switches is that a switch can uniquely map a routeID to a route recorded in its routingTable. When a data frame enters the network, the access switch chooses one route to forward the frame and attaches the routeID of the chosen route in the frame header. When a non-access switch receives the frame, it simply forwards the frame to its next hop according to the routeID by consulting its routingTable. When the corresponding egress access switch receives the frame, it checks its MAC association table and forwards the frame to the destined end station through the correct Ethernet port.
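To make the two forwarding roles concrete, the following minimal Python sketch (with illustrative names, not taken from the paper) shows how a non-access switch resolves a next hop purely from the routeID, while the egress access switch falls back to its MAC association table:

from dataclasses import dataclass

@dataclass
class Frame:
    route_id: int          # routeID attached by the ingress access switch
    dst_mac: str           # destination end-station MAC address
    payload: bytes = b""

class NonAccessSwitch:
    def __init__(self, routing_table):
        # routingTable: routeID -> outgoing port toward the next hop
        self.routing_table = routing_table

    def forward(self, frame):
        # A non-access switch only consults the routeID; it keeps no flow state.
        return self.routing_table[frame.route_id]

class EgressAccessSwitch:
    def __init__(self, mac_table):
        # MAC association table: end-station MAC -> local Ethernet port
        self.mac_table = mac_table

    def deliver(self, frame):
        # Deliver the frame on the port that reaches the destined end station.
        return self.mac_table[frame.dst_mac]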
Fig. 4. Illustrations of path load updating.

B. Path Load Updating

In our design, the path loads of routes are updated in two phases. In the first phase, a data frame piggybacks the highest link load along the intermediate switches on the route to its egress access switch. In the second phase, ingress access switches receive path load updates through the central controller, which periodically exchanges control information among access switches.

1) Upstream Path Load Updates Using Piggybacking: During operation, each switch continuously measures the number of outgoing frames and the size of each frame coming from its Ethernet ports, so that it can determine the occupied link capacity for all attached links. For a link $l$, the ratio between the utilized link capacity and the total link capacity is defined as the link load $LL(l)$. Hence, for a path $P$ composed of several links $l_1, l_2, \ldots, l_n$, the corresponding path load $PL(P)$ is defined as follows:

$PL(P) = \max\{LL(l_1), LL(l_2), \ldots, LL(l_n)\}$  (1)

where $LL(l_1), LL(l_2), \ldots, LL(l_n)$ represent the link loads of $l_1, l_2, \ldots, l_n$, respectively.
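As a concrete reading of (1), the following sketch computes link loads from byte counts and takes their maximum as the path load; the byte-counting interface and the numbers are our own illustration:

def link_load(bytes_sent, interval_s, capacity_bps):
    """Ratio of utilized link capacity to total link capacity."""
    utilized_bps = bytes_sent * 8 / interval_s
    return utilized_bps / capacity_bps

def path_load(link_loads):
    """Path load PL(P) = max of the link loads along the path, per (1)."""
    return max(link_loads)

# Example: a three-link path where the middle link is the bottleneck.
loads = [link_load(5e8, 1.0, 1e10),   # 0.40
         link_load(9e8, 1.0, 1e10),   # 0.72
         link_load(2e8, 1.0, 1e10)]   # 0.16
assert abs(path_load(loads) - 0.72) < 1e-9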
When forwarding a frame for an end station, besides the routeID of the chosen route, the access switch also piggybacks the pathLoad information in the frame header. Here, pathLoad is the transient link load of the corresponding outgoing link (Step 2 in Fig. 4).

As the frame flows through the path, each intermediate switch checks the pathLoad information recorded in the frame's header. If the recorded pathLoad is higher than the outgoing link's linkLoad, the frame remains intact. Otherwise, the switch replaces the pathLoad field in the frame header with its outgoing link's linkLoad and forwards the frame to its next hop. Upon the arrival of the frame at its destination, the pathLoad recorded in the frame's header always shows the linkLoad of the most heavily loaded link along the path, which properly represents the corresponding pathLoad (Step 3 in Fig. 4). The pathLoad information for the respective routes is recorded in the routingTable of the egress access switches; Table IV gives an example routingTable with such information (Step 4 in Fig. 4). Periodically, the central controller checks the routingTables of egress access switches for the pathLoad information of their associated routes and updates ingress access switches accordingly; details are given in the next section (Step 5 in Fig. 4).
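The piggybacking rule reduces to a running maximum over the hops. A minimal sketch, with illustrative field names:

def update_pathload_field(frame_pathload, outgoing_link_load):
    """Return the pathLoad value the frame should carry to the next hop."""
    if frame_pathload >= outgoing_link_load:
        return frame_pathload          # frame remains intact
    return outgoing_link_load          # replace with the higher linkLoad

# Walking a frame across a path reproduces PL(P) = max of the link loads:
header = 0.0
for ll in (0.40, 0.72, 0.16):
    header = update_pathload_field(header, ll)
assert header == 0.72   # the egress access switch records this as the pathLoad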
2) Downstream Path Load Updates Using Software-Defined Networking: Access switches connect to a central controller, which enables pathLoad exchange among access switches without network delay. Instead of running a purely distributed algorithm, which mixes data traffic and control information with each other and results in performance degradation, or a purely SDN-based protocol, which depends only on a central controller that accesses every switch and holds all the intelligence, our design lets a data frame piggyback control information in its frame header along the data path toward its egress access switch, while ingress access switches obtain pathLoad updates through the control path.

Specifically, only the access switches in our proposal run SDN protocols, which allows the central controller to access the routingTables of all the access switches. When SDN-enabled switches start, they open a secure channel to the central controller. The controller can query, insert, and modify flow entries. The switches maintain statistics in their routingTables, such as pathLoad and lastUpdate, as illustrated in Table IV.

The central controller performs as follows: it obtains the network status from the routingTables of all access switches periodically, and updates ingress access switches accordingly when it detects a pathLoad change on the associated routes (Step 5 and Step 6 in Fig. 4).
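A hedged sketch of the controller's periodic update loop under the description above; the method name update_pathload and the change threshold eps are assumptions, not APIs from the paper:

import time

def controller_loop(egress_switches, ingress_of_route, period_s=1.0, eps=0.01):
    last_seen = {}                                 # routeID -> last pathLoad
    while True:
        for sw in egress_switches:
            for route_id, entry in sw.routing_table.items():
                pl = entry["pathLoad"]
                # Push updates only when the pathLoad actually changed,
                # which keeps control traffic low.
                if abs(pl - last_seen.get(route_id, -1.0)) > eps:
                    ingress_of_route[route_id].update_pathload(route_id, pl)
                    last_seen[route_id] = pl
        time.sleep(period_s)                       # Step 5 and Step 6 in Fig. 4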
The switches perform as follows: if an incoming frame does not match any of the flow entries in the routingTable, the switch inserts a new flow entry with the appropriate output port (based on Section III-C), which allows any subsequent frames to be directly forwarded at line rate in hardware (Step 1 in Fig. 4). Once the traffic volume on a route grows beyond the specified threshold, the access switch may inform the source to adjust its data rate according to the mechanism discussed in Section III-D.

To summarize, an illustration of the updating process is presented in Fig. 4. The process follows seven steps: Step 1, ingress access switch frame forwarding; Step 2, source access switch route selection; Step 3, intermediate switch path load piggybacking; Step 4, egress access switch path load updating; Step 5, central controller path load updating; Step 6, source access switch path load updating; and, optionally, Step 7, CC feedback if the path load exceeds the preset threshold.

Our design differentiates itself from a purely centralized algorithm in two ways. First, only the access switches are SDN-enabled while the rest are still common switches, which permits easy implementation and high scalability. Second, it also runs a distributed algorithm in case of controller failure, which updates ingress access switches through the data paths; this ensures that our system operates even when the controller cannot function well.
TABLE IV
routingTable OF
TABLE V
FLOWINFOTABLE OF S5
C. Load Splitting and Balancing

When an access switch receives a frame belonging to a new flow, it checks the pathLoad information in its routingTable and chooses among all the available routes, with a probability inversely proportional to their pathLoad, to deliver this new flow. This helps to achieve fairer and better utilization of network links and routes.
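A small sketch of this inverse-pathLoad route selection; the floor constant that guards against division by zero on idle routes is our assumption:

import random

def choose_route(routing_table, floor=1e-3):
    """routing_table: routeID -> pathLoad in [0, 1]. Returns a routeID."""
    route_ids = list(routing_table)
    weights = [1.0 / max(routing_table[r], floor) for r in route_ids]
    return random.choices(route_ids, weights=weights, k=1)[0]

# A route at 20% load is picked about 4x more often than one at 80% load.
table = {1: 0.2, 2: 0.8}
picks = sum(choose_route(table) == 1 for _ in range(10000))  # roughly 8000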
1) Per-Flow Forwarding: DLBMP provides dynamic load balancing by splitting traffic across multiple paths. However, if paths have different delays, traffic split at frame granularity can cause a large number of frames to arrive out of order. TCP misinterprets this reordering as a sign of congestion, resulting in degraded performance [23], [24]. Even some UDP-based applications are sensitive to packet reordering. Most importantly, it is critical to preserve in-order delivery for FCoE SAN traffic, as neither FC nor the Small Computer System Interface (SCSI) handles packet reordering well. In DLBMP, all frames of a flow are forwarded on the same path to preserve in-order delivery, and traffic splitting for load balancing occurs only at flow granularity.

Besides the MAC address and EtherType in the Ethernet frame header, application-layer information is also considered in flow differentiation so that flows can be differentiated at a finer granularity. For example, IPv4 packets can be further differentiated by their TCP or UDP port numbers, and FCoE flows can be further differentiated by their OXID along with other SCSI exchange parameters. As a result, application-layer flows can be finely differentiated in DLBMP.
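The following sketch illustrates one plausible flowID construction along these lines; the paper names the fields but not their encoding, so the tuple layout here is an assumption:

def flow_id(frame):
    # Base flow key from the Ethernet header.
    base = (frame["src_mac"], frame["dst_mac"], frame["ethertype"])
    if frame["ethertype"] == 0x0800:      # IPv4: add L4 protocol and ports
        return base + (frame["proto"], frame["src_port"], frame["dst_port"])
    if frame["ethertype"] == 0x8906:      # FCoE: add the SCSI exchange ID (OXID)
        return base + (frame["oxid"],)
    return base                           # fall back to L2 fields only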
In DLBMP, each access switch records all the flows that it currently handles in an information table named flowInfoTable. Table V gives an example flowInfoTable. The flowID field uniquely identifies an application-layer flow; it actually represents the header portion shared by all frames that belong to the flow. The routeID field gives the route used for the corresponding flow, and the lastUpdate field records the time of the last frame received for the flow, to help validate the freshness of the flow. Non-access-layer switches do not maintain such a flowInfoTable, as they always forward a frame according to the routeID indicated in the header of the received frame.

As a data frame enters the network through some access switch, the switch can tell whether the frame belongs to an existing flow or not. If the frame is from an existing flow, the access switch first updates the lastUpdate field of the flow's entry in the flowInfoTable with the current time, and then uses the route referred to by the routeID field to deliver the current data frame. In addition, it attaches the corresponding routeID in the frame header. Otherwise, it selects a route to forward the frame and adds an entry for the new flow in its flowInfoTable, as the frame is likely the first frame of a new flow.
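Putting the pieces together, a sketch of the per-flow forwarding decision at an access switch; it reuses the hypothetical flow_id() and choose_route() sketches above, and time.monotonic() stands in for the switch clock:

import time

class AccessSwitch:
    def __init__(self, routing_table):
        self.routing_table = routing_table   # routeID -> pathLoad
        self.flow_info = {}                  # flowID -> {routeID, lastUpdate}

    def handle(self, frame):
        fid = flow_id(frame)
        entry = self.flow_info.get(fid)
        if entry is None:                    # likely the first frame of a new flow
            entry = {"routeID": choose_route(self.routing_table),
                     "lastUpdate": time.monotonic()}
            self.flow_info[fid] = entry
        else:                                # existing flow: keep its path
            entry["lastUpdate"] = time.monotonic()
        frame["route_id"] = entry["routeID"]  # attach routeID in the frame header
        return entry["routeID"]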
D. Congestion Control

By integrating the CC mechanism into DLBMP, we aim to handle network congestion better and prevent loss by controlling the data rate at the data sources.

1) Congestion Detection and Notification: Instead of generating congestion notifications from any switch with potential congestion, the access switches are equipped with the intelligence for detecting congestion on a route and notifying the corresponding data sources with explicit messages.

Since each access switch records up-to-date pathLoad information for each route originating from itself, it can easily detect a congested route when the pathLoad keeps increasing and exceeds a predefined threshold, and inform all the data sources on that route according to its flowInfoTable, requiring them to reduce the data rate of their flows. Specifically, we implement a sample unit at access switches for congestion detection. The sample unit controls the sample interval using a preset sample byte count. Upon receiving an update in which the pathLoad of any entry exceeds the threshold, indicating potential congestion, the sample unit starts recording the number of bytes that have been sent through the switch. When the accumulated bytes reach the preset count, flows on the routes that exceed the threshold stand a probability of rate reduction. With this sample unit, the sample interval varies: when the traffic volume becomes higher, sampling becomes more frequent, and vice versa.
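A sketch of such a sample unit; the threshold and sample-byte symbols are elided in the source text, so the names and defaults below are stand-ins:

class SampleUnit:
    def __init__(self, threshold=0.8, sample_bytes=1_000_000):
        self.threshold = threshold
        self.sample_bytes = sample_bytes
        self.armed = False        # set when a pathLoad update exceeds the threshold
        self.acc = 0              # bytes accumulated since the last sample

    def on_pathload_update(self, path_load):
        if path_load > self.threshold:
            self.armed = True     # potential congestion: start counting bytes

    def on_frame_sent(self, nbytes):
        """Return True when a congestion-notification sample should fire."""
        if not self.armed:
            return False
        self.acc += nbytes
        if self.acc >= self.sample_bytes:
            self.acc = 0          # higher traffic volume -> more frequent samples
            return True           # flows on overloaded routes may be rate-reduced
        return False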
Similar to TCP, which utilizes explicit congestion notification (ECN) to indicate potential congestion, our mechanism also notifies the source through explicit feedback.

It is important to note that this congestion control mechanism can differentiate among different priorities by selecting different parameters according to the guidelines provided in [22]. However, this topic is not within the scope of this paper, and we do not provide details and experiments due to the space limit.

2) Rate Limiter Control: At the source side, we use rate limiters to control the traffic generation rate from the application layer.

Initially, a rate limiter starts with a sending rate of $R_0$ and increases it exponentially by doubling the rate in each predefined time interval, called a slot. Upon receiving congestion notifications from access switches, rate limiters enter adjustment cycles. Similar to additive increase and multiplicative decrease (AIMD) in TCP, we implement our rate control in such a way that once a source receives a congestion notification, it reduces its rate to a proportion $\beta$ of the current rate and then increases it linearly by $\alpha$ in each slot, as shown in (2).

Let $R(t)$ be the sending rate a flow transmits to the network at time $t$ through a certain route; the source will adjust its load at the next slot by

$R(t+1) = \begin{cases} \beta R(t), & \text{if a congestion notification is received} \\ R(t) + \alpha, & \text{otherwise} \end{cases}$  (2)
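A sketch of the rate limiter's slot-by-slot behavior per (2); R0, alpha, beta, and the link-rate cap are stand-ins for symbols the source elides:

class RateLimiter:
    def __init__(self, r0=1.0, alpha=0.5, beta=0.5, link_rate=10_000.0):
        self.rate = r0                # current sending rate (e.g., Mb/s)
        self.alpha = alpha            # linear increment per slot
        self.beta = beta              # multiplicative reduction factor
        self.link_rate = link_rate    # upper bound on the sending rate
        self.adjusting = False        # False: initial doubling phase

    def next_slot(self, congestion_notified):
        if congestion_notified:
            self.adjusting = True
            self.rate *= self.beta    # multiplicative decrease, per (2)
        elif self.adjusting:
            self.rate += self.alpha   # additive increase, per (2)
        else:
            self.rate *= 2.0          # initial exponential ramp-up
        self.rate = min(self.rate, self.link_rate)
        return self.rate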
The CC mechanism can guarantee loss-free delivery by successfully controlling the data sending rate from the sources in an overloaded network. Hence, the performance gaps become more obvious after the flow number increases beyond 1900 flows.

Second, we show the comparison of throughput on each outgoing port. Due to the space limitation, we list only the statistics collected from the simulation test with 2100 flows; however, they are comparable to those of the tests with other flow numbers. In Fig. 9, we list the total number of frames, in millions, sent from all the switch ports towards the destinations, Pod3 and Pod4. These ports include the uplink ports of SW1, SW2, SW3, and SW4, as well as the downlink ports of SW5, SW6, SW7, SW8, SW9, and SW10. To distinguish the ports in the same switch, we name them L ports, for those on the left, and R ports, for those on the right. In Fig. 9, we can see that the traffic load is most imbalanced under the STP algorithm. TRILL improves significantly as compared to STP. However, our DLBMP and DLBMP+CC schemes outperform TRILL, which is mainly due to their periodic inspection of path load and dynamic traffic distribution. With fine-grained flow control, DLBMP+CC balances traffic most evenly among the four schemes, with the throughput of all ports ranging between 30 and 35 million frames, owing to the introduction of FFD.

Fig. 9. Throughput at each outgoing port with regard to different algorithms: (a) STP; (b) TRILL; (c) DLBMP; (d) DLBMP+CC.

V. CONCLUSION

In this paper, we proposed an integrated Ethernet solution, DLBMP+CC, by combining MP with CC based on our previously proposed DLBMP. As compared with DLBMP, DLBMP+CC improves network throughput because its application-layer flow differentiation can make full utilization of the network bandwidth, and its congestion control can prevent excessive traffic from entering the network.

With the introduction of SDN, our dynamic algorithm can update and react to load imbalance and traffic congestion promptly. In a heavily loaded network, DLBMP+CC can still guarantee load balance and loss-free frame delivery with its fast reaction, as the central controller amasses network status and exchanges information effectively among ingress and egress access switches. Simulation results demonstrated the effectiveness and efficiency of DLBMP+CC in different scenarios.

We use a single measuring parameter, path load, in this paper to track the network traffic load for the integration of MP and CC. However, the buffer utilization level is another important parameter that can indicate network load and congestion; incorporating it is a possible direction for future work.

REFERENCES

[1] S. Gai, Data Center Networks and Fibre Channel over Ethernet (FCoE), 2008.
[2] G. Silvano and D. Claudio, I/O Consolidation in the Data Center: A Complete Guide to Data Center Ethernet and Fibre Channel over Ethernet, 2009.
[3] J. B. Graham Smit, "Converged Enhanced Ethernet-Good for iSCSI SANs," 2008 [Online]. Available: http://bladenetwork.net/userfiles/file/PDFs/WP_NetApp_Enhanced_Ethernet.pdf
[4] A. Benner, P. Pepeljugoski, and R. Recio, "A roadmap to 100 G Ethernet at the enterprise data center," IEEE Commun. Mag., vol. 45, no. 11, pp. 10–17, 2007.
[5] T11-FC-BB-5 Standard, 2010.
[6] 802.1D MAC Bridges [Online]. Available: http://www.ieee802.org/1/pages/802.1D-2003.html
[7] 802.1w—Rapid Reconfiguration of Spanning Tree [Online]. Available: http://www.ieee802.org/1/pages/802.1w.html
[8] C. Raiciu, S. Barre, C. Pluntke, A. Greenhalgh, D. Wischik, and M. Handley, "Improving datacenter performance and robustness with multipath TCP," ACM SIGCOMM Comput. Commun. Rev., vol. 41, no. 4, pp. 266–277, 2011.
[9] Y. Dong, D. Wang, N. Pissinou, and J. Wang, "Multi-path load balancing in transport layer," in Proc. 3rd EuroNGI Conf. Next Generation Internet Netw., 2007, pp. 135–142.
[10] J. Mudigonda, P. Yalagandula, M. Al-Fares, and J. C. Mogul, "SPAIN: COTS data-center Ethernet for multipathing over arbitrary topologies," in Proc. 7th USENIX Conf. Netw. Syst. Design Implement., 2010, pp. 18–18.
[11] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat, "PortLand: A scalable fault-tolerant layer 2 data center network fabric," ACM SIGCOMM Comput. Commun. Rev., vol. 39, no. 4, pp. 39–50, 2009.
[12] D. Bergamasco, "Ethernet congestion manager (ECM) specification," Cisco Systems, initial draft EDCS-574018, 2007.
[13] 802.1Qau—Congestion Notification, IEEE Standard for Local and Metropolitan Area Networks—Virtual Bridged Local Area Networks—Amendment, 2007 [Online]. Available: http://www.ieee802.org/1/pages/802.1au.html
[14] N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, "OpenFlow: Enabling innovation in campus networks," ACM SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, pp. 69–74, 2008.
[15] Y. Yu, K. Aung, E. Tong, and C. Foh, "Dynamic load balancing multipathing for converged enhanced Ethernet," in Proc. 18th Annu. IEEE Int. Symp. Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS), Miami, FL, USA, 2010.
[16] R. Perlman, "Rbridges: Transparent routing," in Proc. IEEE INFOCOM, 2004, vol. 2, pp. 1211–1218.
[17] Transparent Interconnection of Lots of Links (TRILL), IETF WG [Online]. Available: http://www.ietf.org/html.charters/trill-charter.html
[18] C. Kim, M. Caesar, and J. Rexford, "Floodless in SEATTLE: A scalable Ethernet architecture for large enterprises," ACM SIGCOMM Comput. Commun. Rev., vol. 38, no. 4, pp. 3–14, 2008.
[19] M. Al-Fares, A. Loukissas, and A. Vahdat, "A scalable, commodity data center network architecture," ACM SIGCOMM Comput. Commun. Rev., vol. 38, no. 4, pp. 63–74, 2008.
[20] A. Greenberg, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta, "Towards a next generation data center architecture: Scalability and commoditization," in Proc. ACM Workshop Programmable Routers for Extensible Services of Tomorrow, 2008, pp. 57–62.
[21] A. Greenberg, J. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. Maltz, P. Patel, and S. Sengupta, "VL2: A scalable and flexible data center network," ACM SIGCOMM Comput. Commun. Rev., vol. 39, no. 4, pp. 51–62, 2009.
[22] S. Fang, C. Foh, and K. Aung, "Differentiated congestion management of data traffic for data center Ethernet," IEEE Trans. Netw. Service Manag., no. 99, pp. 1–12.
[23] T. Chim, K. Yeung, and K. Lui, "Traffic distribution over equal-cost-multi-paths," Comput. Netw., vol. 49, no. 4, pp. 465–475, 2005.
[24] S. Kandula, D. Katabi, S. Sinha, and A. Berger, "Dynamic load balancing without packet reordering," ACM SIGCOMM Comput. Commun. Rev., vol. 37, no. 2, pp. 51–62, 2007.