DOI: 10.1145/2975159
Abstract
We present our approach for overcoming the cost, operational complexity, and limited scale endemic to datacenter networks a decade ago. Three themes unify the five generations of datacenter networks detailed in this paper. First, multi-stage Clos topologies built from commodity switch silicon can support cost-effective deployment of building-scale networks. Second, much of the general, but complex, decentralized network routing and management protocols supporting arbitrary deployment scenarios were overkill for single-operator, pre-planned datacenter networks. We built a centralized control mechanism based on a global configuration pushed to all datacenter switches. Third, modular hardware design coupled with simple, robust software allowed our design to also support inter-cluster and wide-area networks. Our datacenter networks run at dozens of sites across the planet, scaling in capacity by 100x over 10 years to more than 1 Pbps of bisection bandwidth. A more detailed version of this paper is available at Ref.21

1. INTRODUCTION
From the beginning, Google built and ran an internal cloud for all of our internal infrastructure services and external-facing applications. This shared infrastructure allowed us to run services more efficiently through statistical multiplexing of the underlying hardware, reduced operational overhead by allowing best practices and management automation to apply uniformly, and increased overall velocity in delivering

have been doubling every 12–15 months (Figure 1), even faster than the wide area Internet. Essentially, we could not buy a network at any price that could meet our scale and performance needs.

More recently, Google has been expanding aggressively in the public Cloud market, making our internal hardware and software infrastructure available to external customers. This shift has further validated our approach to building scalable, efficient, and highly available data center networks. With our internal cloud deployments, we could vertically integrate and centrally plan to co-optimize application and network behavior. A bottleneck in network performance could be alleviated by redesigning Google application behavior. If a single site did not provide requisite levels of availability, the application could be replicated to multiple sites. Finally, bandwidth requirements could be precisely measured and projected to inform future capacity requirements. With the external cloud, application bandwidth demands are highly bursty and much more difficult to adjust. For example, an external customer may in turn be running third party software that is difficult to modify. We have found that our data center network architecture substantially eases

Figure 1. Aggregate server traffic in our datacenter fleet. (Plot shows roughly 50x growth in aggregate traffic generated by servers in our datacenters.)
SEPTEMBER 2016 | VOL. 59 | NO. 9 | COMMUNICATIONS OF THE ACM
research highlights
The following sections explain these challenges and the rationale for our approach in detail.

Our topological approach, reliance on merchant silicon, and load balancing across multipath are substantially similar to contemporaneous research.1, 15 Our centralized control protocols running on switch embedded processors are also related to subsequent substantial efforts in SDN.12 Based on our experience in the datacenter, we later applied SDN to our Wide Area Network.17 For the WAN, more CPU-intensive traffic engineering and BGP routing protocols led us to move control protocols onto external servers with more powerful CPUs.

3. NETWORK EVOLUTION
3.1. Firehose
Table 2 summarizes the multiple generations of our cluster network. With our initial approach, Firehose 1.0 (or FH1.0), our nominal goal was to deliver 1Gbps of nonblocking bisection bandwidth to each of 10K servers. Figure 3 details the FH1.0 topology using 8x10G switches in both the aggregation blocks as well as the spine blocks. The ToR switch delivered 2x10GE ports to the fabric and 24x1GE server ports.

Each aggregation block hosted 16 ToRs and exposed 32x10G ports towards 32 spine blocks. Each spine block had 32x10G towards 32 aggregation blocks, resulting in a fabric that scaled to 10K machines at 1G average bandwidth to any machine in the fabric.

Since we did not have any experience building switches but we did have experience building servers, we attempted to integrate the switching fabric into the servers via a PCI board. See top right inset in Figure 3. However, the uptime of servers was less than ideal. Servers crashed and were upgraded more frequently than desired, with long reboot times. Network disruptions from server failure were especially problematic for servers housing ToRs connecting multiple other servers to the first stage of the topology.

The resulting wiring complexity for server-to-server connectivity, electrical reliability issues, and availability and general issues associated with our first foray into switching doomed the effort to never seeing production traffic. At the same time, we consider FH1.0 to be a landmark effort internally. Without it and the associated learning, the efforts that followed would not have been possible.

Our first production deployment of a custom datacenter cluster fabric was Firehose 1.1 (FH1.1). We had learned from FH1.0 not to use servers to house switch chips. Thus, we built custom enclosures that standardized around the compact PCI chassis, each with six independent linecards and a dedicated Single-Board Computer (SBC) to control the linecards using PCI. See insets in Figure 4. The fabric chassis did
Table 1. High-level summary of challenges we faced and our approach to address them.
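As a sanity check, the FH1.0 scale described in Section 3.1 follows directly from the stated port counts. The sketch below is illustrative arithmetic only, using constants taken from the text; it is not a tool from the paper:

```python
# Back-of-the-envelope check of the Firehose 1.0 numbers quoted above.
# All port counts come from the text; this is illustrative arithmetic only.

GBPS_PER_FABRIC_PORT = 10   # fabric built from 8x10G switch silicon
AGG_BLOCKS = 32             # each spine block faces 32 aggregation blocks
SPINE_BLOCKS = 32           # each aggregation block faces 32 spine blocks
TORS_PER_AGG_BLOCK = 16
SERVERS_PER_TOR = 24        # 24x1GE server ports per ToR

servers = AGG_BLOCKS * TORS_PER_AGG_BLOCK * SERVERS_PER_TOR

# Capacity of the aggregation-to-spine layer (32x10G per block):
bisection_gbps = AGG_BLOCKS * SPINE_BLOCKS * GBPS_PER_FABRIC_PORT

# The ToR itself is slightly oversubscribed: 24x1G down vs. 2x10G up.
tor_oversub = (SERVERS_PER_TOR * 1) / (2 * 10)

print(servers)          # 12288 -- the "10K machines" of the text
print(bisection_gbps)   # 10240 -- roughly 1G average per machine
print(tor_oversub)      # 1.2
```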
Figure 3. Firehose 1.0 topology. Top right shows a sample 8x10G port fabric board in Firehose 1.0, which formed Stages 2, 3 or 4 of the topology. (Board legend: ToR (Stage 1) board: 2×10G up, 24×1G down; Stages 2, 3 or 4 board: 4×10G up, 4×10G down; Stage 5 board: 8×10G down. Aggregation Block: 32×10G to 32 spine blocks; Spine Block: 32×10G to 32 aggregation blocks; racks 1 through 512, each with a ToR and servers.)

Figure 5. Firehose 1.1 deployed as a bag-on-the-side Clos fabric. (Diagram shows the legacy network routers alongside the Firehose 1.1 fabric; Stages 2, 3 or 4 linecard: 4×10G up, 4×10G down.)
merchant silicon building blocks. A Saturn chassis supports 12 linecards to provide a 288 port non-blocking switch. These chassis are coupled with new Pluto single-chip ToR switches; see Figure 7. In the default configuration, Pluto supports 20 servers with 4x10G provisioned to the cluster fabric for an average bandwidth of 2 Gbps for each server. For more bandwidth-hungry servers, we could configure the Pluto ToR with 8x10G uplinks and 16x10G to servers, providing 5 Gbps to each server. Importantly, servers could burst at 10 Gbps across the fabric for the first time.

3.3. Jupiter: A 40G datacenter-scale fabric
As bandwidth requirements per server continued to grow, so did the need for uniform bandwidth across all clusters in the datacenter. With the advent of dense 40G capable merchant silicon, we could consider expanding our Clos fabric across the entire datacenter, subsuming the inter-cluster networking layer. This would potentially enable an unprecedented pool of compute and storage for application scheduling. Critically, the unit of maintenance could be kept small enough relative to the size of the fabric that most applications could now be agnostic to network maintenance windows, unlike previous generations of the network.

Jupiter, our next generation datacenter fabric, needed to scale more than 6x the size of our largest existing fabric. Unlike previous iterations, we set a requirement for incre-

4x10G or 40G mode. There were no backplane data connections between these chips; all ports were accessible on the front panel of the chassis.

We employed the Centauri switch as a ToR switch with each of the four chips serving a subnet of machines. In one ToR configuration, we configured each chip with 48x10G to the servers and 16x10G to the fabric. Servers could be configured with 40G burst bandwidth for the first time in production (see Table 2). Four Centauris made up a Middle Block (MB) for use in the aggregation block. The logical topology of an MB was a 2-stage blocking network, with 256x10G links available for ToR connectivity and 64x40G available for connectivity to the rest of the fabric through the spine (Figure 9).

Each ToR chip connects to eight such MBs with dual redundant 10G links. The dual redundancy aids fast

Figure 8. Building blocks used in the Jupiter topology. (Diagram labels: Centauri merchant silicon, 16×40G; Spine Block: 32×40G up, 32×40G down; Aggregation Block: 512×40G to 256 spine blocks; 128×40G down to 64 aggregation blocks.)
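The per-server averages quoted for the two Pluto ToR configurations are straightforward port arithmetic. A minimal illustrative sketch, using only the numbers from the text (not code from the paper):

```python
# Average provisioned fabric bandwidth per server for the two Pluto
# ToR configurations described above (port counts from the text).

def avg_gbps_per_server(uplinks_10g: int, servers: int) -> float:
    """Total 10G uplink capacity shared evenly across attached servers."""
    return uplinks_10g * 10 / servers

print(avg_gbps_per_server(4, 20))   # default config, 20 servers: 2.0 Gbps each
print(avg_gbps_per_server(8, 16))   # 8x10G uplinks, 16 servers: 5.0 Gbps each
```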
out-of-band Control Plane Network (CPN) appeared to be substantially simpler and more efficient from a computation and communication perspective. The switches could then calculate forwarding tables based on current link state as deltas relative to the underlying, known static topology that was pushed to all switches.

Overall, we treated the datacenter network as a single fabric with tens of thousands of ports rather than a collection of hundreds of autonomous switches that had to dynamically discover information about the fabric. We were, at this time, inspired by the success of large-scale distributed storage systems with a centralized manager.13 Our design informed the control architecture for both Jupiter datacenter networks and Google's B4 WAN,17 both of which are based on OpenFlow18 and custom SDN control stacks.

Figure 11. Firepath component interactions. (A) Non-CBR fabric switch and (B) CBR switch. (Diagram labels: Firepath Master; Firepath protocol; Firepath Client; RIB; BGP; intra- and inter-cluster route updates; port status; config; eBGP route redistribution; Embedded Stack; Kernel and Device drivers.)

5.2. Routing
We now present the key components of Firepath, our routing architecture for Firehose, Watchtower, and Saturn fabrics. A number of these components anticipate some of the principles of modern SDN, especially in using logically centralized state and control. First, all switches are configured with

Figure 12. Protocol messages between Firepath client and Firepath master, between Firepath masters, and between CBR and external BGP speakers. (Diagram labels: Interface state update; Link State database; Keepalive req/rsp; FMRP protocol; CPN; eBGP protocol (inband).)
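The delta-based computation described above — a static topology pushed as configuration, with only per-link up/down status learned dynamically — can be sketched as follows. This is a minimal, hypothetical illustration of the general idea (all names are invented; it is not Firepath's actual code):

```python
# Minimal sketch of centralized route computation over a known static
# topology, as described above. The topology is pushed to every switch
# as configuration; only link up/down status is learned dynamically.
from collections import deque

# Hypothetical static topology: switch -> set of neighbor switches.
STATIC_TOPOLOGY = {
    "tor1": {"agg1", "agg2"},
    "agg1": {"tor1", "spine1"},
    "agg2": {"tor1", "spine1"},
    "spine1": {"agg1", "agg2"},
}

def next_hops(src, dst, down_links):
    """All neighbors of src on a shortest live path to dst (an ECMP set)."""
    def up(a, b):
        return (a, b) not in down_links and (b, a) not in down_links
    # BFS from dst to get hop distances over live links only.
    dist = {dst: 0}
    q = deque([dst])
    while q:
        node = q.popleft()
        for nbr in STATIC_TOPOLOGY[node]:
            if up(node, nbr) and nbr not in dist:
                dist[nbr] = dist[node] + 1
                q.append(nbr)
    if src not in dist:
        return set()  # dst unreachable from src over live links
    return {n for n in STATIC_TOPOLOGY[src]
            if up(src, n) and dist.get(n) == dist[src] - 1}

print(next_hops("tor1", "spine1", set()))               # both aggregation switches
print(next_hops("tor1", "spine1", {("tor1", "agg1")}))  # agg2 only
```

When a link status delta arrives, only `down_links` changes; the static topology never needs to be rediscovered, which is what made the centralized approach cheap in computation and communication.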
ToRs to keep servers from over-running oversubscribed uplinks. Fourth, we enabled Explicit Congestion Notification (ECN) on our switches and optimized the host stack response to ECN signals.2 Fifth, we monitored application bandwidth requirements in the face of oversubscription ratios and could provision bandwidth by deploying Pluto ToRs with four or eight uplinks as required. Sixth, the merchant silicon had shared memory buffers used by all ports, and we tuned the buffer sharing scheme on these chips so as to dynamically allocate a disproportionate fraction of total chip buffer space to absorb temporary traffic bursts. Finally, we carefully configured switch hashing functionality to support good ECMP load balancing across multiple fabric paths.

Our congestion mitigation techniques delivered substantial improvements. We reduced the packet discard rate in a typical Clos fabric at 25% average utilization from 1% to < 0.01%. Further improving fabric congestion response remains an ongoing effort.

6.2. Outages
While the overall availability of our datacenter fabrics has been satisfactory, our outages fall into three categories representing the most common failures in production: (i) control software problems at scale; (ii) aging hardware exposing previously unhandled failure modes; and (iii) misconfigurations of certain components.

Control software problems at large scale. A datacenter power event once caused the entire fabric to restart simultaneously. However, the control software did not converge without manual intervention. The instability took place because our liveness protocol (ND) and route computation contended for limited CPU resources on embedded switch CPUs. On entire fabric reboot, routing experienced huge churn, which, in turn, led ND not to respond to heartbeat messages quickly enough. This in turn led to a snowball effect for routing where link state would spuriously go from up to down and back to up again. We stabilized the network by manually bringing up a few blocks at a time.

Going forward, we included the worst case fabric reboot in our test plans. Since the largest scale datacenter could never be built in a hardware test lab, we launched efforts to stress test our control software at scale in virtualized environments. We also heavily scrutinized any timer values in liveness protocols, tuning them for the worst case while balancing slower reaction time in the common case. Finally, we reduced the priority of non-critical processes that shared the same CPU.

Aging hardware exposes unhandled failure modes. Over years of deployment, our inbuilt fabric redundancy degraded as a result of aging hardware. For example, our software was vulnerable to internal/backplane link failures, leading to rare traffic blackholing. Another example centered around failures of the CPN. Each fabric chassis had dual redundant links to the CPN in active-standby mode. We initially did not actively monitor the health of both the active and standby links. With age, the vendor gear suffered from unidirectional failures of some CPN links, exposing unhandled corner cases in our routing protocols. Both these problems would have been easier to mitigate had the proper monitoring and alerting been in place for fabric backplane and CPN links.

Component misconfiguration. A prominent misconfiguration outage was on a Freedome fabric. Recall that a Freedome chassis runs the same codebase as the CBR with its integrated BGP stack. A CLI interface to the CBR BGP stack supported configuration. We did not implement locking to prevent simultaneous read/write access to the BGP configuration. During a planned BGP reconfiguration of a Freedome block, a separate monitoring system coincidentally used the same interface to read the running config while a change was underway. Unfortunately, the resulting partial configuration led to undesirable behavior between Freedome and its BGP peers.

We mitigated this error by quickly reverting to the previous configuration. However, it taught us to harden our operational tools further. It was not enough for tools to configure the fabric as a whole; they needed to do so in a safe, secure, and consistent way.

7. CONCLUSION
This paper presents a retrospective on ten years and five generations of production datacenter networks. We employed complementary techniques to deliver more bandwidth to larger clusters than would otherwise be possible at any cost. We built multi-stage Clos topologies from bandwidth-dense but feature-limited merchant switch silicon. Existing routing protocols were not easily adapted to Clos topologies. We departed from conventional wisdom to build a centralized route controller that leveraged global configuration of a predefined cluster plan pushed to every datacenter switch. This centralized control extended to our management infrastructure, enabling us to eschew complex protocols in favor of best practices from managing the server fleet. Our approach has enabled us to deliver substantial bisection bandwidth for building-scale fabrics, all with significant application benefit.

Acknowledgments
Many teams contributed to the success of the datacenter network within Google. In particular, we would like to acknowledge the Platforms Networking (PlaNet) Hardware and Software Development, Platforms Software Quality Assurance (SQA), Mechanical Engineering, Cluster Engineering (CE), Network Architecture and Operations (NetOps), Global Infrastructure Group (GIG), and Site Reliability Engineering (SRE) teams, to name a few.

References
1. Al-Fares, M., Loukissas, A., Vahdat, A. A scalable, commodity data center network architecture. In ACM SIGCOMM Computer Communication Review 38 (2008), ACM, 63–74.
2. Alizadeh, M., Greenberg, A., Maltz, D.A., Padhye, J., Patel, P., Prabhakar, B., Sengupta, S., Sridharan, M. Data center TCP (DCTCP). ACM SIGCOMM Comput. Commun. Rev. 41, 4 (2011), 63–74.
3. Barroso, L.A., Dean, J., Hölzle, U. Web search for a planet: The Google cluster architecture. IEEE Micro 23, 2 (2003), 22–28.
4. Barroso, L.A., Hölzle, U. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Syn. Lect. Comput. Architect. 4, 1 (2009), 1–108.
5. Calder, B., Wang, J., Ogus, A., Nilakantan, N., Skjolsvold, A., McKelvie, S., Xu, Y., Srivastav, S., Wu, J., Simitci, H., et al. Windows Azure storage: A highly available cloud storage service with strong consistency. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles (2011), ACM, 143–157.