Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong∗,
Luiz Barroso, Carrie Grimes, and Sean Quinlan
{ford,flab,florentina,mstokely}@google.com, vatruong@ieor.columbia.edu
{luiz,cgrimes,sean}@google.com
Google, Inc.
impact under a variety of replication schemes and placement policies. (Sections 5 and 6)

• Formulate a Markov model for data availability that can scale to arbitrary cell sizes, and captures the interaction of failures with replication policies and recovery times. (Section 7)
We primarily use two metrics throughout this paper. The average availability of all N nodes in a cell is defined as:

    A_N = [ Σ_{Ni ∈ N} uptime(Ni) ] / [ Σ_{Ni ∈ N} (uptime(Ni) + downtime(Ni)) ]    (1)

We use uptime(Ni) and downtime(Ni) to refer to the lengths of time a node Ni is available or unavailable, respectively. The sum of availability periods over all nodes is called node uptime. We define uptime similarly for other component types. We define unavailability as the complement of availability.

Mean time to failure, or MTTF, is commonly quoted in the literature related to the measurement of availability. We use MTTF for components that suffer transient or permanent failures, to avoid frequent switches in terminology:

    MTTF = uptime / number of failures    (2)

Availability measurements for nodes and individual components in our system are presented in Section 3.

2.2 Data replication

Distributed storage systems increase resilience to failures by using replication [2] or erasure encoding across nodes [28]. In both cases, data is divided into a set of stripes, each of which comprises a set of fixed-size data and code blocks called chunks. Data in a stripe can be reconstructed from some subsets of the chunks. For replication, R = n refers to n identical chunks in a stripe, so the data may be recovered from any one chunk. For Reed-Solomon erasure encoding, RS(n, m) denotes n distinct data blocks and m error-correcting blocks in each stripe. In this case a stripe may be reconstructed from any n chunks.

We call a chunk available if the node it is stored on is available. We call a stripe available if enough of its chunks are available to reconstruct the missing chunks, if any.

Data availability is a complex function of the individual node availability, the encoding scheme used, the distribution of correlated node failures, chunk placement, and recovery times, which we explore in the second part of this paper. We do not explore related mechanisms for dealing with failures, such as additional application-level redundancy and recovery, or manual component repair.

3 Characterizing Node Availability

Anything that renders a storage node unresponsive is a potential cause of unavailability, including hardware component failures, software bugs, crashes, system reboots, power loss events, and loss of network connectivity. We include in our analysis the impact of software upgrades, reconfiguration, and other maintenance. These planned outages are necessary in a fast-evolving datacenter environment, but have often been overlooked in other availability studies. In this section we present data for storage node unavailability and provide some insight into the main causes of unavailability.

3.1 Numbers from the fleet

Failure patterns vary dramatically across different hardware platforms, datacenter operating environments, and workloads. We start by presenting numbers for disks. Disks have been the focus of several other studies, since they are the system component that permanently stores the data, and thus a disk failure potentially results in permanent data loss. The numbers we observe for disk and storage subsystem failures, presented in Table 2, are comparable with what other researchers have measured. One study [29] reports an ARR (annual replacement rate) for disks between 2% and 4%. Another study [19] focused on storage subsystems, thus including errors from shelves, enclosures, physical interconnects, protocol failures, and performance failures. It found an AFR (annual failure rate) generally between 2% and 4%, but for some storage systems values ranging between 3.9% and 8.3%.

For the purposes of this paper, we are interested in disk errors as perceived by the application layer. This includes latent sector errors and corrupt sectors on disks, as well as errors caused by firmware, device drivers, controllers, cables, enclosures, silent network and memory corruption, and software bugs. We deal with these errors with background scrubbing processes on each node, as in [5, 31], and by verifying data integrity during client reads [4]. Background scrubbing in GFS finds that between 1 in 10^6 and 1 in 10^7 of older data blocks do not match the checksums recorded when the data was originally written. However, these cell-wide rates are typically concentrated on a small number of disks.

We are also concerned with node failures in addition to individual disk failures. Figure 2 shows the distribution of three mutually exclusive causes of node unavailability in one of our storage cells. We focus on node restarts (software restarts of the storage program running on each machine), planned machine reboots (e.g. kernel version upgrades), and unplanned machine reboots (e.g. kernel crashes). For the purposes of this figure we do not exclude events that last less than 15 minutes, but we still end the unavailability period when the system reconstructs all the data previously stored on that node. Node restart events exhibit the greatest variability in duration, ranging from less than one minute to well over an
hour, though they usually have the shortest duration. Unplanned reboots have the longest average duration, since extra checks or corrective action are often required to restore machines to a safe state.

[Figure 2: cumulative distribution (Events %) of unavailability event durations, from 1 s to 1 month, for node restarts, planned reboots, and unplanned reboots.]

Figure 3 plots the unavailability events per 1000 nodes per day for one example cell, over a period of three months. The number of events per day, as well as the number of events that can be attributed to a given cause, vary significantly over time as operational processes, tools, and workloads evolve. Events we cannot classify accurately are labeled unknown.

Figure 3: Rate of events per 1000 nodes per day, for one example cell.

The effect of machine failures on availability is dependent on the rate of failures, as well as on how long the machines stay unavailable. Figure 4 shows the node unavailability, along with the causes that generated the unavailability, for the same cell used in Figure 3. The availability is computed with a one-week rolling window, using definition (1). We observe that the majority of unavailability is generated by planned reboots.

Figure 4: Storage node unavailability computed with a one week rolling window, for one example cell.

Table 1 shows the unavailability from node restarts, planned and unplanned machine reboots, each of which is a significant cause. The numbers are exclusive; thus the planned machine reboots do not include node restarts.

    Cause                       Unavailability (%): average / min / max
    Node restarts               0.0139 / 0.0004 / 0.1295
    Planned machine reboots     0.0154 / 0.0050 / 0.0563
    Unplanned machine reboots   0.0025 / 0.0000 / 0.0122
    Unknown                     0.0142 / 0.0013 / 0.0454

Table 1: Unavailability attributed to different failure causes, over the full set of cells.

Table 2 shows the MTTF for a series of important components: disks, nodes, and racks of nodes. The numbers we report for component failures are inclusive of software errors and hardware failures. Though disk failures are permanent and most node failures are transitory, the significantly greater frequency of node failures makes them a much more important factor for system availability (Section 8.4).

4 Correlated Failures

The co-occurring failure of a large number of nodes can reduce the effectiveness of replication and encoding schemes. Therefore it is critical to take into account the statistical behavior of correlated failures to understand data availability. In this section we are more concerned with measuring the frequency and severity of such failures rather than root causes.
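The conclusions of this paper describe grouping failure events into bursts with a simple time-window method. As an illustration of how such grouping can be done, the sketch below chains together failures that occur close in time; the (timestamp, node_id) event format and the 120-second default window are our assumptions for this example, not values prescribed by this section.

```python
from typing import List, Tuple

def group_into_bursts(events: List[Tuple[float, str]],
                      window_s: float = 120.0) -> List[List[Tuple[float, str]]]:
    """Group (timestamp, node_id) failure events into bursts.

    A failure joins the current burst if it occurs within window_s
    seconds of the previous failure in that burst; otherwise it
    starts a new burst. The 120 s default is illustrative only.
    """
    bursts: List[List[Tuple[float, str]]] = []
    last_t = None
    for t, node in sorted(events):
        if last_t is None or t - last_t > window_s:
            bursts.append([])          # gap exceeded: start a new burst
        bursts[-1].append((t, node))
        last_t = t
    return bursts
```

With this chaining rule, a burst can extend over an arbitrarily long period as long as consecutive failures stay within the window of each other.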
    Component   Disk          Node         Rack
    MTTF        10-50 years   4.3 months   10.2 years

Table 2: MTTF of important components.

[Figure 5: individual failure events on a time axis (min), grouped into a burst.]

Figure 6: Effect of the window size on the fraction of individual failures that get clustered into bursts of at least 10 nodes.
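The curve in Figure 6 can be recomputed from raw failure timestamps by sweeping the window length and measuring how many individual failures land in large bursts. The sketch below is an illustrative implementation of that measurement; the chaining rule and the 10-node threshold follow the figure caption, while the input format is an assumption.

```python
def clustered_fraction(times, window_s, min_burst=10):
    """Fraction of individual failures that fall into bursts of at
    least min_burst events, where consecutive failures within
    window_s seconds of each other are chained into one burst
    (the quantity plotted in Figure 6)."""
    ts = sorted(times)
    if not ts:
        return 0.0
    sizes, run = [], 1
    for prev, cur in zip(ts, ts[1:]):
        if cur - prev <= window_s:
            run += 1          # still inside the current burst
        else:
            sizes.append(run)  # gap exceeded: close the burst
            run = 1
    sizes.append(run)
    return sum(s for s in sizes if s >= min_burst) / len(ts)
```

Sweeping window_s from a few seconds up to several hundred seconds then traces out a curve of the kind shown in Figure 6.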
[...]

Figure 7: Development of failure bursts in one example cell.

Figure 8: Frequency of failure bursts sorted by racks and nodes affected (marker size denotes 1, 10, 100, or 1000 occurrences).
[...] approximate the metric using simulation of random bursts. We choose to compute the metric exactly using dynamic programming, because the extra precision it provides allows us to distinguish metric values very close to 1.

5.1 Data replication and recovery

Replication or erasure encoding schemes provide resilience to individual node failures. When a node failure causes the unavailability of a chunk within a stripe, we initiate a recovery operation for that chunk from the other available chunks remaining in the stripe.

Distributed filesystems will necessarily employ queues for recovery operations following node failure. These queues prioritize reconstruction of stripes which [...] individual disks, nodes, and racks. Furthermore, there is an explicit design tradeoff in the use of bandwidth for recovery operations versus serving client read/write requests.

This limit is particularly apparent during correlated failures, when a large number of chunks go missing at the same time. Figure 9 shows the recovery delay after a failure [...]

[Figure 9: number of chunk recoveries versus seconds from server/disk failure to chunk recovery initiation, for stripes with 3 unavailable chunks.]

5.2 Chunk placement and stripe unavailability

To mitigate the effect of large failure bursts in a single failure domain we consider known failure domains when placing chunks within a stripe on storage nodes. For example, racks constitute a significant failure domain to avoid. A rack-aware policy is one that ensures that no two chunks in a stripe are placed on nodes in the same rack.

Given a failure burst, we can compute the expected [...] chunks are affected in a stripe of size n, which is essential to the Markov model of Section 7. Assuming that stripes are uniformly distributed across nodes of the cell, this probability is a ratio where the numerator is the number of ways to place a stripe of size n in the cell such [...] burst size. For each class of bursts we calculate the average fraction of stripes affected per burst and the rate of bursts, to get the combined MTTF due to that class.
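For uniform (non-rack-aware) placement, the ratio described above reduces to a hypergeometric sum. The sketch below computes the probability that a single burst makes a given stripe unavailable; the function name and the restriction to uniform placement are ours, and the rack-aware numerator described in the text requires more involved counting.

```python
from math import comb

def p_stripe_unavailable(N, k, s, r):
    """Probability that a stripe of s chunks, placed uniformly at
    random on s distinct nodes of an N-node cell, becomes unavailable
    when a burst takes down k of the nodes. The stripe survives while
    at least r of its chunks remain available (r = 1 for replication,
    r = n for RS(n, m) with s = n + m)."""
    # The stripe dies if it loses more than s - r chunks; j counts
    # chunks landing on failed nodes (math.comb returns 0 when j > k).
    dead = sum(comb(k, j) * comb(N - k, s - j)
               for j in range(s - r + 1, s + 1))
    return dead / comb(N, s)
```

Multiplying this probability by the number of stripes gives the expected number of stripes affected per burst; combining the average fraction affected with the rate of bursts in each class yields the per-class MTTF contribution described above.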
[...] despite the fact that they are much rarer. We also see that for small and medium burst sizes, and large encodings, [...]

6 Cell Simulation

This section introduces a trace-based simulation method for calculating availability in a cell. The method replays observed or synthetic sequences of node failures and calculates the resulting impact on stripe availability. It offers a detailed view of availability in short time frames.

For each node, the recorded events of interest are down, up, and recovery complete events. When all nodes are up, they are each assumed to be responsible for an equal number of chunks. When a node goes down it is still responsible for the same number of chunks until 15 minutes later, when the chunk recovery process starts. For simplicity and conservativeness, we assume that all these chunks remain unavailable until the recovery complete event. A more accurate model could model recovery too, such as by reducing the number of unavailable chunks linearly until the recovery complete event, or by explicitly modelling recovery queues.

We are interested in the expected number of stripes that are unavailable for at least 15 minutes, as a function of time. Instead of simulating a large number of stripes, it is more efficient to simulate all possible stripes, and use combinatorial calculations to obtain the expected number of unavailable stripes given a set of down nodes, as was done in Section 5.2.

As a validation, we can run the simulation using the stripe encodings that were in use at the time, to see if the predicted number of unavailable stripes matches the actual number of unavailable stripes as measured by our storage system. Figure 11 shows the result of such a simulation. The prediction is a linear combination of the predictions for the individual encodings present, in this case mostly RS(5, 3) and R = 3.

Analysis of hypothetical scenarios may also be made with the cell simulator, such as the effect of encoding choice and of chunk recovery rate. Although we may not change the frequency and severity of bursts in an observed sequence, bootstrap methods [13] may be used to generate synthetic failure traces with different burst characteristics. This is useful for exploring sensitivity to these events and the impact of improvements in datacenter reliability.

Figure 11: Unavailability prediction over time for a particular cell for a day with large failure bursts.

7 Markov Model of Stripe Availability

In this section, we formulate a Markov model of data availability. The model captures the interaction of different failure types and production parameters with more flexibility than is possible with the trace-based simulation described in the previous section. Although the model makes assumptions beyond those in the trace-based simulation method, it has certain advantages. First, it allows us to model and understand the impact of changes in hardware and software on end-user data availability. There are typically too many permutations of system changes and encodings to test each in a live cell. The Markov model allows us to reason directly about the contribution to data availability of each level of the storage stack and several system parameters, so that we can evaluate tradeoffs. Second, the systems we study may have unavailability rates that are so low they are difficult to measure directly. The Markov model handles rare events and arbitrarily low stripe unavailability rates efficiently.

The model focuses on the availability of a representative stripe. Let s be the total number of chunks in the stripe, and r be the minimum number of chunks needed to recover that stripe. As described in Section 2.2, r = 1 for replicated data and r = n for RS(n, m) encoded data. The state of a stripe is represented by the number of available chunks. Thus, the states are s, s−1, ..., r, r−1, with the state r−1 representing all of the unavailable states where the stripe has fewer than the required r chunks available. Figure 12 shows a Markov chain corresponding to an R = 2 stripe.

The Markov chain transitions are specified by the rates at which a stripe moves from one state to another, due to chunk failures and recoveries. Chunk failures reduce the number of available chunks, and several chunks may fail 'simultaneously' in a failure burst event. Balancing this, recoveries increase the number of available chunks if any
are unavailable.

Figure 12: The Markov chain for a stripe encoded using R = 2. [States 2, 1, 0; chunk failure transitions move toward the unavailable state 0, chunk recovery transitions move back toward state 2.]

A key assumption of the Markov model is that events occur independently and with constant rates over time. This independence assumption, although strong, is not the same as the assumption that individual chunks fail independently of each other. Rather, it implies that failure events are independent of each other, but each event may involve multiple chunks. This allows a richer and more flexible view of the system. It also implies that recovery rates for a stripe depend only on its own current state.

In practice, failure events are not always independent. Most notably, it has been pointed out in [29] that the time between disk failures is not exponentially distributed and exhibits autocorrelation and long-range dependence. The Weibull distribution provides a much better fit for disk MTTF.

However, the exponential distribution is a reasonable approximation, for the following reasons. First, the Weibull distribution is a generalization of the exponential distribution that allows the rate parameter to increase over time, to reflect the aging of disks. In a large population of disks, the mixture of disks of different ages tends to be stable, and so the average failure rate in a cell tends to be constant. When the failure rate is stable, the Weibull distribution provides the same quality of fit as the exponential. Second, disk failures make up only a small subset of the failures that we examined, and model results indicate that overall availability is not particularly sensitive to them. Finally, other authors ([24]) have concluded that correlation and non-homogeneity of the recovery rate and the mean time to a failure event have a much smaller impact on system-wide availability than the size of the event.

7.1 Construction of the Markov chain

We compute the transition rate due to failures using observed failure events. Let λ denote the rate of failure events affecting chunks, including node and disk failures. For any observed failure event we compute the probability that it affects k chunks out of the i available chunks in a stripe. As in Section 6, for failure bursts this computation takes into account the stripe placement strategy. The rate and severity of bursts, node, disk, and other failures may be adjusted here to suit the system parameters under exploration.

Averaging these probabilities over all failure events gives the probability, p_{i,j}, that a random failure event will affect i−j out of i available chunks in a stripe. This gives a rate of transition from state i to state j < i of λ_{i,j} = λ·p_{i,j} for s ≥ i > j ≥ r, and λ_{i,r−1} = λ·Σ_{j=0}^{r−1} p_{i,j} for the rate of reaching the unavailable state. Note that transitions from a state to itself are ignored.

For chunk recoveries, we assume a fixed rate of ρ for recovering a single chunk, i.e. moving from a state i to i+1, where r ≤ i < s. In particular, this means we assume that the recovery rate does not depend on the total number of unavailable chunks in the cell. This is justified by setting ρ to a lower bound for the rate of recovery, based on observed recovery rates across our storage cells or proposed system performance parameters. While parallel recovery of multiple chunks from a stripe is possible, ρ_{i,i+1} = (s−i)ρ, we model serial recovery to gain more conservative estimates of stripe availability.

As with [12], the distributed systems we study use prioritized recovery for stripes with more than one chunk unavailable. Our Markov model allows state-dependent recovery that captures this prioritization, but for ease of exposition we do not use this added degree of freedom.

Finally, transition rates between pairs of states not mentioned are zero.

With the Markov chain thus completely specified, computing the MTTF of a stripe, as the mean time to reach the 'unavailable state' r−1 starting from state s, follows by standard methods [27].

7.2 Extension to multi-cell replication

The models introduced so far can be extended to compute the availability of multi-cell replication schemes. An example of such a scheme is R = 3 × 2, where six replicas of the data are distributed as R = 3 replication in each of two linked cells. If data becomes unavailable at one cell then it is automatically recovered from another linked cell. These cells may be placed in separate datacenters, even on separate continents. Reed-Solomon codes may also be used, giving schemes such as RS(6, 3) × 3 for three cells, each with an RS(6, 3) encoding of the data. We do not consider here the case when individual chunks may be combined from multiple cells to recover data, or other more complicated multi-cell encodings.

We compute the availability of stripes that span cells by building on the Markov model just presented. Intuitively, we treat each cell as a 'chunk' in the multi-cell 'stripe', and compute its availability using the Markov model. We assume that failures at different data centers are independent, that is, that they lack a single point of failure such as a shared power plant or network link. Additionally, when computing the cell availability, we account for any cell-level or datacenter-level failures that would affect availability.

We build the corresponding transition matrix that models the resulting multi-cell availability as follows. We start from the transition matrices M_i for each cell, as explained in the previous section. We then build the transition matrix for the combined scheme as the tensor product of these, ⊗_i M_i, plus terms for whole-cell failures, and for cross-cell recoveries if the data becomes unavailable in some cells but is still available in at least one cell. However, it is a fair approximation to simply treat each cell as a highly-reliable chunk in a multi-cell stripe, as described above.

Besides symmetrical cases, such as R = 3 × 2 replication, we can also model inhomogeneous replication schemes, such as one cell with R = 3 and one with R = 2. The state space of the Markov model is the product of the state spaces of the cells involved, but may be approximated again by simply counting how many cells of each type are available.

A point of interest here is the recovery bandwidth between cells, quantified in Section 8.5. Bandwidth between distant cells has significant cost, which should be considered when choosing a multi-cell replication scheme.

8 Markov Model Findings

In this section, we apply the Markov models described above to understand how changes in the parameters of the system will affect end-system availability.

8.1 Markov model validation

We validate the Markov model by comparing MTTF predicted by the model with actual MTTF values observed in production cells. We are interested in whether the Markov model provides an adequate tool for reasoning about stripe availability. Our main goal in using the model is providing a relative comparison of competing storage solutions, rather than a highly accurate prediction of any particular solution.

We underline two observations that surface from validation. First, the model is able to capture well the effect of failure bursts, which we consider as having the most impact on the availability numbers. For the cells we observed, the model predicted MTTF with the same order of magnitude as the measured MTTF. In one particular cell, besides more regular unavailability events, there was a large failure burst where tens of nodes became unavailable. This resulted in an MTTF of 1.76E+6 days, while the model predicted 5E+6 days. Though the relative error exceeds 100%, we are satisfied with the model accuracy, since it still gives us a powerful enough tool to make decisions, as can be seen in the following sections.

Second, the model can distinguish between failure bursts that span racks, and thus pose a threat to availability, and those that do not. If one rack goes down, then without other events in the cell, the availability of stripes with R = 3 replication will not be affected, since the storage system ensures that chunks in each stripe are placed on different racks. For one example cell, we noticed tens of medium-sized failure bursts that affected one or two racks. We expected the availability of the cell to stay high, and indeed we measured MTTF = 29.52E+8 days. The model predicted 5.77E+8 days. Again, the relative error is significant, but for our purposes the model provides sufficiently accurate predictions.

Validating the model for all possible replication and Reed-Solomon encodings is infeasible, since our production cells are not set up to cover the complete space of options. However, because of our large number of production cells we are able to validate the model over a range of encodings and operating conditions.

8.2 Importance of recovery rate

To develop some intuition about the sensitivity of stripe availability to recovery rate, consider the situation where there are no failure bursts. Chunks fail independently with rate λ and recover with rate ρ. As in the previous section, consider a stripe with s chunks total which can survive losing at most s−r chunks, such as RS(r, s−r). Thus the transition rate from state i ≥ r to state i−1 is iλ, and from state i to i+1 is ρ for r ≤ i < s.

We compute the MTTF, given by the time taken to reach state r−1 starting in state s. Using standard methods related to Gambler's Ruin [8, 14, 15, 26], this comes to:

    MTTF = (1/λ) · Σ_{k=0}^{s−r} Σ_{i=0}^{k} (ρ^i / λ^i) · 1 / (s−k+i)_{(i+1)}

where (a)_{(b)} denotes the falling factorial (a)(a−1)(a−2)···(a−b+1).

Assuming recoveries take much less time than node MTTF (i.e. ρ >> λ) gives a stripe MTTF of:

    ρ^{s−r} / ( λ^{s−r+1} · (s)_{(s−r+1)} ) + O( ρ^{s−r−1} / λ^{s−r} )

By similar computations, the recovery bandwidth consumed is approximately λs per r data chunks.

Thus, with no correlated failures, reducing recovery times by a factor of µ will increase stripe MTTF by a factor of µ^2 for R = 3 and by µ^4 for RS(9, 4).

Reducing recovery times is effective when correlated failures are few. For RS(6, 3) with no correlated failures, a 10% reduction in recovery time results in a 19% reduction in unavailability. However, when correlated failures
are taken into account, even a 90% reduction in recovery time results in only a 6% reduction in unavailability.

8.3 Impact of correlation on effectiveness of data-replication schemes

Table 3 presents stripe availability for several data-replication schemes, measured in MTTF. We contrast this with stripe MTTF when node failures occur at the same total rate but are assumed independent.

    Policy      (% overhead)   MTTF (days) with       MTTF (days) w/o
                               correlated failures    correlated failures
    R = 2       (100)          1.47E+5                4.99E+05
    R = 3       (200)          6.82E+6                1.35E+09
    R = 4       (300)          1.40E+8                2.75E+12
    R = 5       (400)          2.41E+9                8.98E+15
    RS(4, 2)    (50)           1.80E+6                1.35E+09
    RS(6, 3)    (50)           1.03E+7                4.95E+12
    RS(9, 4)    (44)           2.39E+6                9.01E+15
    RS(8, 4)    (50)           5.11E+7                1.80E+16

Table 3: Stripe MTTF in days, corresponding to various data redundancy policies and space overhead.

Note that failing to account for correlation of node failures typically results in overestimating availability by at least two orders of magnitude, and by eight in the case of RS(8, 4). Correlation also reduces the benefit of increasing data redundancy. The gain in availability achieved by increasing the replication number, for example, grows much more slowly when we have correlated failures. Reed-Solomon encodings achieve similar resilience to failures compared to replication, though with less storage overhead.

8.4 Sensitivity of availability to component failure rates

One common method for improving availability is reducing component failure rates. By inserting altered failure rates of hardware into the model we can estimate the impact of potential improvements without actually building or deploying new hardware.

We find that improvements below the node (server) layer of the storage stack do not significantly improve data availability. Assuming R = 3 is used, a 10% reduction in the latent disk error rate has a negligible effect on stripe availability. Similarly, a 10% reduction in the disk failure rate increases stripe availability by less than 1.5%. On the other hand, cutting node failure rates by 10% can increase data availability by 18%. This holds generally for other encodings.

8.5 Single vs. multi-cell replication schemes

Table 4 compares stripe MTTF under several multi-cell replication schemes and inter-cell recovery times, taking into consideration the effect of correlated failures within cells.

    Policy (recovery time)     MTTF (days)   Bandwidth (per PB)
    R = 2 × 2 (1 day)          1.08E+10      6.8 MB/day
    R = 2 × 2 (1 hr)           2.58E+11      6.8 MB/day
    RS(6, 3) × 2 (1 day)       5.32E+13      97 KB/day
    RS(6, 3) × 2 (1 hr)        1.22E+15      97 KB/day

Table 4: Stripe MTTF and inter-cell bandwidth, for various multi-cell schemes and inter-cell recovery times.

Replicating data across multiple cells (data centers) greatly improves availability because it protects against correlated failures. For example, R = 2 × 2 with a 1 day recovery time between cells has two orders of magnitude longer MTTF than R = 4, shown in Table 3.

This introduces a tradeoff between higher replication in a single cell and the cost of inter-cell bandwidth. The extra availability for R = 2 × 2 with 1 day recoveries versus R = 4 comes at an average cost of 6.8 MB/(user PB) copied between cells each day. This is the inverse MTTF for R = 2.

It should be noted that most cross-cell recoveries will occur in the event of large failure bursts. This must be considered when calculating expected recovery times between cells and the cost of on-demand access to potentially large amounts of bandwidth.

Considering the relative cost of storage versus recovery bandwidth allows us to choose the most cost-effective scheme given particular availability goals.

9 Related Work

Several previous studies [3, 19, 25, 29, 30] focus on the failure characteristics of independent hardware components, such as hard drives, storage subsystems, or memory. As we have seen, these must be included when considering availability, but by themselves are insufficient.

We focus on failure bursts, since they have a large influence on the availability of the system. Previous literature on failure bursts has focused on methods for discovering the relationship between the size of a failure event and its probability of occurrence. In [10], the existence of near-simultaneous failures in two large distributed systems is reported. The beta-binomial density and the biexponential density are used to fit these distributions in [6] and [24], respectively. In [24], the authors further note that using an over-simplistic model for burst size, for example a single size, could result in "dramatic inaccuracies" in practical settings. On the other hand, even
though the mean time to failure and mean time to recov- [11] studied data replication in face of failures, though
ery of system nodes tend to be non-uniform and corre- without considering availability of Reed-Solomon en-
lated, this particular correlation effect has only a limited codings or multi-cell replication.
impact on system-wide availability.
There is limited previous work on discovering patterns of correlation in failures. The conditional probability of failures for each pair of nodes in a system has been proposed in [6] as a measure of correlation in the system. This computation extends heuristically to sets of larger nodes. A paradigm for discovering maximally independent groups of nodes in a system to cope with correlated failures is discussed in [34]. That paradigm involves collecting failure statistics on each node in the system and computing a measure of correlation, such as the mutual information, between every pair of nodes. Both of these approaches are computationally intensive, and the results found, unlike ours, are not used to build a predictive analytical model for availability.

Models that have been developed to study the reliability of long-term storage fall into two categories: non-Markov and Markov models. Those in the first category tend to be less versatile. For example, in [5] the probability of multiple faults occurring during the recovery period of a stripe is approximated. Correlation is introduced by means of a multiplicative factor that is applied to the mean time to failure of a second chunk when the first chunk is already unavailable. This approach works only for stripes that are replicated and is not easily extendable to Reed-Solomon encoding. Moreover, the factor controlling time correlation is neither measurable nor derivable from other data.

In [33], replication is compared with Reed-Solomon encoding with respect to storage requirements, bandwidth for writes and repair, and disk seeks for reads. However, the comparison assumes that sweep and repair are performed at regular intervals, as opposed to on demand.

Markov models are able to capture the system much more generally and can be used to model both replication and Reed-Solomon encoding. Examples include [21], [32], [11] and [35]. However, these models all assume independent failures of chunks. As we have shown, this assumption potentially leads to overestimation of data availability by many orders of magnitude. The authors of [20] build a tool to optimize disaster recovery according to availability requirements, with similar goals to our analysis of multi-cell replication. However, they do not focus on studying the effect of failure characteristics and data redundancy options.

Node availability in our environment is different from previous work, such as [7, 18, 23], because we study a large system that is tightly coupled in a single administrative domain. These studies focus on measuring and predicting availability of individual desktop machines from many, potentially untrusted, domains.

10 Conclusions

We have presented data from Google’s clusters that characterize the sources of failures contributing to unavailability. We find that correlation among node failures dwarfs all other contributions to unavailability in our production environment.

In particular, though disk failures can result in permanent data loss, the multitude of transitory node failures accounts for most unavailability. We present a simple time-window-based method to group failure events into failure bursts which, despite its simplicity, successfully identifies bursts with a common cause. We develop analytical models to reason about past and future availability in our cells, including the effects of different choices of replication, data placement and system parameters.

Inside Google, the analysis described in this paper has provided a picture of data availability at a finer granularity than previously measured. Using this framework, we provide feedback and recommendations to the development and operational engineering teams on different replication and encoding schemes, and the primary causes of data unavailability in our existing cells. Specific examples include:

• Determining the acceptable rate of successful transfers to battery power for individual machines upon a power outage.

• Focusing on reducing reboot times, because planned kernel upgrades are a major source of correlated failures.

• Moving towards a dynamic delay before initiating recoveries, based on failure classification and recent history of failures in the cell.

Such analysis complements the intuition of the designers and operators of these complex distributed systems.

Acknowledgments

Our findings would not have been possible without the help of many of our colleagues. We would like to thank the following people for their contributions to data collection: Marc Berhault, Eric Dorland, Sangeetha Eyunni, Adam Gee, Lawrence Greenfield, Ben Kochie, and James O’Kane. We would also like to thank a number of our colleagues for helping us improve the presentation of these results. In particular, feedback from John Wilkes, Tal Garfinkel, and Mike Marty was helpful. We
would also like to thank our shepherd Bianca Schroeder and the anonymous reviewers for their excellent feedback and comments, all of which helped to greatly improve this paper.

References

[1] HDFS (Hadoop Distributed File System) architecture. http://hadoop.apache.org/common/docs/current/hdfs_design.html, 2009.

[2] Sit, E., Haeberlen, A., Dabek, F., Chun, B.-G., Weatherspoon, H., Morris, R., Kaashoek, M. F., and Kubiatowicz, J. Proactive replication for data durability. In Proceedings of the 5th International Workshop on Peer-to-Peer Systems (IPTPS) (2006).

[3] Bairavasundaram, L. N., Goodson, G. R., Pasupathy, S., and Schindler, J. An analysis of latent sector errors in disk drives. In SIGMETRICS ’07: Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (2007), pp. 289–300.

[4] Bairavasundaram, L. N., Goodson, G. R., Schroeder, B., Arpaci-Dusseau, A. C., and Arpaci-Dusseau, R. H. An analysis of data corruption in the storage stack. In FAST ’08: Proceedings of the 6th USENIX Conference on File and Storage Technologies (2008), pp. 1–16.

[5] Baker, M., Shah, M., Rosenthal, D. S. H., Roussopoulos, M., Maniatis, P., Giuli, T., and Bungale, P. A fresh look at the reliability of long-term digital storage. In EuroSys ’06: Proceedings of the 1st ACM SIGOPS/EuroSys European Conference on Computer Systems (2006), pp. 221–234.

[6] Bakkaloglu, M., Wylie, J. J., Wang, C., and Ganger, G. R. Modeling correlated failures in survivable storage systems. In Fast Abstract at International Conference on Dependable Systems & Networks (June 2002).

[7] Bolosky, W. J., Douceur, J. R., Ely, D., and Theimer, M. Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. In SIGMETRICS ’00: Proceedings of the 2000 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (2000), pp. 34–43.

[8] Brown, D. M. The first passage time distribution for a parallel exponential system with repair. In Reliability and Fault Tree Analysis (1974), Defense Technical Information Center.

[9] Chang, F., Dean, J., Ghemawat, S., Hsieh, W. C., Wallach, D. A., Burrows, M., Chandra, T., Fikes, A., and Gruber, R. E. Bigtable: a distributed storage system for structured data. In OSDI ’06: Proceedings of the 7th Symposium on Operating Systems Design and Implementation (Nov. 2006), pp. 205–218.

[10] Yalagandula, P., Nath, S., Yu, H., Gibbons, P. B., and Seshan, S. Beyond availability: Towards a deeper understanding of machine failure characteristics in large distributed systems. In WORLDS ’04: First Workshop on Real, Large Distributed Systems (2004).

[11] Chun, B.-G., Dabek, F., Haeberlen, A., Sit, E., Weatherspoon, H., Kaashoek, M. F., Kubiatowicz, J., and Morris, R. Efficient replica maintenance for distributed storage systems. In NSDI ’06: Proceedings of the 3rd Conference on Networked Systems Design & Implementation (2006), pp. 45–58.

[12] Corbett, P., English, B., Goel, A., Grcanac, T., Kleiman, S., Leong, J., and Sankar, S. Row-diagonal parity for double disk failure correction. In FAST ’04: Proceedings of the 3rd USENIX Conference on File and Storage Technologies (2004), pp. 1–14.

[13] Efron, B., and Tibshirani, R. An Introduction to the Bootstrap. Chapman and Hall, 1993.

[14] Epstein, R. The Theory of Gambling and Statistical Logic. Academic Press, 1977.

[15] Feller, W. An Introduction to Probability Theory and Its Applications. John Wiley and Sons, 1968.

[16] Ghemawat, S., Gobioff, H., and Leung, S.-T. The Google file system. In SOSP ’03: Proceedings of the 19th ACM Symposium on Operating Systems Principles (Oct. 2003), pp. 29–43.

[17] Isard, M. Autopilot: automatic data center management. ACM SIGOPS Operating Systems Review 41, 2 (2007), 60–67.

[18] Javadi, B., Kondo, D., Vincent, J.-M., and Anderson, D. Mining for statistical models of availability in large-scale distributed systems: An empirical study of SETI@home (2009), pp. 1–10.

[19] Jiang, W., Hu, C., Zhou, Y., and Kanevsky, A. Are disks the dominant contributor for storage failures? A comprehensive study of storage subsystem failure characteristics. In FAST ’08: Proceedings of the 6th USENIX Conference on File and Storage Technologies (2008), pp. 1–15.

[20] Keeton, K., Santos, C., Beyer, D., Chase, J., and Wilkes, J. Designing for disasters. In FAST ’04: Proceedings of the 3rd USENIX Conference on File and Storage Technologies (2004), pp. 59–62.

[21] Lian, Q., Chen, W., and Zhang, Z. On the impact of replica placement to the reliability of distributed brick storage systems. In ICDCS ’05: Proceedings of the 25th IEEE International Conference on Distributed Computing Systems (2005), pp. 187–196.

[22] McKusick, M. K., and Quinlan, S. GFS: Evolution on fast-forward. Communications of the ACM 53, 3 (2010), 42–49.

[23] Mickens, J. W., and Noble, B. D. Exploiting availability prediction in distributed systems. In NSDI ’06: Proceedings of the 3rd Conference on Networked Systems Design & Implementation (2006), pp. 73–86.

[24] Nath, S., Yu, H., Gibbons, P. B., and Seshan, S. Subtleties in tolerating correlated failures in wide-area storage systems. In NSDI ’06: Proceedings of the 3rd Conference on Networked Systems Design & Implementation (2006), pp. 225–238.

[25] Pinheiro, E., Weber, W.-D., and Barroso, L. A. Failure trends in a large disk drive population. In FAST ’07: Proceedings of the 5th USENIX Conference on File and Storage Technologies (2007), pp. 17–23.

[26] Ramabhadran, S., and Pasquale, J. Analysis of long-running replicated systems. In INFOCOM 2006: 25th IEEE International Conference on Computer Communications (2006), pp. 1–9.

[27] Resnick, S. I. Adventures in Stochastic Processes. Birkhauser Verlag, 1992.

[28] Rodrigues, R., and Liskov, B. High availability in DHTs: Erasure coding vs. replication. In Proceedings of the 4th International Workshop on Peer-to-Peer Systems (2005).

[29] Schroeder, B., and Gibson, G. A. Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? In FAST ’07: Proceedings of the 5th USENIX Conference on File and Storage Technologies (2007), pp. 1–16.

[30] Schroeder, B., Pinheiro, E., and Weber, W.-D. DRAM errors in the wild: A large-scale field study. In SIGMETRICS ’09: Proceedings of the Eleventh International Joint Conference on Measurement and Modeling of Computer Systems (2009), pp. 193–204.

[31] Schwarz, T. J. E., Xin, Q., Miller, E. L., Long, D. D. E., Hospodor, A., and Ng, S. Disk scrubbing in large archival storage systems. In International Symposium on Modeling, Analysis, and Simulation of Computer Systems (2004), pp. 409–418.

[32] Schwarz, T. Generalized Reed Solomon codes for erasure correction in SDDS. In WDAS-4: Workshop on Distributed Data and Structures (2002).

[33] Weatherspoon, H., and Kubiatowicz, J. Erasure coding vs. replication: A quantitative comparison. In IPTPS ’01: Revised Papers from the First International Workshop on Peer-to-Peer Systems (2002), Springer-Verlag, pp. 328–338.

[34] Weatherspoon, H., Moscovitz, T., and Kubiatowicz, J. Introspective failure analysis: Avoiding correlated failures in peer-to-peer systems. In SRDS ’02: Proceedings of the 21st IEEE Symposium on Reliable Distributed Systems (2002), pp. 362–367.

[35] Xin, Q., Miller, E. L., Schwarz, T., Long, D. D. E., Brandt, S. A., and Litwin, W. Reliability mechanisms for very large storage systems. In MSS ’03: Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies (2003), pp. 146–156.