Escolar Documentos
Profissional Documentos
Cultura Documentos
Reliability
with Error Budgets
and SRE
Liz Fong-Jones (@lizthegrey)
Staff SRE, Google Cloud
@lizthegrey at #LeadDevNewYork
What worries us
about reliability?
@lizthegrey at #LeadDevNewYork
What is SRE?
1
@lizthegrey at #LeadDevNewYork
Software's
long-term
cost
@lizthegrey at #LeadDevNewYork
Incentives aren't aligned.
Developers Operators
Agility Stability
@lizthegrey at #LeadDevNewYork
Reducing product lifecycle friction
Business Process
Aha! Ka-ching!
@lizthegrey at #LeadDevNewYork
interface DevOps
DevOps 5 key areas
is a set of practices, guidelines 1. Reduce organizational silos
and culture designed to break 2. Accept failure as normal
down silos in IT development, 3. Implement gradual changes
operations, architecture, 4. Leverage tooling and automation
networking and security.
5. Measure everything
@lizthegrey at #LeadDevNewYork
The SRE approach
to operations
Use data to guide decision-making.
Treat operations like a software
engineering problem:
● Hire people motivated and capable
to write automation.
● Use software to accomplish tasks
normally done by sysadmins.
● Design more reliable and operable
service architectures from the start.
@lizthegrey at #LeadDevNewYork
SRE in a nutshell
Site Reliability Engineers develop SRE is a job function, a mindset, and
solutions to design, build, and run a set of engineering approaches to
large-scale systems scalably, running better production systems.
reliably, and efficiently.
We approach our work with a spirit of
We guide system architecture constructive pessimism: we hope for
by operating at the intersection the best, but plan for the worst.
of software development and
systems engineering.
@lizthegrey at #LeadDevNewYork
class SRE implements DevOps
DevOps 5 key areas
is a set of practices, guidelines and 1. Reduce organizational silos
culture designed to break down silos in
2. Accept failure as normal
IT development, operations,
architecture, networking and security. 3. Implement gradual changes
4. Leverage tooling and automation
Site Reliability Engineering 5. Measure everything
is a set of practices we've found to work,
some beliefs that animate those
practices, and a job role.
@lizthegrey at #LeadDevNewYork
Error Budgets
The key principle of SRE
2
@lizthegrey at #LeadDevNewYork
How to measure reliability
Naive approach: Relatively easy to measure for a continuous
good time binary metric e.g. machine uptime
Availability =
total time
= which fraction of time the service
is available and working
Much harder for distributed request/response services
– Is a server that currently does not get requests up or down?
Intuitive for humans – If 1 or 3 servers are down, is the service up or down?
@lizthegrey at #LeadDevNewYork
Allowed unavailability window
Availability
level
per year per quarter per 30 days
@lizthegrey at #LeadDevNewYork
“ 100% is the wrong reliability
target for basically everything.”
Benjamin Treynor Sloss, Vice President of 24x7 Engineering, Google
@lizthegrey at #LeadDevNewYork
Error budgets
● Product management defines an
availability target: How much uptime
the service should have per quarter.
@lizthegrey at #LeadDevNewYork
Benefits of error budgets
Common incentive for devs and SREs Dev team becomes self-policing
Find the right balance between innovation and reliability The error budget is a valuable resource for them
Dev team can manage the risk themselves Shared responsibility for system uptime
They decide how to spend their error budget Infrastructure failures eat into the devs’ error budget
@lizthegrey at #LeadDevNewYork
Allowed unavailability window
Availability
level
per year per quarter per 30 days
@lizthegrey at #LeadDevNewYork
Glossary SLI SLO SLA
of terms
service level service level service level
indicator: a objective: a top-line agreement:
well-defined target for fraction consequences
measure of 'good of good
enough' interactions • SLA = (SLO + margin)
+ consequences = SLI
• used to specify • specifies goals + goal + consequences
SLO/SLA (SLI + goal)
• Func(metric) <
threshold
@lizthegrey at #LeadDevNewYork
SLO definition and measurement
Service-level objective (SLO): a target for SLIs Choosing an appropriate SLO is complex.
aggregated over time Try to keep it simple, avoid absolutes,
perfection can wait.
● Measured using an SLI (service-level indicator)
● Typically, sum(SLI met) / window >= target Why?
percentage ● Sets priorities and constraints
for SRE and dev work
● Sets user expectations about
Try to exceed SLO target, but not by much
level of service
@lizthegrey at #LeadDevNewYork
Product lifecycle
Business Process
Aha! Ka-ching!
@lizthegrey at #LeadDevNewYork
class SRE implements DevOps
DevOps 5 key areas
is a set of practices, guidelines and 1. Reduce organizational silos: Share ownership
culture designed to break down silos in 2. Accept failure as normal: Error budgets
IT development, operations, 3. Implement gradual changes
architecture, networking and security. 4. Leverage tooling and automation
5. Measure everything: Measure reliability
Site Reliability Engineering
is a set of practices we've found to work,
some beliefs that animate those
practices, and a job role.
@lizthegrey at #LeadDevNewYork
The practices of
SRE 3
@lizthegrey at #LeadDevNewYork
Areas of practice
@lizthegrey at #LeadDevNewYork
Monitoring & Alerting
Monitoring: automate Alerting: triggers notification Only involve humans
recording system metrics when conditions are detected when SLO is threatened
@lizthegrey at #LeadDevNewYork
Demand
forecasting and
capacity planning
Plan for organic growth
Increased product adoption and usage by
customers.
@lizthegrey at #LeadDevNewYork
Efficiency and
performance
Capacity can be expensive —> optimize utilization
● Resource use is a function of demand (load),
capacity, and software efficiency
● SRE demands prediction and provisioning,
and can modify the software
@lizthegrey at #LeadDevNewYork
Change management
Roughly 70%1 of outages Mitigations: Remove humans from the loop
are due to changes in a with automation to:
Implement progressive
live system
rollouts • Reduce errors
@lizthegrey at #LeadDevNewYork
Pursuing maximum
change velocity
100% is the wrong reliability target for basically
everything
@lizthegrey at #LeadDevNewYork
Emergency
response
“Things break, that’s life”
Few people naturally react well to emergencies,
so you need a process:
@lizthegrey at #LeadDevNewYork
Incident &
postmortem
thresholds
● User-visible downtime or degradation
beyond a certain threshold
● Data loss of any kind
● On-call engineer significant intervention
(release rollback, rerouting of traffic, etc.)
● A resolution time above some threshold
@lizthegrey at #LeadDevNewYork
Postmortem philosophy
The primary goals of writing a postmortem Postmortems are expected after
are to ensure that: any significant undesirable event
@lizthegrey at #LeadDevNewYork
Blamelessness
● Postmortems must focus on identifying the
contributing causes without indicating any
individual or team
@lizthegrey at #LeadDevNewYork
Toil management/operational
work
Why? What?
Because: Work directly tied to running a service that is:
• O(n) with service growth (grows with user count or service size)
@lizthegrey at #LeadDevNewYork
Team skills
Hire good software engineers (SWE)
and good systems engineers (SE).
Not necessarily all in one person.
Try to get a 50:50 mix of SWE
and SE skillsets on team
Everyone should be able to code.
SE != "ops work"
For more detail, see “Hiring Site Reliability Engineers,” by Chris Jones,
Todd Underwood, and Shylaja Nukala, ;login:, June 2015
@lizthegrey at #LeadDevNewYork
Empowering SREs
● SREs must be empowered to enforce the
error budget and toil budget.
@lizthegrey at #LeadDevNewYork
Recap of SRE practices
@lizthegrey at #LeadDevNewYork
class SRE implements DevOps
DevOps 5 key areas
is a set of practices, guidelines and 1. Reduce organizational silos: Share ownership
culture designed to break down silos in 2. Accept failure as normal: Error budgets & blameless
IT development, operations, postmortems
architecture, networking and security. 3. Implement gradual changes: Reduce cost of failure
4. Leverage tooling and automation: Automate common
Site Reliability Engineering cases
is a set of practices we've found to work, 5. Measure everything: Measure toil and reliability
some beliefs that animate those
practices, and a job role.
@lizthegrey at #LeadDevNewYork
How to get
started 4
@lizthegrey at #LeadDevNewYork
SRE solves ● Effortless scale shouldn't meet escalating
operational demands.
@lizthegrey at #LeadDevNewYork
Do these 1. Start with Service Level Objectives.
SRE teams work to a SLO and/or error
four things. 2.
budget. They defend the SLO.
Hire people who write software.
They'll quickly become bored by performing
tasks by hand and replace manual work.
3. Ensure parity of respect with rest of the
development/engineering organization.
4. Provide a feedback loop for self-regulation.
SRE teams choose their work.
SREs must be able to shed work or reduce
SLOs when overloaded.
@lizthegrey at #LeadDevNewYork
You can do ●
●
Pick one service to run according to SRE model
Empower the team with strong executive
this. ●
sponsorship and support
Culture and psychological safety is critical.
● Measure Service Level Objectives & team health.
● Incremental progress frees time for more
progress.
@lizthegrey at #LeadDevNewYork
Spread the ● Spread the techniques and knowledge once you
have a solid case study within your company
@lizthegrey at #LeadDevNewYork
Cover images used with permission. These books can be found on shop.oreilly.com.
@lizthegrey at #LeadDevNewYork
Questions?
@lizthegrey at #LeadDevNewYork