12th April 2013
DevOps Friday is proudly supported by Server Density, a server and website monitoring tool. We're supporting DevOps Friday in its quest to continually publish bite-size insights and news into the DevOps world. If you like this, please visit Server Density and the Server Density blog.
Contents
application support is perfect for devops
Matt Watson
Finally, organizations can embrace a DevOps approach that improves application support even if they don't have a formal DevOps team. Stackify is the only solution that provides the proper access, tools and intelligence to improve application support efficiency.
growing an ops team from one founder
David Mytton
In the early days of 2009, it was just me running the Server Density monitoring infrastructure. Over the last 4 years the service has grown in terms of team members, data volume, customers and infrastructure so here are a few lessons from scaling the ops team and how things are run.
on software operability
Matthew Skelton
Operability is an engineering term concerning the qualities of a system which make it work well over its lifetime, and software operability applies these core engineering principles to software systems.
why devops matters (to developers)
Benjamin Wootton
DevOps stems from the idea that developers and operations should work more closely together to increase the quality of the systems that we build and operate, but most of the enthusiasm and thought leadership appears to come from the Operations side of the fence.
the state of the art monitoring stack
Sandy Walsh
For many of us starting in this area, our concept of monitors consists of top, some apache, mysql and application log files. We were scared off of monitoring by these old monolithic products that required huge licensing fees and armies of professional services people. Thankfully, times have changed.
the benchmark you're reading is probably wrong
RethinkDB Team
Unfortunately, most benchmarks published online have crucial flaws in the methodology, and since many people make decisions based on this information, software vendors are forced to modify the default configuration of their products to look good on these benchmarks.
Around a year ago, I started a small newsletter summarising the best DevOps-related links of the week. Since that time, interest has continued to grow in DevOps. Fantastic blog posts and articles are coming out each week, advancing the state of the art. Conferences are generating massive amounts of interesting content and discussion. Discussion on Twitter and on the podcasts is entertaining and educational. Considering this, I want to take DevOps Friday to the next level, using it as a hub to capture and communicate the best content of each week. If you like this issue, please consider sharing with friends and colleagues and encouraging them to sign up at the DevOps Friday home page.
- Keith and Mario's Guide to Fast Websites
- MongoDB Large Scale Data Centric Applications
- Hiring for the DevOps Toolchain: The Need for Generalists
- How Badly Set Goals Create a Tug-of-War in Your DevOps Organization
- Treating Servers as Cattle, Not as Pets
- Achieving Awesomeness with Opscode Chef (Part 2)
- Making A Point With SLAs
- Amazon Cloud: A River Runs Through It
- Using Message Queues in Cloud Applications
- Are You Unknowingly Replicating Your Failure as a DBA?
Application Support is Perfect for DevOps
Matt Watson @stackify
When DevOps emerged in 2009, the gap between development and operations teams finally started to get the kind of media and vendor attention it deserved. DevOps gets developers more involved in IT operations so they can more rapidly resolve software issues that arise after deployment. Without access to production applications and servers, even development managers and system admins need help identifying and solving problems, which is horribly inefficient.

Some of us have been doing DevOps since before it had a name. At my last company, the lead developers were heavily involved in hardware purchases, setting hardware up, deploying code, monitoring systems and much more. The problem was that only three of the 40 developers had production access. The chosen three (including me) spent an inordinate amount of time helping others troubleshoot and fix application bugs. While I didn't trust the junior developers with the keys to the kingdom, I nevertheless would have preferred them to have the ability to fix their own bugs. Because our application support processes weren't very efficient, I wasted a lot of my own time fixing bugs instead of building new features.

Later, I started Stackify because I believe that more developers should be involved in production application support. That way, a couple of employees like the three of us at my old job don't become a bottleneck. Meanwhile, junior developers, QA and even less technical support people can get server access to view log files and other basic troubleshooting information. Sadly, in most companies today, the lead developer or system admin ends up
tracking down a log file or finding some minor bug in another developer's app when they should be working on more important projects. Developers should be more involved in the design and support of the infrastructure our applications rely on, since we are ultimately responsible for the applications we create. We should be able to deploy our applications, monitor production systems, ensure everything is working properly, and be held responsible when our applications fail in production.

Finding and fixing bugs is often more difficult than it sounds, however. Just think for a moment: what do your developers need access to? If your team is anything like mine was, they need:
- A database of application exceptions
- Application and server log files
- Windows Event Viewer
- Application and server config files
- SQL databases to test queries
- Scheduled jobs history
- Server monitoring tools
- Performance monitoring tools
and the list goes on.
As nice as it sounds, giving developers access to the information they need has historically been difficult because:
- The data resides in many locations
- Too many tools exist to access different types of information
- It can be difficult or impossible to control access rights and protect sensitive data
- Developers should be prevented from making changes
- It is difficult or impossible to audit what developers access
When a developer is trying to fix a bug, nothing is more frustrating than lacking the details necessary to reproduce or fix the problem. Troubleshooting application problems can require access to a lot of information, which in turn involves a lot of screens and a lot of logins. Imagine getting all the information you need in a single screen, and then having the ability to drill down into it with a couple of mouse clicks.

To overcome the challenges outlined in this post, my team and I at Stackify built a solution that gives developers access to all the information they need to provide effective application support. It also solves the problems that have prevented such information sharing in the past. With Stackify, you can eliminate bottlenecks in development teams and scale application support teams without additional head count. Finally, organizations can embrace a DevOps approach that improves application support even if they don't have a formal DevOps team. Stackify is the only solution that provides the proper access, tools and intelligence to improve application support efficiency.
Growing an Ops Team from One Founder
David Mytton

In the early days of 2009, it was just me running the Server Density monitoring infrastructure. The service came out of beta in the summer and immediately had a few paying customers, which helped to fund the rental of a couple of slices from Slicehost (fancy VPSs). The volume of traffic, simplicity of the service components and small number of servers meant that there were few problems. Over the last 4 years the service has grown in terms of team members, data volume, customers and infrastructure, so here are a few lessons from scaling the ops team and how things are run.
Dealing with communicating with customers, fixing problems and recovering data can be exhausting, especially when there's nobody else to help. The ultimate goal is to build your team so that shift-based on-call cover can be provided, but it's difficult in the beginning with limited resources (both for people and multi-geographic redundancy).

Nobody is as invested in your service as you and your team. Although services like Rackspace's support are helpful in certain situations, they're never able to know the full story behind your service and how to deal with complex components. For example, MongoDB was a completely new database and didn't have single server durability for some time: a bad shutdown could require a lengthy database repair, so it was important to take steps to avoid one, such as properly shutting the database down before powering off the server. Knowing about the weaknesses and how to deal with them requires a greater knowledge of your setup than basic vendor support is going to provide. These things should be a stopgap for, or a supplement to, the end goal of growing your own team.

The whole point of devops is that it's a mixture of engineering and operations, so you don't need to hire dedicated sysadmins. This works well for small startup teams, but you will eventually want someone (or multiple people) who are responsible for the day-to-day operations. Engineers still engage with the team, can deploy, work on testing and debug problems, but things like dealing with a failed disk drive or implementing backups are really outside the remit of devops in a large team. You know you're there when you can start hiring site reliability engineers!

There are also things that need doing to run operations that could be outsourced to (trusted) individuals on an ad-hoc basis. Engineers are terrible at valuing their own time and often use the argument: why pay for something I could build/install/configure myself? Candidates for this are things like running through PCI compliance checklists, setting up centralised logging, reorganising servers (e.g. upgrading base OSs), researching CDN providers, integrating CI tools, etc. You always want someone technical managing the project to keep things on track and validate the end results, but these are things you don't need to do yourself.

You have to consider who will take the call when things break:
- How quickly can they get to a computer they can use to fix things? Are you out drinking on a Friday?
- What happens if someone falls ill? This could be a minor cold or a major emergency, and it could affect the individual engineer or their family members.
- Does the on-call engineer have enough phone battery? Can they hear their ringtone?
- Who is backup if the primary doesn't pick up?
This is especially the case with outages. They often happen at inconvenient times, and big incidents might require you to work for significant periods of time.

Hack traveling

As part of the founding team, and even as an engineer, you're likely to have to travel at some point: to conferences, meeting customers, pitching vendors or maybe on holiday! It's relaxing to be uncontactable on the plane, but it's also scary because you have no idea if everything is still running. On one of my trips to Japan, as soon as I stepped off the 12-hour flight to Tokyo Narita, I had a flood of SMS alerts: one of our MongoDB servers had encountered a problem 4 hours previously. One of our engineers had been assigned on-call for my flight and had already worked with the guys at 10gen and resolved the problem. You'll realise you become a slave to connectivity, so trips to Japan are fine, but Tajikistan isn't really an option. You need to be able to get internet access and power anywhere you are: tricks such as visiting Starbucks, carrying external hotspots and not running things like updates when you're away all help!
On Software Operability
Matthew Skelton @matthewskelton
Matthew has been building, deploying and operating commercial software systems for over 13 years. He has engineered software systems for organisations in finance, insurance, pharmaceuticals, travel and media, as well as for MRI brain scanners and oil and gas exploration. He looks after build and deployment at thetrainline.com, the UK's leading rail ticket vendor, which operates one of the country's busiest web infrastructures.
In oil and gas exploration, it can be cheaper to drill a new oil well than to extract a faulty down-hole pressure gauge; these systems had to operate reliably with minimal human intervention. Since then I have too often seen the negative effects of operational features being dropped before go-live, which usually results in significant operational costs and more incidents in Production. There is no good reason in 2013 why businesses should put up with (and pay for) second-rate software which needs arduous human attention every few hours or days just in order to maintain normal operation. In my experience, most modern business software is simple enough (at a systems level at least) that we can significantly reduce operational cost and downtime by introducing software operability as a key concern for software product delivery teams. Ultimately, it's about lower cost of ownership, better engineering, and fewer late nights debugging flaky software!
What are some of the low hanging fruit a software team can tackle to make their software more operable?

The best thing a software team can do to make their software more operable is to write a draft operation manual alongside feature development. The operation manual (aka run book) eventually contains the full details of how the software system is operated in Production. By writing a draft operation manual, the software team can demonstrate to the operations folks either that all the major operability concerns have been addressed, or that some operability criteria are beyond the expertise of the software team; at least there will be no nasty surprises when the software is put into operation. The act of having to think about things like backups, time changes, health checks, and clear-down steps in the context of their software tends to mean that the software team members will implement small but crucial changes to the software to provide hooks for monitoring, alerting, backups, failover, etc., which improve the operability of the software.

How can teams prioritise operational requirements when the focus is usually on features?

I think one of the most important changes to make is to stop using the term non-functional requirements for things like performance and stability requirements; instead, use the term operational requirements, or even better, operational features, and include these in the product backlog alongside end-user features. This gets away from the artificial (and unhelpful) contrast of functional vs non-functional requirements, and helps to communicate to the business that the operational aspects of the software also require specific features if the business requirements are to be met.

A useful approach (discussed at the excellent DevOpsDays 2013 event in London) is to make the product owner responsible not only for feature delivery but also for the operational success of the software; after a few early morning Priority 1 call-outs due to the application servers needing a restart, the product owner will probably start to realise the importance of operational features! Making any operational problems more visible is also crucial. If the operations team needs to restart the app servers every night, make this visible, and include the product owner or business sponsor in the email notifications every day. Draw analogies with systems familiar to the product owner: if they had to have their car fixed by a mechanic every two days, they'd soon either buy a new car or pay to have the faulty part replaced. So don't hide the effort which you're expending on keeping their software product running; make sure they see the cost (and the pain!).
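The operability hooks mentioned above can be very small. As a sketch (not from any particular product; the handler, port and check names here are illustrative assumptions), a minimal HTTP health-check endpoint in Python that a monitoring system could poll might look like:

```python
from http.server import BaseHTTPRequestHandler, HTTPServer
import json

def check_database():
    # Illustrative stub: a real check would run a cheap query
    # against the live database connection and catch failures.
    return True

def health_report():
    """Gather each subsystem check into one JSON-friendly report."""
    checks = {"database": check_database()}
    return all(checks.values()), checks

class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        healthy, checks = health_report()
        body = json.dumps(checks).encode()
        # 200 when everything passes; 503 lets load balancers and
        # monitors react automatically when a dependency is down.
        self.send_response(200 if healthy else 503)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

# To serve it: HTTPServer(("", 8000), HealthHandler).serve_forever()
```

Writing even this much forces the team to decide what "healthy" means for their system, which is exactly the kind of thinking the draft operation manual provokes.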
Richard Crowley's talk on Developing Operability at SuperConf 2012 is also worth reading and understanding. I recently began a blog at softwareoperability.com which I plan to turn into a book in late 2013 or early 2014 to help software teams get to grips with software operability. It's worth saying that teams with a DevOps approach will generally produce systems with better operability than teams split into the traditional Dev and Ops silos. I'm approaching software operability from this siloed world of Dev and Ops, mainly because this is where most organisations still are today, and in fact,
I hope that by gaining a better understanding of software operability, many engineering teams will move instinctively towards a DevOps model.
More info can be found at softwareoperability.com, @Operability and #operability on Twitter.
Why DevOps Matters (to Developers)
Benjamin Wootton

DevOps stems from the idea that developers and operations should work more closely together, communicating, knowledge sharing, and collaborating, to increase the quality of the systems that we build and operate.

Though DevOps is an idea that is finding a lot of success and adoption, most of the enthusiasm and thought leadership appears to come from the Operations side of the fence. This is of course understandable. With Operations teams being on the front line and talking to end users daily, they have an obvious motivation not to upset customers through downtime, and an obvious personal motivation to avoid fire-fighting issues in favour of working on higher value projects. However, as a developer who has always worked at this intersection of the two teams, I feel that developers should also sit up and give more credence to what is coming out of the DevOps community. By opening up communication paths and adopting Operations-like skills and mindsets, we can likely all benefit, both as individuals and as teams, and in the quality of software that we deliver. Here are some of the reasons why I think this is the case:

DevOps rightly places site reliability front and center, and almost all developers will benefit from this mindset. This focus on site reliability might mean that sometimes we churn out fewer pure lines of code in a day, but it means that we move forward more predictably and reliably, keeping the system stable and available.

End users won't care whether an outage was caused by a software bug, a hardware outage or a failed rollout. They won't care if it was a human error or some arbitrary combination of events. All they care about is that they can't use the system as intended. This might be a product of the systems on which I've worked, but with good unit and integration testing and good QA testing, it is possible to catch most software bugs that would impact a large percentage of the user base. However, where things more typically go wrong is when the system comes into contact with the real world. For instance, we might find our code performs badly under real world load, that a disk fills up, or that users use the application in a way we didn't anticipate, as they are prone to do! A DevOps-oriented developer or team has a much more stringent focus on these issues and general site reliability. They'll not only test their code; they will think about failure scenarios and mitigate them before code is even released. They'll think carefully about detailed testing of their features to minimise the risk of them impacting the broader production system. They will plan and stage their upgrades to de-risk releases, and always have a rollback strategy. They will talk regularly with operations to ensure that they are taking into account their experience with keeping the site available. In short, they take responsibility for how their code behaves in production, not just for how it is written.

Increasingly, the most valued developers are no longer simply those with superior coding skills but without the same degree of production awareness. I believe that as a result of DevOps and other trends, this will continue, i.e. that the best developers will increasingly be those who are the most operationally aware, who can code but also have the knowledge, skills and experience to reliably deliver a working production system over the long haul. This is particularly true in these tough and resource constrained economic times. With fewer people having the luxury of saying it's not my job, the generalist will get ahead. (I guess for some people, such as those in startups and small companies, it was ever thus, with developers pitching in on operations type stuff such as deployments and upgrades.)

Development, test and production environments should all be in line, and all infrastructure changes should also be versioned and tested alongside the code assets. Doing this well removes so many unknowns and can lead to massive improvements in efficiency and quality of software development.
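Tools like Puppet and Chef achieve versioned, testable infrastructure by describing desired state idempotently: applying the same change twice changes nothing the second time. A toy sketch of that idea in Python (an illustration of the principle, not any tool's real API):

```python
import os

def ensure_line(path, line):
    """Converge a config file towards desired state: make sure `line`
    is present, and report whether anything had to change."""
    existing = []
    if os.path.exists(path):
        with open(path) as f:
            existing = f.read().splitlines()
    if line in existing:
        return False  # already converged; running again is a no-op
    with open(path, "a") as f:
        f.write(line + "\n")
    return True  # a change was applied
```

Because the desired state is just code, it can live in version control and run through CI alongside the application, which is exactly the alignment described above.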
The State of the Art Monitoring Stack
Sandy Walsh @TheSandyWalsh
Based in Nova Scotia, Canada, Alex "Sandy" Walsh is the owner of Dark Secret Software. He has been a senior professional developer for nearly 20 years and a Pythonista for 10 years. He is currently a developer on the OpenStack project with Rackspace. You can learn more about him at sandywalsh.com or follow @TheSandyWalsh.
Last week I had the pleasure of attending the first annual Monitorama conference. This was a conference aimed at advancing the state of open source monitoring and trending software.
For many of us starting in this area, our concept of monitors consists of top, some apache, mysql and application log files, and perhaps an external ping service that tells us when our web site is unavailable. Anything beyond that generally ran into the commercial product realm. We were scared off of monitoring by these old monolithic products that required huge licensing fees and armies of professional services people.

Thankfully, times have changed. And our application footprint has grown. No longer are we just deploying web servers and databases. Our application stack starts with our automated testing framework and runs through continuous integration and continuous deployment. Jenkins, Travis, Puppet/Chef, etc... they're all critical. It also includes our deployment partners... that army of SaaS applications we use to make our life easier. Any SaaS solution worth its salt has a status API available for tracking availability. Our monitoring needs are now wide and diverse.

My first exposure to the next generation of monitoring tools came with the awesome Etsy post Measure Anything, Measure Everything.
The concept of "Measure Everything" wasn't new to me. I'd been working on StackTach for OpenStack around the same time and understood the value of getting a visual representation of the internals of an application. Even from my old management days we used to say you can't manage what you can't measure. I lived this with my Google Analytics experiences from running various web sites, and my software development management interests were aiming towards Six Sigma techniques over the hand-wavey agile methods. Essentially, numbers are good. But this was giving us a way to apply those same measurement techniques to running software. It was a lens into the black box. Could the days of parsing log files be over?

The first generation of these new monitoring tools included Zenoss, Nagios, RRDtool, Cacti, Munin and Ganglia, to name a few. They were built out of necessity and often have some really nasty warts that people just hate. The latest generation of tools has learned from their mistakes. The Etsy tool chain started with statsd and graphite. This introduced me to the concept of using UDP packets for instrumenting running applications... which was pretty brilliant.

For those unfamiliar with statsd and graphite, here's the flow: your application wants to measure something, so it sends a UDP packet to the statsd service. UDP packets are lossy and unreliable but fast for large amounts of data; most large video networks send via UDP packets. statsd is a node.js in-memory data aggregator (it accumulates received data and every so often sends it to graphite). graphite is a django app that archives received data and gives a funky web interface for presenting and querying the data.

There are a number of cool things happening here:
- Adding statsd integration to an existing application is very easy. No special libraries are needed, and sockets are available in nearly all languages.
- Since statsd uses UDP, there is very little risk of the production application crashing if statsd fails. The packets just get lost.
- Since statsd is in-memory, it can process a lot of data very quickly. But rather than take on the task of archiving and disk access, it simply forwards the results to something that can do it better.
- graphite has an easy REST interface, which makes it easily accessible by technical product managers to create their own dashboards and status reports.
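The statsd wire format is just name:value|type in a single datagram, so a client needs only a few lines. A minimal sketch in Python (8125 is statsd's conventional port; the metric names are made up for illustration):

```python
import socket

def send_stat(name, value, kind="c", host="127.0.0.1", port=8125):
    """Fire-and-forget a statsd-style metric over UDP. If nothing is
    listening, the datagram is simply lost; the app never blocks."""
    payload = f"{name}:{value}|{kind}".encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
    return payload  # returned so callers can see what went on the wire

# A counter increment and a timer sample:
send_stat("app.logins", 1)               # "app.logins:1|c"
send_stat("app.db.query_ms", 320, "ms")  # "app.db.query_ms:320|ms"
```

This is the property described above: instrumentation needs no special library, and a dead collector can never take the production application down with it.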
Side note: if your application is written in python and you want to experiment with this stuff without touching your existing code base, have a look at the Tach application. This monkeypatches your python application and sends the output to statsd or graphite directly. Pretty cool. Although it was originally written for use with OpenStack, it can work with anything.

But the real insight here is a set of atomic, well focused tools that could be put together to create a monitoring stack. The tool chest of the DevOps team just expanded.

As our experience with statsd and graphite grew within the company, we also saw where the monitoring stack failed. A UDP-based approach won't work for billing or auditing. For those scenarios you need a reliable transport for events. In OpenStack we publish event notifications to AMQP queues for consumption by various other tools. These are important events, often with large payloads. When the StackTach application is unavailable these queues can grow very quickly, and we don't want to drop events. This is manageable for something like OpenStack Compute, but other applications like Storage produce an incredible amount of data across a wide range of servers; using a notification-based system would be difficult. Instead we needed to look at syslog-based archiving and processing solutions. The new monitoring stack offers tools like LogStash and, in the OpenStack case, Slogging.

Then there is the post-processing. To add value to the raw events we often need to apply other functions to the data, such as time series averaging. This can be tricky. We need to wait for all the collected data to arrive before we can start the post-processing, and we may need to ensure proper ordering. Historically this would be done with cron jobs and batch processing, but the new monitoring stack includes tools like Riemann which can do this post-processing inline.
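The kind of inline post-processing described above, a moving average over arriving samples for instance, can be sketched in a few lines of Python (a toy stand-in for what a stream processor like Riemann applies as events flow through):

```python
from collections import deque

def rolling_average(samples, window=3):
    """Smooth a time series by averaging each point with up to
    `window - 1` of its most recent predecessors."""
    buf = deque(maxlen=window)  # old samples fall off automatically
    out = []
    for s in samples:
        buf.append(s)
        out.append(sum(buf) / len(buf))
    return out

# Raw per-minute request counts smoothed over a 3-sample window:
rolling_average([120, 130, 260, 125], window=3)
```

Doing this inline, as each sample arrives, avoids the wait-for-everything batch step that cron-based processing imposes.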
It seems evident that Nagios isn't going anywhere any time soon, but there are some other tools offering alternatives, such as Shinken and Sensu.

Recently our team has been working on bringing what we've learned with StackTach to the OpenStack-blessed monitoring solution called Ceilometer. Without standing back and looking at the larger monitoring community, it would have been very easy to want to recreate an entire monitoring stack on our own. But now it's clear that we can focus on the minimal set of missing functionality and augment that from an already powerful set of tools. This is a very attractive proposition for one simple reason: the project has an end in sight. There are lots of fun problems out there to tackle, and knowing you don't have to reinvent the wheel is very compelling.

There is a cost, though. The monitoring stack today consists of a variety of tools, all written in different languages and each with different care and feeding instructions. One could argue that the workload on operations will only increase by mixing and matching. My knee-jerk reaction is to agree, but I know that the greater win is to get familiar with all of these new tools. In production, these monitoring tools need monitoring as well. So we may have to monitor Java, Ruby, Python and C# VMs running bytecode from a potential variety of languages. If this all seems too daunting, perhaps the hosted offerings are a better choice for you. For nearly every open source offering there is a hosted offering. Look at loggly, papertrail, pagerduty, librato, datadog, hostedgraphite, boundary, new relic, etc.

This brings me back to Monitorama. The Monitorama conference had a format that worked very well for me, for the following reasons:
- It was only two days long.
- Day One focused on hearing about the state of the art from industry leaders.
- Day Two was tactical with the tools and included a hackathon, which let you understand where the real-world pain lived in each of these components.
- It was small enough that you could actually talk to people and have meaningful conversations.
The Day One talks made it clear that Alert Fatigue (a term borrowed from the medical industry) is a big problem: too many alerts hitting our inbox. Some are important, most are noise. There are people working on it, but it's perhaps the biggest source of angst for operations currently.

Side story: for the hackathon I started work on a tool that allowed members of the company to track external events that might affect production. Things like sales events, big holidays, new customer deployments, or internal events such as new code deployments, hardware upgrades, etc. The idea was to have these events show up on the spikes in the dashboard graphs so we could say That spike was due to Foo and that ravine was due to Blah. I made some good progress for the day, and then one of the other attendees showed me his side project, Anthracite, which does all this and more. The author was sitting in the room next to me. What are the odds?

For a while I was getting disillusioned with this space because I saw it dominated by commercial solutions, or felt that the problem was so big it would be a lifetime of work to build as open source. But now I see there are viable open source components, and there is enough of the stack available that we can focus on some of the smaller missing pieces. Also, there is a smart community out there facing the exact same problems and actively working on solutions. There is a light at the end of the tunnel. I may not attend Monitorama EU, but definitely the US one next year. But for now, I've got some products to learn.
The Benchmark You're Reading is Probably Wrong
RethinkDB Team @RethinkDb
The RethinkDB team is working on a scalable, open-source, distributed document database system that features a pleasant query language, parallelized architecture, and table joins. You can learn more at rethinkdb.com.
Mikeal Rogers wrote a blog post on MongoDB performance and durability. In one of the sections, he writes about the request/response model, and makes the following statement: MongoDB, by default, doesn't actually have a response for writes. In response, one of 10gen's employees (the company behind MongoDB) made the following comment on Hacker News: We did this to make MongoDB look good in stupid benchmarks.

The benchmark in question shows a single graph, which demonstrates that MongoDB is 27 times faster than CouchDB on inserting one million rows. At first glance, the benchmark immediately looks silly if you've ever done serious benchmarking before. CouchDB people are smart, inserting such a small number of elements is a relatively simple feature, and it's almost certain that either they would have fixed something that simple or they had a very good reason not to (in which case the benchmark is likely measuring apples and oranges).

Let's do some back of the envelope math. Roundtrip latency on a commodity network for a small packet can range from 0.2ms to 0.8ms. A single rotational drive can do 15000RPM / 60sec = 250 operations per second (resulting in close to 5ms latency in practice), and a single Intel X25-M SSD drive can do about 7000 write operations per second (resulting in close to 0.15ms latency). The benchmark demonstrates that CouchDB takes an average of 0.5ms per document to insert one million documents, while MongoDB does the same in 0.01ms. Clearly the rotational drives are too slow to play a part in the measurement, and the SSD drives are probably too fast to matter for CouchDB and too slow to matter for MongoDB. However, CouchDB appears to be awfully close to commonly encountered network latencies, while MongoDB inserts each document 50 times faster than commodity network latency.

At first observation, it appears likely that the CouchDB client library is configured to wait for the socket to receive a response from the database server before sending the next insert, while the MongoDB client is configured to continue sending insert requests without waiting for a response. If this is true, the benchmark compares apples and oranges and tells you absolutely nothing about which database engine is actually faster at inserting elements. It doesn't measure how fast each engine handles insertion when the dataset fits into memory, when the dataset spills onto disk, or when there are multiple concurrent clients (which is a whole different can of worms).
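The back-of-the-envelope arithmetic above, laid out explicitly (all figures are the article's own):

```python
# A 15,000 RPM rotational drive: roughly one operation per revolution.
rotational_ops_per_sec = 15000 / 60                    # 250 ops/sec
rotational_latency_ms = 1000 / rotational_ops_per_sec  # 4 ms (~5 ms in practice)

# An Intel X25-M SSD: about 7,000 write ops/sec.
ssd_latency_ms = 1000 / 7000                           # ~0.14 ms per write

# The benchmark's per-document insert times, against network latency.
network_roundtrip_ms = (0.2, 0.8)  # commodity network round-trip range
couchdb_ms_per_doc = 0.5           # sits inside that round-trip range
mongodb_ms_per_doc = 0.01          # far below any round trip
ratio = couchdb_ms_per_doc / mongodb_ms_per_doc  # 50x apart
```

That CouchDB's figure lands inside the network round-trip range while MongoDB's is far below it is what suggests one client waited for responses and the other did not.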
It doesn't even begin to address the more subtle issues of whether the potential bottlenecks for each database might reside in the virtual memory configuration, or the file system, or the operating system I/O scheduler, or some other part of the stack, because each database uses each one of these components slightly differently. What the benchmark likely measures is something that is never mentioned: the latency of the network stack for CouchDB, and something entirely unrelated for MongoDB.

Unfortunately, most benchmarks published online have similar crucial flaws in the methodology, and since many people make decisions based on this information, software vendors are forced to modify the default configuration of their products to look good on these benchmarks. There is no easy solution: performing proper benchmarks is very error-prone, time-consuming work. It's good to be very skeptical about benchmarks that show a large performance difference but don't carefully discuss the methodology and potential pitfalls. As Brad Pitt's character says at the end of Inglourious Basterds,