For reliable applications, the web giant relies on ‘site reliability engineering’: DevOps with an engineering foundation.
Everyone wants to do DevOps these days, but what does a well-humming DevOps environment really look like? What is the vision to strive for?
To date, DevOps has popped up as a hodgepodge of activities and initiatives across companies, perhaps adopted simply because it’s the thing to do. As a result, companies really aren’t seeing the full potential of DevOps. Consistency in DevOps approaches is a rare thing indeed, relates Kurt Marko, citing a recent survey from Computer Economics. The survey finds “that although about a third of organizations dabble in DevOps, almost none do so formally and consistently across the organization or show any semblance of mastering of DevOps practices.”
That’s because DevOps takes more than telling everyone to get together and throwing some new tools into the mix. As Marko puts it: “DevOps is like dieting: it requires changes in values, attitudes, processes and habits. Such changes are hard and must be practiced, not bought. They require education and discipline, not a purchase order.”
The folks at Google/Alphabet, being the trailblazers they always are, are sharing their vision and experience with what they call “site reliability engineering” (SRE), providing some examples of how well-tuned teams of developers and ops people can work together to make things happen. Rest assured, it is baked deeply into the Google culture.
To clarify, SRE is somewhat different from DevOps, but the two are joined at the hip. “Interestingly, the SRE movement emerged separately from the DevOps movement–although there is little doubt that they are part of the same IT spectrum with similar customer value-driven goals,” Jayne Groll observes on DevOps.com. “DevOps focuses on engineering continuous delivery to the point of deployment; SRE focuses on engineering continuous operations at the point of customer consumption. Both domains rely on sharing, culture, metrics and automation. Both require human and automated resources to ensure a seamless value stream and exceptional customer experience.”
A great illustration is provided by Patrick Hill, site reliability engineer with Atlassian:
“Dev teams want to release awesome new features to the masses, and see them take off in a big way. Ops teams want to make sure those features don’t break things. Historically, that’s caused a big power struggle, with Ops trying to put the brakes on as many releases as possible, and Dev looking for clever new ways to sneak around the processes that hold them back. SRE removes the conjecture and debate over what can be launched and when. It introduces a mathematical formula for green- or red-lighting launches, and dedicates a team of people with Ops skills (appropriately called Service Reliability Engineers, or SRE’s) to continuously oversee the reliability of the product.”
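Hill does not spell out the formula here. One widely described SRE approach is the error budget: a service with an availability target (its service level objective, or SLO) of, say, 99.9% is allowed to fail 0.1% of the time, and launches get a green light only while some of that allowance is left. The sketch below is a minimal illustration under that assumption, not Google’s or Atlassian’s actual tooling; the function names and the sample numbers are hypothetical.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Failed requests the service can tolerate in the measurement window.

    slo is the availability target as a fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    return (1.0 - slo) * total_requests


def can_launch(slo: float, total_requests: int, failed_requests: int) -> bool:
    """Green-light a launch only while failures are under the budget."""
    return failed_requests < error_budget(slo, total_requests)


# Hypothetical month: one million requests against a 99.9% target,
# which leaves a budget of roughly 1,000 failed requests.
print(can_launch(0.999, 1_000_000, 400))    # budget left: launch allowed
print(can_launch(0.999, 1_000_000, 1_500))  # budget spent: launch blocked
```

The appeal of a rule like this is that Dev and Ops argue about one number agreed in advance, the SLO, instead of about each individual release.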
In their latest book and videos on the topic, Betsy Beyer, Chris Jones, Jennifer Petoff and Niall Murphy, all with Google, unveil what they have been doing and provide lessons from which every non-Google enterprise can learn. “For sizes between a startup and a multinational, there probably already is someone in your organization who is doing SRE work, without it necessarily being called that name, or recognized as such,” they point out.
SRE “represents a significant break from existing industry best practices for managing large, complicated services,” Beyer and her associates write, noting that it grew out of how a software engineer would choose to invest time in a set of repetitive tasks. Since then, “it has become much more: a set of principles, a set of practices, a set of incentives, and a field of endeavor within the larger software engineering discipline.”
The Google team explains that they “apply the principles of computer science and engineering to the design and development of computing systems: generally, large distributed ones.” Their tasks range from “writing the software for those systems alongside our product development counterparts” to building pieces such as “backups or load balancing,” or simply “figuring out how to apply existing solutions to new problems.”
SREs have three missions: reliability, features and operating services.
- Reliability: Reliability is the top priority for SREs. The Google team cites the words of Google’s Ben Treynor Sloss, originator of the term SRE: “Reliability is the most fundamental feature of any product: a system isn’t very useful if nobody can use it.”
- Features: Once suitable levels of reliability are attained, SREs turn their effort to adding features and building new products.
- Operating services: “Finally, SREs are focused on operating services built atop our distributed computing systems, whether those services are planet-scale storage, email for hundreds of millions of users, or where Google began, web search.”