Site Reliability Engineering

Organizations big and small have started to realize just how crucial system and application reliability is to their business. They’ve also learned just how difficult it is to maintain that reliability while iterating at the speed demanded by the marketplace.

Site Reliability Engineering (SRE) is a proven approach to this challenge. This module will introduce you to the principles and practices of SRE. If you’ve had any operations experience (as a sysadmin, IT pro, DevOps practitioner, etc.) or even interest, SRE will prove to be a fascinating subject.

At the end of this module, you’ll have a good understanding of what SRE is and why it matters. You’ll be able to talk to other people about where it came from and how it relates to other operations practices like DevOps. Along the way, you’ll grasp the core principles and some of the practices that help implement these principles. We’ll end with some suggestions on how you can get started with this valuable operations practice.

What is SRE and why does it matter?

The best place to start is often the beginning, so let’s ask the basic question: “What is Site Reliability Engineering?” There are a number of answers to this question floating around, including the often-quoted one from the person who coined the term (Ben Treynor Sloss at Google), but this is the most practical answer we can offer:

Site Reliability Engineering is an engineering discipline devoted to helping an organization sustainably achieve the appropriate level of reliability in their systems, services, and products.

Later on, we may bring some other definitions into the picture, but let’s start from here. There are three crucial parts to this definition we need to unpack that will lead us right to the “Why does it matter?” question.

Reliability

At the very heart of SRE (and smack in the middle of the name) is the word Reliability. The definition doesn’t say “appropriate level of performance” or “appropriate level of efficiency” or “appropriate level of stability” or even “appropriate level of income”. It says “appropriate level of reliability”. Why?

Let’s look at a quick demonstration. Here’s a screenshot. What do you think it is showing? Try not to move on until you have an idea or you give up. Note: if it is hard to detect much detail in the picture below, that’s fine; it is rendering perfectly in your browser.

Why are we looking at these examples? Each of them represents an application that potentially took a business huge amounts of time, energy, and resources to create. But if the application isn’t up, if it isn’t operational when a customer needs to access it, if it isn’t reliable, then it does no one, especially the business, any good. In fact, a lack of reliability can do actual harm (reputational, economic, contractual, morale, and so on) to a business.

This is why reliability is so important and why SRE chooses to focus on it as a fundamental property, perhaps the fundamental property, of the service, system, or product. Reliability can encompass a number of things (we’ll talk about this some more later), but let’s move on to the second crucial part of the definition.

Appropriate levels of reliability

You may not have caught it the first time you read the definition, but let’s emphasize another important word, “appropriate”:

Site Reliability Engineering is an engineering discipline devoted to helping an organization sustainably achieve the appropriate level of reliability in their systems, services, and products.

Why does that word matter so much?

An important observation made by the SRE world is that very few systems and services have to be 100% reliable. Life-and-death situations like aviation, medical devices, etc. are notable exceptions.

In fact, there are very few situations where 100% reliability is even desirable. The effort and resources (and hence the cost) needed to achieve greater reliability rise steeply as greater reliability is sought. To say it another way, chasing after reliability you don’t need is a waste of time and money. You want to achieve the appropriate level of reliability in your systems, services, and products.
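One rough way to get a feel for why each step up gets harder is to look at how much downtime a given reliability target leaves room for. The sketch below is only an illustration: it assumes reliability is measured as availability over a 365-day year, and the targets and the function name `allowed_downtime_hours` are our own choices rather than anything prescribed by SRE. It says nothing about cost directly; it only shows how quickly the margin for error shrinks.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years (an assumption)


def allowed_downtime_hours(availability: float) -> float:
    """Hours per year a service may be down and still meet the target.

    `availability` is a fraction between 0 and 1, for example 0.999.
    """
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


# Illustrative targets only; pick the ones that match your business needs.
for target in (0.99, 0.999, 0.9999, 0.99999):
    hours = allowed_downtime_hours(target)
    print(f"target {target:.3%}: about {hours:.2f} hours of downtime per year")
```

Each extra “nine” cuts the allowed downtime by a factor of ten, from roughly 87.6 hours a year at 99% to a little over five minutes a year at 99.999%. Hitting the tighter targets leaves almost no room for mistakes, slow responses, or planned maintenance.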

The level needs to match the business needs and be pragmatic. For example, if your customers connect to you over a network that isn’t 100% reliable (let’s say it is up 90% of the time), spending the effort and money to make sure your service is 95% reliable is by definition a waste of time and money: your customers can never experience more reliability than the network in front of you delivers.
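The network example can be put into numbers with a short sketch. It rests on a simplifying assumption that is ours, not part of the definition above: the customer only succeeds when both the network and the service are up, and the two fail independently, so their availabilities multiply. The function name `perceived_availability` is hypothetical.

```python
def perceived_availability(network: float, service: float) -> float:
    """Availability the customer experiences when both parts must be up.

    Assumes the network and the service fail independently, so the
    combined availability is the product of the two fractions.
    """
    return network * service


NETWORK = 0.90  # the 90% network from the example above

# Illustrative service targets; better service never beats the network.
for service in (0.95, 0.99, 0.999, 0.99999):
    seen = perceived_availability(NETWORK, service)
    print(f"service at {service:.3%}: customer sees about {seen:.2%}")
```

Under these assumptions the network acts as a ceiling. However much you invest in the service, the customer’s experience stays below 90%, and the gain from each additional improvement to the service gets smaller and smaller.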

SRE takes this pragmatism one more step. If we can now think about there being a desirable level of reliability, is there something we should do if we are successful at meeting or surpassing that level? Similarly, what if we don’t achieve it? We’ll answer these questions later in the module.

Sustainably achieve

The final word from our definition that we need to highlight before we move on is sustainably. Sustainably refers to the role of people in all of this. It is crucial that we create a sustainable operations practice. Reliable systems, services, and products are built by people. If we don’t make sure that our work is sustainable (if we wake our people up at 3:00 AM every night with a page, if we don’t give them time with their families, if they don’t have the opportunity to take care of themselves), then there’s no way they’re going to be able to build reliable systems. SRE holds that it’s really important to implement an operations practice that is sustainable over time, so our people are able to bring their best to the job.