In the article “learn to fail or fail to learn” I briefly touched on the importance of learning from failures. In short: a development team should learn it is inevitable that systems in the public cloud fail and they should be prepared for it. From a business perspective, the Product Owner should give priority to the user stories associated with it. This article explains why SRE is important for your applications.
The role of SRE
When it comes to preparing for failures in production, Site Reliability Engineering (SRE) comes into play. SRE is a relatively new topic which is getting more traction nowadays. One of the most common goals of SRE is to apply a software engineering mindset to system administration topics. This helps to bridge the gap between developers and (server) operators.
System administrators who adopt the principles of SRE do not just “guard the system” to make sure it stays up every single minute of the day. Instead they learn from and act upon failures of the system during the application life-cycle to optimize performance, security, availability and scalability.
Since there are so many “moving parts” and systems in the cloud tend to be so complex, this is a more realistic approach than the traditional view: prevent any system from failing and fight the fires whenever a big disruption occurs. That traditional view does not take into account scenarios in which failing is expected.
Let’s take a look at the Service Reliability Hierarchy pyramid.
At the top is the IT product itself. Customers expect the IT product to fulfil their needs, and the application should be accessible in a reliable way. If the IT product fails and customers cannot access their applications, none of the other layers of the SRE pyramid matter. Customers demand changes and improvements to the IT product now and then, and SRE is part of the deal to keep up with these changes. Modern cloud infrastructure can fail for various reasons, and applications can also fail because of these changes.
In SRE, failures are expected, since they are inevitable. We can’t ignore them and wait until a critical disruption occurs. It’s a wise idea to be prepared for this and to adjust any application to deal with failures. Even with regular, normal updates and changes to the application, failures occur. Worse still, most incidents happen during releases of new versions of the software, which often contain new features or bug fixes for previous releases. So improving the app incurs a risk of incidents that need further improvements; the cycle is never-ending.
In SRE, the concept of an error budget helps manage that cycle of risk. An error budget is the percentage of time the application can be down without impacting the user significantly. Often, this amount is the surplus of time above the user’s regular service expectation. For instance, if the user expects the service to be up during the day (say 6am to 11pm), but doesn’t mind a short service outage during the night, we know the error budget is equal to that short period of time at night.
If the application has had incidents and the resulting downtime equals that allowed period of time, its error budget is spent. Because we know that most downtime is caused by changes and improvements to the app, we also know that for this measurement period (say, this month), the developers cannot release any new functionality. Bug fixes and security patches are allowed only as exceptions.
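The rule above can be sketched in a few lines of Python. This is only an illustration: the `ErrorBudget` class, its field names and the 60-minute allowance are assumptions made up for this example, not part of any SRE tool.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Downtime allowance for one measurement period (hypothetical model)."""

    allowed_downtime_minutes: float
    used_downtime_minutes: float = 0.0

    def record_incident(self, downtime_minutes: float) -> None:
        # Every incident's downtime is charged against the budget.
        self.used_downtime_minutes += downtime_minutes

    @property
    def remaining_minutes(self) -> float:
        # Never report a negative remainder once the budget is overspent.
        return max(0.0, self.allowed_downtime_minutes - self.used_downtime_minutes)

    def feature_releases_allowed(self) -> bool:
        # Downtime equal to (or above) the allowance means the budget is spent.
        return self.used_downtime_minutes < self.allowed_downtime_minutes


budget = ErrorBudget(allowed_downtime_minutes=60)  # assumed monthly allowance
budget.record_incident(25)
print(budget.feature_releases_allowed())  # True: 35 minutes remain
budget.record_incident(40)
print(budget.feature_releases_allowed())  # False: the budget is spent
```

In practice a team would probably wire a check like this into its release pipeline, so that the freeze on new functionality is not left to memory.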
The error budget helps to match the application’s uptime and performance with the user’s expectation continuously, while catering for regular changes and improvements that development teams push to production.
Striving for 100% uptime is infinitely costly and technically impossible. The number of nines, approaching that ideal 100%, is the simplest way of expressing how expensive a service’s availability will be, but also the simplest way of showing how much (or little, as the number of nines increases) change and improvement is allowed (i.e. the error budget).
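To make the number of nines tangible, here is a small sketch that converts it into allowed downtime. The function name and the 30-day measurement period are assumptions chosen for this example.

```python
def allowed_downtime_minutes(nines: int, period_days: int = 30) -> float:
    """Downtime allowed per period for an availability of N nines (2 means 99%)."""
    unavailability = 10 ** -nines          # 2 nines -> 0.01, 3 nines -> 0.001, ...
    period_minutes = period_days * 24 * 60
    return unavailability * period_minutes


for nines in range(2, 6):
    availability_percent = 100 * (1 - 10 ** -nines)
    minutes = allowed_downtime_minutes(nines)
    print(f"{availability_percent:.3f}% -> {minutes:.2f} minutes per 30 days")
```

Over a 30-day period this goes from roughly 432 minutes at two nines (99%) to less than half a minute at five nines (99.999%), which shows how quickly the room for change shrinks.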
Even cloud-native applications, which allow rolling upgrades across the multiple running instances (or pods, or containers) of the same service, are susceptible to this risk of downtime. New features cause changes to the application and this can increase the risk of failure when the application is in production.
The role of SLIs and SLOs
To quantify the user’s expectation, the SRE methodology uses Service Level Indicators (SLIs). SLIs are fine-grained indicators of (usually) performance and availability, but other qualitative measures (for instance data integrity or security) are also used. Each application is first split into its atomic, or smallest, parts. For each of these components, teams come up with one or more metrics (Indicators) that describe a quantitative and qualitative aspect of that component.
Sounds complex, right? You’re probably more familiar than you think. The response time performance for a web service, for instance. Or the availability during office hours. Or the number of transactions of the shopping cart. It’s important to note that an SLI is just describing what thing to measure using what metric. It doesn’t have a goal, or objective, attached to it.
That’s where the SLO comes in. It codifies the user’s expectation of that component. For example, the response time for the web service mentioned above is 2 seconds, maximum.
An SLI can be reused across different components. For instance, that response time SLI can be used on both the simple, static landing page and the dynamic products page with many images. The SLOs for these two components will differ, for instance 0.5 seconds for the landing page and 2 seconds for the products page.
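The split between “what to measure” and “what to aim for” can be sketched as follows. The names `slowest_response_seconds` and `SLO`, and the sample measurements, are made up for this illustration; a real setup would more likely use percentiles from a monitoring system than a plain maximum.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


def slowest_response_seconds(response_times: Sequence[float]) -> float:
    """SLI: describes what to measure and how. It carries no target of its own."""
    return max(response_times)


@dataclass(frozen=True)
class SLO:
    """SLO: attaches a target to an SLI for one specific component."""

    component: str
    sli: Callable[[Sequence[float]], float]
    max_seconds: float

    def is_met(self, response_times: Sequence[float]) -> bool:
        return self.sli(response_times) <= self.max_seconds


# The same SLI is reused; only the objective differs per component.
landing_page = SLO("landing page", slowest_response_seconds, max_seconds=0.5)
products_page = SLO("products page", slowest_response_seconds, max_seconds=2.0)

samples = [0.3, 0.4, 1.2]  # made-up measurements, in seconds
print(landing_page.is_met(samples))   # False: 1.2 s exceeds the 0.5 s target
print(products_page.is_met(samples))  # True: 1.2 s is within the 2 s target
```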
SLIs and SLOs are crucial in determining the error budget of (parts of) the application, as defining the SLO is equal to knowing the error budget, by simple subtraction: 100% minus the SLO = error budget.
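As a worked example of that subtraction, the sketch below turns an SLO into a number of requests that may fail. The function names and the figures (a 99.9% SLO over one million requests) are assumptions chosen purely for illustration.

```python
def error_budget_percent(slo_percent: float) -> float:
    """Error budget = 100% minus the SLO."""
    return 100.0 - slo_percent


def allowed_failures(slo_percent: float, total_requests: int) -> int:
    """Number of requests that may fail before the budget is spent."""
    # round() absorbs tiny floating-point errors in the subtraction above.
    return round(total_requests * error_budget_percent(slo_percent) / 100)


print(allowed_failures(99.9, 1_000_000))  # 1000 requests may fail
```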
A change may result in a break of the production system. Now what?
It’s time to act quickly, but in a professional way. Developers should first be trained for this. If not, they will do whatever they think is best. They might not follow the incident response process and start to change things which might not be relevant to the problem. In fact, they can make things worse. Practice the incident response process and share it inside and outside the team. This way, everyone knows what to expect when there is a critical issue.
Keep track of all incidents and classify them. Recurring incidents can reveal a more severe problem. Finding the root cause is king here, since it lets the team fix the underlying problem instead of only its symptoms. Document it, for example in a known error database. Once the root cause of problems is found, the team can make the product better and more stable. Remember the error budget: spend it to improve the product.
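A minimal sketch of such incident tracking could look like this. The `Incident` fields, the categories and the example incidents are invented for illustration; a real known error database would hold far more detail.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class Incident:
    summary: str
    category: str    # classification chosen by the team, e.g. "deployment"
    root_cause: str  # as documented in the known error database


def recurring_root_causes(incidents: Iterable[Incident], threshold: int = 2) -> Dict[str, int]:
    """Return the root causes seen at least `threshold` times, with their counts."""
    counts = Counter(incident.root_cause for incident in incidents)
    return {cause: n for cause, n in counts.items() if n >= threshold}


incidents = [
    Incident("Checkout returned errors", "deployment", "missing database migration"),
    Incident("Login page timed out", "capacity", "connection pool exhausted"),
    Incident("Orders API returned errors", "deployment", "missing database migration"),
]
print(recurring_root_causes(incidents))  # {'missing database migration': 2}
```

Root causes that show up more than once are good candidates for the next round of improvements.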
These examples clarify why it is important to have an error budget in place. The error budget may need to be adjusted regularly as the IT product changes over time. Kept up to date, it helps you keep the applications stable and recover from failures quickly. Ultimately your customers will stay happy.
All of these concepts help to free up the minds and budgets for development teams. This gives space for new features for the product they are working on. New features actually contribute to the satisfaction of customers. That’s why all of this is important.