Learn to fail or fail to learn

Many websites are dedicated to DevOps nowadays. The topic is trending and is highlighted in talks at webinars and conferences all around the world. Many companies are in the middle of discussing best practices for their organizational challenges. One of the key aspects of Agile and DevOps is the fast feedback loop. Such loops can be defined on multiple levels:

  • Feedback loops on the application level to quickly test and validate the last release
  • Feedback loops on the organizational and team level to improve the collaboration of team members and stakeholders

Why feedback is important

Now that cloud technology is becoming more popular, teams are able to deploy new versions of their application much faster than before. At the same time, infrastructure is getting more complex every day. It’s no longer just a simple virtual machine or a container for a single application. It’s a rich ecosystem of (managed) virtual machines, containers, load balancers, external services, application components, etc.

Modern cloud deployment patterns are built on “loosely coupled” services. Many services make up one application. Many of these services can trigger other actions (events), which act as the starting point for other business logic.

To make this practical, some examples help to understand the concepts:

  • The application consists of multiple microservices which need to talk to each other.
  • The monitoring and alerting services are activated right after the deployment of the application.
  • The infrastructure scales up in response to a high system load.

All of these need to work together for the deployment to be successful. This increases the importance of the feedback loop.

Failures will happen

That being said, it’s obvious that these complexities can quickly lead to failures. Examples of failures are:

  • The new version of an application is broken, but you only notice this once the other services are deployed.
  • The configurations of the external systems are not properly aligned with each other.

And what about the connections to other applications and systems, sometimes even in another cloud? When things go wrong, it’s difficult to pinpoint the problem.

Risk management

In the cloud era, there is a shift of responsibility. Unlike the good old VM which you know and control yourself, you now depend on the infrastructure of someone else. You don’t know all of the details of how it is configured.

Risk management is changing. In the past, risk managers tended to think in terms of “avoiding risks” or “controlling them”. Applications should not fail at all, and every action should be taken to ensure this. Nowadays teams need to learn how to prepare for failure. The goal is to learn how to make applications recover from failures. It’s best to have risk managers adopt these principles as well. This mindset aligns with the concepts of DevOps and adheres to the principle of “failing fast”.
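To make “recovering from failure” a bit more tangible, here is a minimal Python sketch of one common recovery technique: retrying a failing call with exponential backoff. It is only an illustration under assumptions. The names `call_with_retry` and `FlakyService` are hypothetical and not taken from any real library, and a real system would likely need more (jitter, limits per dependency, alerting).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retry(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.1,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `operation`, retrying on ConnectionError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # Double the waiting time after every failed attempt.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


class FlakyService:
    """Hypothetical dependency that fails a fixed number of times, then works."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.calls = 0

    def fetch(self) -> str:
        self.calls += 1
        if self.calls <= self.failures:
            raise ConnectionError("simulated outage")
        return "ok"
```

The `sleep` parameter is injectable on purpose: a test can pass in a function that only records the delays, so practicing with failure does not slow down the test suite.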

The role of the product owner

This is not only a concern for the development team; it’s also a concern for the business.

Product owners might already be aware of the value of (constant) refactoring to optimize their applications. This is all about non-functional requirements in the context of DevOps. Nowadays they should also be aware of the failures that can happen in the new cloud deployment models.

Failing fast and the opportunity to learn from it help teams to build resilient infrastructure and applications. It also helps to reduce technical debt. The principles of Infrastructure as Code and immutable infrastructure are a starting point, but from a technical perspective only. Teams need the buy-in from the business to get the time to prepare their systems.

Support the team

As a product owner you can support the team by adding user stories like these:

  • Define non-happy scenarios and build tests for these.
  • Experiment with network failures: what is the behavior of the application when a network connection is broken?
  • Simulate a sudden spike in traffic coming from another service to see if your application survives.
  • Test different versions of configuration to see if external services still interact correctly with your application.
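A user story such as the network-failure experiment above could start as small as the following Python sketch. Everything in it is an assumption for illustration: `load_profile`, the cached fallback, and the two fake network functions are hypothetical names, and the broken connection is simulated by raising `ConnectionError` rather than by cutting a real link.

```python
from typing import Callable, Dict

# Hypothetical fallback data used when the remote service is unreachable.
CACHED_PROFILE: Dict[str, str] = {"name": "unknown", "source": "cache"}


def load_profile(fetch_remote: Callable[[], Dict[str, str]]) -> Dict[str, str]:
    """Return the remote profile, or the cached one if the network fails."""
    try:
        return fetch_remote()
    except (ConnectionError, TimeoutError):
        return dict(CACHED_PROFILE)


def healthy_network() -> Dict[str, str]:
    """Stand-in for a working remote call."""
    return {"name": "alice", "source": "remote"}


def broken_network() -> Dict[str, str]:
    """Stand-in for a remote call over a broken connection."""
    raise ConnectionError("simulated broken connection")


def test_happy_path() -> None:
    assert load_profile(healthy_network)["source"] == "remote"


def test_broken_connection_falls_back_to_cache() -> None:
    # The non-happy scenario: the application should degrade, not crash.
    assert load_profile(broken_network)["source"] == "cache"
```

The test functions follow the naming convention that test runners such as pytest pick up, but they can also be called directly. The point is that the non-happy scenario gets its own explicit test next to the happy path.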

Show the outcome of these user stories as part of the sprint review to spread the principle of “learning from failure” to the rest of the organization.

Fail to learn?

Imagine a scenario in which the team faces a time-out issue which does not reveal much information about the exact cause. The team does not know how to debug it properly, and some aspects go beyond its scope. For example, some AWS services are very hard to debug even when detailed logging is switched on.

When teams learn how to fail, they learn how to recover from failure. Learning is less stressful when practicing with a “hello world” application rather than a critical workload that requires 24/7 uptime. The price of recovering from failures in production is high since it happens at one of the last stages of the application life cycle.

Wrap up

Learning to fail applies to everyone in an organization that embraces highly flexible infrastructure and applications. Fast feedback loops for complex cloud deployment patterns are essential for this, so it pays to be prepared.

Better start learning from failures now, before your working day is ruined 🙂