If you are not doing Chaos Engineering, you are losing money
Everyone knows that system downtime costs money, and everyone would prefer to avoid it at all costs. Chaos Engineering is often mentioned as a possible solution. But is it really worth it? That's why we want to talk numbers today.
Microservices are now well established; this is no longer questioned. However, with an increasing number of components and interconnections, the complexity of the system grows, and with it the risk of unexpected system behavior.
You’re probably familiar with these issues:
- Unavailable dependencies: The database suddenly disappears in the middle of the night, even though nobody changed anything
- Unexpected traffic leads to a CPU bottleneck that accidentally brings down the whole cluster
- The backup... where is the backup?
- Network issues: The response time of the authorization service increases, and every single request to the business application slows down as well
The good news: You are not alone. Just look at any website that tracks outages: a lot of large companies struggle with downtime. But we can do something to improve that.
Because one thing is clear: Downtime is very expensive and increases the risk that users and customers will turn away from the service, especially if outages occur frequently.
As an IT expert, you have probably also noticed that a large number of automated tests with a coverage of 80-90% does not really make systems more stable or increase availability. There is still a testing gap caused by the architecture of the distributed systems we use today: your tests mostly assume that your system is fully available, but you never know what happens when parts of it are down or respond slowly. Chaos Engineering has proven to be a useful strategy here. It helps you better understand your own system by deliberately injecting failures and anomalies, so you can make it more stable afterwards.
To chaos or not to chaos: The Cost-Benefit Analysis
But before you start breaking things in a company, the CTO (or someone in a similar role) will ask: What are the costs? And what is the benefit? The answer is of course not simple and depends on several factors. The basic idea is that by applying Chaos Engineering, you reduce the number and duration of downtimes and thus save the costs they incur. Costs here mean, for example, that you cannot sell any goods while the online store is down. In addition, customers may be dissatisfied with the service and look elsewhere. The experts at Netflix have developed a formula that calculates the return on investment (ROI):
In general, the ROI is a ratio between net income (over a period) and investment (costs resulting from an investment of some resources at a point in time).
Mapped to the Chaos Engineering use case, that means:

ROI = (cost of outages preventable by chaos − cost of chaos-induced harm − cost of effort) / (cost of chaos-induced harm + cost of effort)
To a large extent, the cost of outages depends on the size of the company and the impact of IT on the business process. In recent years, digitalization has found its way into almost every company, and practically everyone depends on the correctness and stability of their systems. Gartner already stated in 2019 that you can expect average costs of about $5,600 per minute. But there are many companies out there with extremely high costs in case of a system failure (e.g., finance, stock exchanges), and in some sectors outages can even cost human lives (e.g., transportation).
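To get a feeling for what that average figure means, here is a minimal back-of-the-envelope sketch in Python. It only extrapolates Gartner's ~$5,600/minute average; your own per-minute cost will differ and should replace the constant:

```python
# Rough downtime cost estimate based on Gartner's average of ~$5,600/min.
# COST_PER_MINUTE is an industry average, not a figure for your business.
COST_PER_MINUTE = 5_600  # USD

def downtime_cost(minutes: float, cost_per_minute: float = COST_PER_MINUTE) -> float:
    """Estimated loss for an outage of the given length."""
    return minutes * cost_per_minute

print(downtime_cost(60))      # one hour of downtime -> 336000 USD
print(downtime_cost(8 * 60))  # a full working day -> 2688000 USD
```

Even at the average rate, a single one-hour outage already costs more than many teams spend on resilience in a year.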
To simplify matters, let's assume the following example scenario:
- Cost of outages preventable by chaos = $500,000
- Cost of chaos-induced harm = $10,000
- Cost of effort doing chaos (chaos team) = $250,000
The cost of the chaos-induced harm (here $10,000) must of course also be taken into account, since failures caused by Chaos Engineering experiments lead to downtime costs as well, even if they only occur in a dedicated test system. But one thing is clear: the chaos-induced harm should be smaller than the outage it prevents.
Inserted into the equation, this gives an ROI of roughly 92%: ($500,000 − $10,000 − $250,000) / ($10,000 + $250,000) = $240,000 / $260,000 ≈ 0.92.
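As a sanity check, the example scenario can be plugged into the formula in a few lines of Python. The figures below are the example's assumptions, not real measurements:

```python
# ROI sketch for the example scenario above (assumed figures, in USD).
prevented_outage_cost = 500_000  # cost of outages preventable by chaos
chaos_induced_harm = 10_000      # cost of harm caused by the experiments
chaos_team_cost = 250_000        # cost of effort doing chaos (chaos team)

# Investment = everything Chaos Engineering itself costs you.
investment = chaos_induced_harm + chaos_team_cost
net_income = prevented_outage_cost - investment
roi = net_income / investment

print(f"ROI: {roi:.0%}")  # -> ROI: 92%
```

Note that the chaos-induced harm is counted on the investment side, not just subtracted from the savings, which is why the result is 92% rather than 96%.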
Okay, that was a lot of numbers. But what does an ROI of 92% actually mean? The ROI measures the profitability of an investment by relating the profit to the capital employed. Here, every dollar invested in Chaos Engineering returns an additional 92 cents, which is a great result.
This is of course a very simplified calculation, but it should help you decide whether the investment in Chaos Engineering is worth it. A realistic calculation of downtime costs is much harder in practice than in theory. Still, once you have an idea of the costs you need to invest in Chaos Engineering, it definitely helps you prioritize it among feature requests, technical bugs reported by users, and business decisions.
Things to consider
In addition to the technical benefits of Chaos Engineering, there are several other factors and areas to take care of, because it is not just about quantitative aspects but also qualitative ones. Naturally, people will ask whether the focus should now be on features or on improving resilience. One thing is certain: breaking things is a lot of fun, but results need to show up for the effort put into Chaos Engineering to make sense. One part is the technical tooling for breaking things, like stressing the CPU or disturbing network traffic. But many companies never move past that stage and ultimately get no value from the methodology. Furthermore, experiments have to be prepared, teams informed, and findings documented, so collaboration is an essential part of it. And to know whether it really shows positive results, the effects also have to be measurable; there is another blog post that goes into the details of this topic. The most difficult task of all, though, is deciding which experiments to perform, and why.
By the way: with steadybit (a resilience engineering platform), you might be able to reduce your costs drastically, because you don't need your own Chaos Engineering team to build a complex tool chain. After a quick setup, you are ready to run your first experiments. steadybit also helps you find the right entry point for experiments and uncovers the weak spots in your system much faster than you might expect.
We at steadybit believe that in the near future, every developer will be able to help improve the resilience of their systems, given the right tooling. The earlier the topic plays a role in the development process (shift left), the sooner possible errors are noticed and can be corrected at a lower cost. Experimenting with chaos will remain important, but in the context of resilience engineering there are many more, and maybe different, ways to get better.
Make your systems more resilient and contact us for a demo.