Resilience Testing

10.10.2022

Ben Blackmore

5 minutes read

This article will look closely at continuous verification with Steadybit through resilience testing and how it helped us internally.

Articles and books offer many resources on chaos engineering and, specifically, its experimental nature. Suppose such experiments have given you knowledge and insights into your system: how do you translate that knowledge into confidence? We believe that you can only build confidence over time, through continuous learning (significant and relevant knowledge requires continuous investment) and continuous verification (what once was true doesn't have to be true in the future, so we need to keep correcting the verification, our understanding, or both).

Resilience Testing with Steadybit

Within the wider industry, we apply several testing methodologies for continuous verification. For example, unit, integration, end-to-end and performance/load testing are cornerstones of software engineering practices. These methodologies help us gain confidence and improve reliability in many aspects of our systems. However, these tests are typically either not executed within deployed environments (unit and integration tests) or not executed under turbulent conditions (end-to-end and performance/load tests).

Through resilience testing, we introduce turbulent conditions either into existing tests (for example, k6 load tests or integration tests) or through dedicated resilience tests.

Within Steadybit, you can turn a combination of attacks, checks/probes, load tests and arbitrary actions (through ActionKit) into an experiment. There are no restrictions on the number of steps within an experiment or its complexity. As a result, you can use Steadybit's experiment capability to implement resilience tests.

Resilience Testing Rolling Updates

As part of our recurring internal game days, we discovered that deployments of our product no longer completed without short interruptions for our customers. We used a Kubernetes deployment with sufficient replicas and a rolling update strategy, but something was wrong. Within the next iteration, we investigated this regression to identify and fix the cause, which turned out to be an AWS ALB misconfiguration related to sticky sessions.

Let us pause here: How would you write a test to verify your expected rolling update strategy? How do you ensure that rolling updates continue to work? Feel free to let us know on Twitter at @SteadybitHQ, or through any other channel you like!

To ensure that the rolling update strategy continues to work in the future, we leveraged Steadybit to write a resilience test.

The image above shows part of Steadybit's experiment designer. Analogous to a movie editor's timeline view, you can leverage the experiment designer to execute steps concurrently and sequentially.
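The timeline idea can be illustrated with a small sketch. This is not Steadybit's experiment format or API; the names Step, Lane, run_lane and run_timeline are hypothetical and only model "steps in one lane run sequentially, lanes run concurrently".

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List

# Hypothetical model, not Steadybit's data structures:
Step = Callable[[], str]   # a step returns a short label when it finishes
Lane = List[Step]          # steps within one lane run one after another


def run_lane(lane: Lane) -> List[str]:
    """Run the steps of a single lane sequentially and collect their results."""
    return [step() for step in lane]


def run_timeline(lanes: List[Lane]) -> List[List[str]]:
    """Run all lanes concurrently; results keep the order of the lanes."""
    if not lanes:
        return []
    with ThreadPoolExecutor(max_workers=len(lanes)) as pool:
        return list(pool.map(run_lane, lanes))


if __name__ == "__main__":
    timeline = [
        # Lane 1: a pre-condition check followed by the system modification.
        [lambda: "pre-condition ok", lambda: "rolling update done"],
        # Lane 2: a check that runs alongside lane 1.
        [lambda: "HTTP check ok"],
    ]
    print(run_timeline(timeline))
```

A real experiment additionally needs timing, abort-on-failure and reporting, which the sketch leaves out on purpose.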

The green box represents the system modification we are executing. In this case, it is a simulated rolling update, triggered through Steadybit with the kubectl rollout restart command. As mentioned above, we learned during our last game day that the system misbehaved during rolling updates. Therefore, the rolling update is at the core of the experiment.
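Outside of Steadybit, the same system modification could be triggered by hand. The following is a minimal sketch that assumes kubectl is installed and configured for the target cluster; the deployment and namespace names are placeholders, and the helper functions are illustrative rather than part of any Steadybit extension.

```python
import subprocess
from typing import List


def rollout_restart_command(deployment: str, namespace: str) -> List[str]:
    """Build the kubectl invocation that restarts all pods of a Deployment."""
    return [
        "kubectl", "rollout", "restart",
        f"deployment/{deployment}",
        "--namespace", namespace,
    ]


def trigger_rolling_update(deployment: str, namespace: str) -> None:
    """Run the command; raises CalledProcessError if kubectl exits non-zero."""
    subprocess.run(rollout_restart_command(deployment, namespace), check=True)


if __name__ == "__main__":
    # "shop" and "production" are placeholder names, not taken from the article.
    print(" ".join(rollout_restart_command("shop", "production")))
```

Building the argument list separately from running it keeps the command easy to inspect and test without access to a cluster.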

The yellow boxes depict invariants, pre-conditions and post-conditions that need to be maintained. They represent checks that are verified for a certain amount of time. These checks verify how we want the system to behave, and they directly result from the observations from the last game day. In other words, we turned observations from a game day into checks for continuous verification. Consequently, they result in fast, repeatable and cheap verification that doubles as documentation. To give you an overview, these are the types of checks we leverage:

No pending rollouts: This check is implemented by our Kubernetes extension. It internally leverages kubectl rollout status.
All pods ready: This check is bundled within Steadybit's agents and compares the ready pod count to the desired pod count.
No degraded synthetic checks: We leverage Checkly for synthetic checks (pings/HTTP calls). Checkly exposes a Prometheus scraping endpoint through which we turn synthetic check executions into metrics. We leverage a PromQL query (checkly_check_status{tags=~"production,.*"}) with the help of our Prometheus extension to check that there are no failed synthetic checks.
API calls are successful: We also execute HTTP API calls against our system to verify that it is continually available. This HTTP check is a native capability of Steadybit's agents.
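To make the checks above more concrete, here is a rough sketch of how similar checks could be written by hand. It is not how Steadybit's agents or extensions implement them. It assumes a Deployment object as returned by kubectl get deployment -o json, a response from Prometheus's instant query endpoint (/api/v1/query), that a checkly_check_status value of 1 means the check is passing, and that the series carry a check_name label. All function names are hypothetical.

```python
import time
import urllib.request
from typing import Callable, Dict, List


def all_pods_ready(deployment: Dict) -> bool:
    """Compare the ready pod count with the desired pod count."""
    desired = deployment.get("spec", {}).get("replicas", 1)
    # Kubernetes omits readyReplicas from the status when it is zero.
    ready = deployment.get("status", {}).get("readyReplicas", 0)
    return ready == desired


def failed_synthetic_checks(prometheus_response: Dict) -> List[str]:
    """List checks whose latest sample is not 1 (assumed to mean 'passing')."""
    failed = []
    for series in prometheus_response["data"]["result"]:
        _timestamp, value = series["value"]  # instant vector: [ts, "value"]
        if float(value) != 1.0:
            # The check_name label is an assumption about the exported metric.
            failed.append(series["metric"].get("check_name", "<unknown>"))
    return failed


def api_call_successful(url: str, timeout_seconds: float = 5.0) -> bool:
    """Return True if the URL answers with a 2xx status within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout_seconds) as response:
            return 200 <= response.status < 300
    except OSError:  # covers URLError, HTTPError and socket timeouts
        return False


def holds_for(
    check: Callable[[], bool],
    duration_seconds: float,
    interval_seconds: float,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Re-evaluate a check until the duration has passed; fail on first miss."""
    deadline = time.monotonic() + duration_seconds
    while True:
        if not check():
            return False
        if time.monotonic() >= deadline:
            return True
        sleep(interval_seconds)
```

The holds_for helper reflects the idea that an invariant is verified for a certain amount of time rather than once. For example, holds_for(lambda: api_call_successful(url), 120, 1) would poll a placeholder url once per second for two minutes while a rolling update runs.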

Conclusions

With all the complexity of modern systems, confident development and operation are vital. Resilience testing and continuous verification build confidence in the systems you maintain and help you evolve them. It is essential to always have an up-to-date picture of the system's risks and to run such resilience tests with a high degree of automation. Steadybit can be leveraged to author and execute these resilience tests. Try it yourself; you can get started for free!