The next Level of Chaos Engineering Experiments

31.08.2021 by Manuel Gerding - 6 min read

The real world is complicated, and that is especially true for distributed systems. This pushed steadybit's experiment engine to its limits when trying to represent the real world (AKA production) in a very simple experiment. In this blog post, we'll unleash a firework of new experiment features that address these shortcomings and are already loved by many of our users.

Let’s replay real Incidents

As an attentive reader, you may be familiar with our other blog posts covering simple experiments to survive an AWS zone outage, testing exception handling of REST endpoints and revealing the top 3 Kubernetes weak spots. Now we want to test more complicated examples, such as a rippling container outage, a slow network combined with packet corruption, or a DNS failure that occurs at the exact moment a container restart is performed. In this blog post, we will test the last scenario and start by simulating a simple DNS failure. For that, we use the online shopping demo application, which consists of a gateway microservice that periodically requests products from other microservices over HTTP (hotdeals, fashion-bestseller and toys-bestseller). See the Shopping Demo's GitHub repository for more details.

steadybit has discovered our system

Since steadybit has already discovered our system, we can use the familiar wizard for experiments. Just follow the steps below to create this experiment yourself. If you aren't using steadybit yet, let's meet in a short demo call.

Step 1: Create DNS Outage Experiment

We start on the steadybit dashboard to create an experiment, which leads us into the familiar wizard.

Start Experiment from Dashboard
  1. We give the experiment a meaningful name (e.g. "Gateway survives DNS outage") and choose the “Global” area (which gives us full access to all discovered targets).
  2. We decide to attack containers and specify them by attributes. The query narrows down the DNS outage to the gateway microservice only and leaves the remaining deployments untouched.
Create Experiment - specify Targets via Query
  3. Since we want to achieve the maximum effect, we choose an impact of 100% in the following step of the wizard, so we attack the containers of both PODs of the deployment.
  4. In the last step, we apply the "DNS Failure" attack from the "network" category and finish the wizard by saving our experiment.
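
To make the target selection more tangible, here is a minimal Python sketch of the idea behind the query and the impact percentage. It is not steadybit's implementation or API: the attribute key `deployment`, the container names and the function `select_targets` are made up for illustration, and a real tool would typically pick the affected share at random rather than taking the first matches.

```python
import math
from typing import Dict, List

Container = Dict[str, str]


def select_targets(containers: List[Container], query: Dict[str, str],
                   impact_percent: int) -> List[Container]:
    """Return the containers matching every attribute in the query,
    limited to the requested share (rounded up)."""
    matching = [c for c in containers
                if all(c.get(key) == value for key, value in query.items())]
    count = math.ceil(len(matching) * impact_percent / 100)
    return matching[:count]


# Hypothetical discovery result: two gateway containers, one hotdeals container.
discovered = [
    {"name": "gateway-1", "deployment": "gateway"},
    {"name": "gateway-2", "deployment": "gateway"},
    {"name": "hotdeals-1", "deployment": "hotdeals"},
]

# The query narrows the attack to the gateway; 100% impact hits both containers.
targets = select_targets(discovered, {"deployment": "gateway"}, impact_percent=100)
print([t["name"] for t in targets])
```

With an impact of 50%, the same query would select only one of the two gateway containers and leave the other one untouched.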

Step 2: The New Experiment Editor

So, now we land for the first time in our new experiment editor! It is pretty neat and allows us to add additional attacks, checks or actions to the experiment via drag and drop. We can decide for ourselves whether we want to run multiple attacks one after the other or simultaneously. We will make more use of this in step 4.

Create Experiment - the new Editor

Step 3: Run Experiment to check System

For now, we will simply run the experiment we have created and see how our application behaves.

This brings us into the new steadybit execution window (left side of the screenshot), which provides a lot of details about the system. For now, the system looks stable and everything seems to be working. We will take a closer look at the system in step 5 to analyze erroneous behavior.

Run Experiment - System still works fine

Opening up the demo shop in parallel (right side), we get the same impression: the shop is still working, so we are safe!

But wait, are we really? As you probably know: DNS records are cached. This way, a DNS failure is not so bad as long as the DNS cache is up to date. But what happens to the system when the DNS cache isn’t there yet?
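
The difference between a warm and a cold DNS cache can be sketched in a few lines of Python. This is a simplified model and not how the gateway actually caches lookups: `CachingResolver`, `FakeDns`, the TTL of 30 seconds and the address `10.0.0.7` are all assumptions for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class FakeDns:
    """Stand-in for a DNS server that can be switched off (assumption)."""

    def __init__(self) -> None:
        self.available = True

    def __call__(self, host: str) -> str:
        if not self.available:
            raise OSError(f"DNS failure while resolving {host}")
        return "10.0.0.7"  # made-up address


class CachingResolver:
    """Resolves hostnames and remembers the answers for ttl_seconds."""

    def __init__(self, lookup: Callable[[str], str], ttl_seconds: float = 30.0) -> None:
        self._lookup = lookup
        self._ttl = ttl_seconds
        self._cache: Dict[str, Tuple[str, float]] = {}

    def resolve(self, host: str) -> str:
        entry = self._cache.get(host)
        if entry is not None and entry[1] > time.monotonic():
            return entry[0]                 # cache hit: no DNS query needed
        address = self._lookup(host)        # cache miss: needs a working DNS
        self._cache[host] = (address, time.monotonic() + self._ttl)
        return address


dns = FakeDns()
warm = CachingResolver(dns)
warm.resolve("hotdeals")                    # fills the cache while DNS works

dns.available = False                       # the DNS failure starts
warm_result = warm.resolve("hotdeals")      # still answered from the cache

cold = CachingResolver(dns)                 # e.g. a freshly restarted container
try:
    cold.resolve("hotdeals")
    cold_ok = True
except OSError:
    cold_ok = False

print(warm_result, cold_ok)
```

The resolver with the warm cache keeps working during the outage, while the one that starts with an empty cache fails on its first lookup.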

Step 4: Levelling Up the Experiment

Let's extend the scenario by adding more steps. First, we add a wait step of a few seconds before the "DNS failure" attack. In parallel, we add an HTTP call that runs from the very beginning of the experiment. It periodically requests the HTTP endpoint of the online shop and thereby ensures that the DNS entries are cached by the time the DNS failure is simulated.

Edit Experiment - adding HTTP call for verification

The next step is to crash one of the gateway containers (using a simple stop container attack), which is immediately restarted by Kubernetes. We can also check if the POD has been successfully marked as unhealthy and is ready again afterwards by using the POD count checks (see lane 3).

Edit Experiment - adding stop container and POD count checks

All right, ready to go! Let's check how our system behaves and fingers crossed that the user won’t notice any downtime. We'll run the experiment again.

Step 5: Boom! Weak Spot revealed!

Uh-oh... This doesn't look so good. The experiment failed because the HTTP call received some erroneous responses. Let's take a closer look and analyze what happened.

The execution view already shows all necessary details to find out what happened here:

  • The execution log shows which step is currently running, which was successful, and which failed. We can see that the HTTP call check was below the default threshold of 100% successful requests and thus failed.
  • Checking the Deployment Replica Count, we can see that a POD was not ready for a few seconds. This is plausible since we terminated the underlying container at the same time (compare the timing with the execution widget).
  • The Kubernetes Event Log shows how Kubernetes reacted: it reports an unhealthy endpoint and a restart of the gateway container. The Instana Event Log proves that our monitoring works, as it detected an increase in the number of erroneous calls!
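
The logic of the failed HTTP call check from the first bullet can be sketched as follows. This is only an illustration of a success-rate threshold, not steadybit's code: the function names and the canned status codes are assumptions, and `send_request` stands in for a real HTTP request to the shop.

```python
from typing import Callable, Iterable


def success_rate(status_codes: Iterable[int]) -> float:
    """Percentage of responses with a 2xx status code."""
    codes = list(status_codes)
    if not codes:
        return 0.0
    successful = sum(1 for code in codes if 200 <= code < 300)
    return 100.0 * successful / len(codes)


def http_check(send_request: Callable[[], int], request_count: int,
               threshold_percent: float = 100.0) -> bool:
    """Send request_count requests and compare the success rate to the threshold."""
    codes = []
    for _ in range(request_count):
        try:
            codes.append(send_request())
        except OSError:
            codes.append(0)  # count connection-level failures as errors
    return success_rate(codes) >= threshold_percent


# Canned responses instead of real HTTP calls (assumption for illustration).
stable_run = iter([200, 200, 200, 200]).__next__
turbulent_run = iter([200, 500, 200, 500]).__next__

print(http_check(stable_run, 4))      # True: 100% successful
print(http_check(turbulent_run, 4))   # False: 50% is below the 100% threshold
```

With a threshold of 100%, a single erroneous response is enough to fail the check, which is exactly what makes the weak spot visible.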

Putting the pieces of the puzzle together, we can conclude that the restarted container was immediately marked as ready again (as soon as the configured readiness probe was successful). Nevertheless, due to the DNS failure and the missing DNS cache, it was not able to reach any of the other microservices (hotdeals, fashion-bestseller and toys-bestseller) via HTTP.

However, Kubernetes assumes that it has two ready PODs which are both working fine, so it routes HTTP requests to both of them. Thus, requests also reach the newly started container, and Kubernetes has no idea that this container is not working correctly because its DNS cache is empty and its DNS lookups fail.
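
A small simulation illustrates why only some of the requests fail. It is a deliberately simplified model: strict round-robin routing, the class `GatewayPod` and the status code 500 for the broken POD are assumptions, and real Kubernetes Service routing does not necessarily alternate this evenly.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import List


@dataclass
class GatewayPod:
    name: str
    ready: bool            # what the readiness probe reports
    dns_cache_warm: bool   # whether the POD can still reach its backends

    def handle_request(self) -> int:
        # Without cached DNS entries the backends are unreachable during the outage.
        return 200 if self.dns_cache_warm else 500


def route_requests(pods: List[GatewayPod], request_count: int) -> List[int]:
    """Distribute requests round-robin across all PODs that report ready."""
    ready_pods = [pod for pod in pods if pod.ready]
    if not ready_pods:
        return [503] * request_count
    rotation = cycle(ready_pods)
    return [next(rotation).handle_request() for _ in range(request_count)]


pods = [
    GatewayPod("gateway-1", ready=True, dns_cache_warm=True),
    GatewayPod("gateway-2", ready=True, dns_cache_warm=False),  # restarted POD
]
responses = route_requests(pods, 4)
print(responses)
```

Both PODs report ready, so the restarted one receives its share of the traffic and answers with errors, while the other one keeps serving requests successfully.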

We can validate this hypothesis by re-running the experiment a few times: depending on how Kubernetes routes each request, it may be served by the healthy gateway POD or by the restarted one, so not every request fails.

Step 6: Fixing the Weak Spot

Now that the weak spot has been exposed, we have several options to fix it. The simplest approach would be to extend the health endpoint of the gateway to also check the availability of the endpoints of other microservices (hotdeals, fashion-bestseller and toys-bestseller).
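
A minimal sketch of such an extended health check could look like this. It is not the demo shop's actual code: the function `readiness` and the injected `probe` callable are hypothetical, and only the three service names come from the shop itself.

```python
from typing import Callable, Dict, Sequence, Tuple

DOWNSTREAM_SERVICES = ("hotdeals", "fashion-bestseller", "toys-bestseller")


def readiness(probe: Callable[[str], bool],
              services: Sequence[str] = DOWNSTREAM_SERVICES) -> Tuple[int, Dict[str, str]]:
    """Report ready (200) only if every downstream service can be reached."""
    details: Dict[str, str] = {}
    for service in services:
        try:
            reachable = probe(service)
        except OSError:          # e.g. a failed DNS lookup or refused connection
            reachable = False
        details[service] = "UP" if reachable else "DOWN"
    status = 200 if all(state == "UP" for state in details.values()) else 503
    return status, details


def dns_is_down(service: str) -> bool:
    raise OSError(f"cannot resolve {service}")


print(readiness(lambda service: True))   # all services reachable -> 200
print(readiness(dns_is_down))            # restarted POD during the outage -> 503
```

With a check like this, the restarted POD would stay unready until it can actually reach the other microservices. The trade-off is that an outage of a single downstream service would mark every gateway POD as unready.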

However, a better approach would be to rethink the architecture of the shop in terms of a self-contained system. Right now, the gateway has a very tight coupling to other microservices which makes it difficult to get it reliably working. Applying known patterns of eventual consistency and caches may help to reduce the coupling.
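
As one possible illustration of the cache idea, the gateway could keep the last successful response per microservice and fall back to it when a request fails. The class `LastKnownGood` and the product data below are made up for this sketch, and the example ignores questions such as how old a cached response may get.

```python
from typing import Callable, Dict, List


class LastKnownGood:
    """Serves the most recent successful response if a fetch fails."""

    def __init__(self, fetch: Callable[[str], List[str]]) -> None:
        self._fetch = fetch
        self._last: Dict[str, List[str]] = {}

    def get(self, service: str) -> List[str]:
        try:
            products = self._fetch(service)
            self._last[service] = products
            return products
        except OSError:
            # Degrade gracefully: stale data, or an empty list if nothing is cached.
            return self._last.get(service, [])


backend_up = True


def fetch_products(service: str) -> List[str]:
    if not backend_up:
        raise OSError(f"{service} is unreachable")
    return [f"{service}-product-1", f"{service}-product-2"]


cache = LastKnownGood(fetch_products)
fresh = cache.get("hotdeals")      # backend reachable, response is remembered

backend_up = False                 # outage begins
stale = cache.get("hotdeals")      # served from the remembered response
empty = LastKnownGood(fetch_products).get("hotdeals")  # new instance, nothing cached

print(fresh == stale, empty)
```

Note that an in-process cache like this starts empty after a container restart, so for the scenario in this post it would only turn errors into empty product lists. A shared or persistent cache would be needed to serve real data.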

Conclusion

So, bottom line: Setting up more complex experiments with steadybit's new experiment features is pretty simple and straightforward. We were able to set up a complex experiment in a few minutes and use it to test our system under specific turbulent conditions. In this way, we were able to uncover a weak spot and think about possible fixes. However, the Experiment Editor has many more features waiting for you. We will present some of them in blog posts in the coming weeks, covering e.g. the integration of monitoring solutions like Instana, New Relic, Datadog and Prometheus.

Written by

Manuel Gerding, Product Manager

Manuel is the youngster in the steadybit family and is constantly hungry for knowledge and new perspectives to broaden his horizons. After working almost a decade as a consultant and software engineer, he focusses his perspective on the needs and demands of the user. His mission is to build a great product that really makes customers' services more stable and valuable to their own customers. To regain energy, Manuel loves to read a good book, take a trip with his bike or do a short meditation session.