The Next Level of Chaos Engineering Experiments
The real world is complicated, and that is especially true for distributed systems. This pushed steadybit's experiment engine to its limits when we tried to represent the real world (AKA production) in a very simple experiment. In this blog post, we'll unleash a firework of new experiment features that address these shortcomings and are already loved by many of our users.
Let’s Replay Real Incidents
As an attentive reader, you may be familiar with our other blog posts covering simple experiments: surviving an AWS zone outage, testing exception handling of REST endpoints, and revealing the top 3 Kubernetes weak spots.
Now we want to test more complicated scenarios, such as a rippling container outage, a slow network combined with packet corruption, or a DNS failure that occurs at the exact moment a container restarts.
In this blog post, we will verify the last point and start by simulating a simple DNS failure.
For that, we use the online shopping demo application, which consists of a gateway microservice that periodically requests products from other microservices over HTTP (see the Shopping Demo's GitHub repository for more details).
Since steadybit has already discovered our system, we can use the familiar experiment wizard. Just follow the steps below to create this experiment yourself. If you are not using steadybit yet, let's meet in a short demo call.
Step 1: Create DNS Outage Experiment
We start on the steadybit dashboard to create an experiment, which leads us into the familiar wizard.
- We give the experiment a meaningful name (e.g. "Gateway survives DNS outage") and choose the "Global" area (which gives us full access to all discovered targets).
- We decide to attack containers and specify them by attributes.
The query narrows the DNS outage down to the gateway microservice only and leaves the remaining deployments untouched.
- Since we want to achieve maximum effect, we choose an impact of 100% in the following step of the wizard, so we attack the containers of both PODs of the deployment.
- In the last step, we apply the "DNS Failure" attack from the "network" category and finish the wizard by saving our experiment.
Step 2: The New Experiment Editor
So, now we land for the first time in our new experiment editor! It is pretty neat and allows us to add further attacks, checks, or actions to the experiment via drag and drop. We can decide for ourselves whether we want to run multiple attacks one after the other or simultaneously. We will make more use of it in step 4.
Step 3: Run Experiment to Check System
For now, we will simply run the experiment we have created and see how our application behaves.
This brings us into the new steadybit execution window (left side of the screenshot), which provides a lot of details about the system. For now, the system looks stable and everything seems to be working. We will take a closer look at the system in step 5 to analyze erroneous behavior.
Opening the demo shop in parallel (right side), we get the same impression: the shop is still working. We are safe!
But wait, are we really? As you probably know, DNS records are cached. A DNS failure is therefore not so bad as long as the DNS cache is up to date. But what happens to the system when the DNS cache isn’t populated yet?
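To make the caching effect concrete, here is a minimal Python sketch. It is not how the gateway or steadybit resolves names: `CachingResolver`, `ResolverDown`, the 30-second TTL, and the IP address are all assumptions made up for illustration. It only shows why a warm cache hides a DNS outage while a freshly started process with an empty cache does not.

```python
import time


class ResolverDown(Exception):
    """Raised by the stand-in upstream resolver during the simulated outage."""


class CachingResolver:
    """Hypothetical resolver that caches answers for a fixed TTL."""

    def __init__(self, upstream, ttl_seconds=30.0, clock=time.monotonic):
        self._upstream = upstream   # callable: hostname -> IP string
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}            # hostname -> (ip, expires_at)

    def resolve(self, hostname):
        entry = self._cache.get(hostname)
        if entry is not None and entry[1] > self._clock():
            return entry[0]         # cache hit: upstream is never contacted
        ip = self._upstream(hostname)  # cache miss: may raise ResolverDown
        self._cache[hostname] = (ip, self._clock() + self._ttl)
        return ip


def make_upstream(records, state):
    """Build a fake upstream resolver that fails while state['down'] is True."""
    def upstream(hostname):
        if state["down"]:
            raise ResolverDown(hostname)
        return records[hostname]
    return upstream


def demo():
    state = {"down": False}
    upstream = make_upstream({"toys-bestseller": "10.0.0.7"}, state)

    warm = CachingResolver(upstream)
    warm.resolve("toys-bestseller")   # cache is filled before the outage
    state["down"] = True              # the DNS failure starts

    results = {"warm": warm.resolve("toys-bestseller")}
    cold = CachingResolver(upstream)  # empty cache, like a restarted container
    try:
        cold.resolve("toys-bestseller")
        results["cold"] = "resolved"
    except ResolverDown:
        results["cold"] = "failed"
    return results
```

Calling `demo()` returns the cached address for the warm resolver and `"failed"` for the cold one, which is exactly the difference the extended experiment in the next step is designed to provoke.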
Step 4: Levelling Up the Experiment
Let's extend the scenario by adding more steps. First, we add a wait step of a few seconds before the "DNS failure" attack and an HTTP call check that runs in parallel from the very beginning. The check periodically requests the HTTP endpoint of the online shop and ensures that the DNS entries are cached when the DNS failure is simulated.
The next step is to crash one of the gateway containers (using a simple stop container attack), which is immediately restarted by Kubernetes. We can also check if the POD has been successfully marked as unhealthy and is ready again afterwards by using the POD count checks (see lane 3).
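If you prefer to think of the editor's lanes in code, the following Python sketch may help. It is purely illustrative and not steadybit's experiment format: the lane layout, step names, and durations are assumptions, and the steps only sleep instead of attacking anything. The point is the timing model: steps within a lane run one after the other, while lanes run simultaneously.

```python
import asyncio


async def step(name, seconds, log):
    """Stand-in for an attack, check, or wait: it only records its timing."""
    log.append(f"start {name}")
    await asyncio.sleep(seconds)
    log.append(f"end {name}")


async def lane(steps, log):
    # Steps within one lane run sequentially.
    for name, seconds in steps:
        await step(name, seconds, log)


async def run_experiment(lanes):
    log = []
    # All lanes start together and run concurrently.
    await asyncio.gather(*(lane(steps, log) for steps in lanes))
    return log


# Assumed layout, with durations shortened to fractions of a second.
LANES = [
    [("http-call", 0.06)],                                # runs the whole time
    [("wait-for-cache", 0.01), ("dns-failure", 0.05)],    # outage after warm-up
    [("wait-for-outage", 0.02), ("stop-container", 0.01),
     ("pod-count-check", 0.03)],                          # restart during outage
]


def demo():
    return asyncio.run(run_experiment(LANES))
```

In the log returned by `demo()`, the container stop begins after the DNS failure has started and before it has ended, which is the overlap that makes this scenario interesting.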
All right, ready to go! Let's see how our system behaves, fingers crossed that the user won’t notice any downtime. We'll run the experiment again.
Step 5: Boom! Weak Spot Revealed!
Uh-oh... This doesn't look so good. The experiment failed because the HTTP call received some erroneous responses. Let's take a closer look and analyze what happened.
The execution view already shows all necessary details to find out what happened here:
- The execution log shows which step is currently running, which was successful, and which failed. We can see that the HTTP call check was below the default threshold of 100% successful requests and thus failed.
- Checking the Deployment Replica Count, we can see that a POD was not ready for a few seconds. This is plausible since we terminated the underlying container at the same time (compare the timing with the execution widget).
- We can also see the Kubernetes behavior in the Kubernetes Event Log, which informs us of an unhealthy endpoint and a restart of the gateway container. The Instana Event Log, in turn, proves that our monitoring works and detected an increase in the number of erroneous calls!
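The pass/fail logic of the HTTP call check described above can be sketched in a few lines of Python. This is an assumption about how such a check could be evaluated, not steadybit's implementation: the function names are hypothetical, and treating only 2xx status codes as successful is our simplification.

```python
def success_rate(status_codes):
    """Percentage of responses with a 2xx status code (assumed success rule)."""
    if not status_codes:
        raise ValueError("no responses recorded")
    ok = sum(1 for code in status_codes if 200 <= code < 300)
    return 100.0 * ok / len(status_codes)


def check_passes(status_codes, threshold_percent=100.0):
    """The check passes only if the success rate reaches the threshold."""
    return success_rate(status_codes) >= threshold_percent


# Made-up sample: 18 good responses and 2 server errors during the restart.
SAMPLE = [200] * 18 + [500] * 2
```

With the default threshold of 100%, `check_passes(SAMPLE)` is false even though only two of twenty requests failed. A single erroneous response is enough to fail the experiment, which is the strictness you want when the hypothesis is that the user won't notice any downtime.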
Putting the pieces of the puzzle together, we can conclude that the restarted container was marked as ready again as soon as the configured readiness probe succeeded.
Nevertheless, due to the DNS failure and the missing DNS cache, it was not able to reach any of the other microservices (e.g. toys-bestseller) via HTTP.
However, Kubernetes assumes it has two ready PODs that are both working fine, so it routes HTTP requests to both. Thus, it also sends requests to the newly started container and has no idea that it is not working correctly because its DNS cache is empty.
We can even validate the hypothesis by re-running the experiment a few times, since Kubernetes routing may also choose the healthy gateway POD to serve a request.
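A small simulation illustrates why only some requests fail. It is a deliberate simplification: we assume each request is sent to one of the ready PODs chosen uniformly at random, which only approximates real Kubernetes service routing, and the POD names are made up.

```python
import random


def simulate_requests(n_requests, pods, seed=0):
    """Count failed requests.

    pods maps a POD name to True if it can actually serve requests.
    Every POD in the mapping is considered "ready" and receives traffic.
    """
    rng = random.Random(seed)
    names = sorted(pods)
    failures = 0
    for _ in range(n_requests):
        target = rng.choice(names)  # routing cannot see the broken DNS cache
        if not pods[target]:
            failures += 1
    return failures


# One healthy gateway POD and one restarted POD with an empty DNS cache.
PODS = {"gateway-a": True, "gateway-b": False}
```

Calling `simulate_requests(1000, PODS)` yields failures for roughly half of the requests. With only a handful of requests per run, it is quite possible that all of them land on the healthy POD, which is why a single run may look fine and repeating the experiment is worthwhile.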
Step 6: Fixing the Weak Spot
Now that the weak spot has been exposed, we have several options to fix it.
The simplest approach would be to extend the health endpoint of the gateway to also check the availability of the endpoints of the other microservices.
However, a better approach would be to rethink the architecture of the shop in terms of a self-contained system.
Right now, the gateway is very tightly coupled to the other microservices, which makes it difficult to get it working reliably.
Applying known patterns such as eventual consistency and caching may help to reduce the coupling.
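The simplest fix mentioned above, a health endpoint that also looks at downstream dependencies, could be sketched as follows. This is not the demo shop's actual code: `can_resolve` and `readiness` are hypothetical helpers, and we only test whether each dependency's hostname resolves, whereas a real check would more likely call each service's own health endpoint over HTTP.

```python
import socket


def can_resolve(hostname):
    """Return True if the hostname currently resolves via the system resolver."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def readiness(dependencies, probe=can_resolve):
    """Return (http_status, details) for a readiness endpoint.

    The POD reports 503 until every dependency passes the probe, so
    Kubernetes would keep it out of rotation in the meantime.
    """
    details = {name: probe(name) for name in dependencies}
    status = 200 if all(details.values()) else 503
    return status, details
```

The `probe` parameter is injectable so the logic can be exercised without a network, for example `readiness(["toys-bestseller"], probe=lambda host: False)`, which returns a 503 status. Keep in mind that this ties the gateway's readiness to its dependencies, which is the same tight coupling criticized above, so we would treat it as a stopgap rather than the target design.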
So, bottom line: setting up more complex experiments with steadybit's new experiment features is simple and straightforward. We were able to set up a complex experiment in a few minutes and use it to test our system under specific turbulent conditions. In this way, we uncovered a weak spot and could think about possible fixes. The Experiment Editor has many more features waiting for you, though. We will present some of them in blog posts in the coming weeks, covering, for example, the integration of monitoring solutions like Instana, New Relic, Datadog, and Prometheus.
Manuel Gerding, Product Manager