Blog

How to validate your Kubernetes Liveness Probes with Chaos Engineering

02.11.2021 by Dennis Schulte - 5 min read

In this blog post, we'll take a look at what Liveness Probes are for in the first place and how we can use steadybit to verify that they are working correctly.

Prerequisites

For the hands-on part of this post, you need the following:

  • A Kubernetes cluster. If you need a cluster, you can set up a local one by following these steps.
  • An application that runs on Kubernetes. We use our Shopping Demo showcase application.
  • A tool to inject http errors into the service. We use steadybit for that in our example, but you can also use any other tool that is capable of doing that. Maybe the Chaos Monkey for Spring Boot is a helpful tool in that space, which, by the way, was created by our CEO Benjamin Wilms.

Basics

Liveness and readiness probes are essential for the successful operation of highly available and distributed applications. Over time, services can get into a state where the only solution is to restart the affected service. Liveness probes are there for exactly this case. They perform a specified liveness check. This can be for example the call of a health endpoint or the execution of a check command. If this check is not successful, Kubernetes restarts the affected pod.

Defining a Liveness Probe

But first, let's look at how to define a liveness probe in the first place. The Kubernetes Documentation provides many more details.

apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: steadybit-demo
  labels:
    run: fashion-bestseller
  name: fashion-bestseller
spec:
  replicas: 1
  selector:
    matchLabels:
      run: fashion-bestseller-exposed
  template:
    metadata:
      labels:
        run: fashion-bestseller-exposed
    spec:
      serviceAccountName: steadybit-demo
      containers:
        - image: steadybit/fashion-bestseller
          imagePullPolicy: Always
          name: fashion-bestseller
          ports:
            - containerPort: 8080
              protocol: TCP
          livenessProbe:
            timeoutSeconds: 1
            httpGet:
            path: /actuator/health/liveness
            port: 8080

See here for the complete example.

The example is a Spring Boot application that is deployed to Kubernetes using this manifest. To make Kubernetes continuously check the liveness of the application, we use the Health Actuator Endpoint provided by Spring Boot. If the endpoint stops responding or returns an error, the affected pod is restarted in the hope that everything will work again afterwards. You can quickly run into a restart loop due to buggy code or unforeseen circumstances, which may affect the availability of the entire application. However, you should verify that this works as intended with help of chaotic tests, especially for the critical areas, where the effects cannot be easily verified by static tests.

Validate Liveness Probe with an experiment

Ok, let's start the experiment and define the hypothesis first:

“When the health check of a pod fails, the designated pod will be restarted and serve traffic again.”

Define Experiment

The target of our experiment this time is at the application level. Accordingly, we select it here:

Select Application

This time we attack all instances of the application. Since the application runs on Kubernetes, this corresponds exactly to the number of pods:

Set Blast Radius

For attack, we choose the "Controller Delay", which allows us to inject dedicated latency at the endpoint level. We set the response time value to 2 seconds so that we can later check if Kubernetes detects the faulty service and restarts it. As seen above, the sample timeout (timeoutSeconds) is set to 1 second, which means that all requests exceeding 1 second will result in a restart of the pod.

Select Attack

After starting the experiment we need to wait a few moments:

Execution View

Great, in the steadybit Live View we see that Kubernetes restarted the two affected pods after the liveness probe failed:

Execution View

With the help of this experiment we have very easily proven that the liveness probes are correctly configured and functional. Of course, it would be interesting to do further experiments in the next step to check if everything is still good in this case. Besides the controller delay shown here, there are many other attacks that could be tried. An obvious one would be the injection of errors. What do you think? How does the service behave in this case?

Written by

Dennis Schulte, CTO

Dennis lost his soul to coding two decades ago and earned his living as a developer and software architect, too. He also shared some wisdom as speaker and author and loves to catch a breeze while kitesurfing.
@denschu Dennis Schulte

Recent articles