There are a lot of “what ifs” when it comes to configuring any Kubernetes platform.
Aside from the “how will this work” and “what’s the future of using this tool”, there’s also the conversation of “what happens if this thing fails?”.
Although engineers spend hours on implementation and maintenance, we very rarely see engineering in the Kubernetes space focused on worst-case scenarios and, more importantly, on testing those scenarios to prepare for anything that can come our way.
That’s where Resilience and Chaos Engineering come into play.
In this blog post, you’ll learn what resilience engineering is, what chaos engineering is, a few tools to help you on your journey, and how to implement them.
What Is Resilience Engineering
In the days of bare-metal and virtual machines, we saw a lot more Disaster Recovery (DR) implementations. These implementations were primarily around the idea of backups. For example, you’d have a virtual machine with a particular “state”, and that state consisted of ephemeral and non-ephemeral data.
💡 Stateless versus stateful is a different distinction from ephemeral versus non-ephemeral. In Kubernetes, stateful applications are workloads whose Pods can’t lose the unique identifier associated with them. Non-ephemeral workloads are workloads that require a volume (persistent storage) so that data survives Pod restarts.
In the realm of Kubernetes, we lost the mindset around the need for DR. As we saw in a recent post around GCP, which you can find here, even if your data is on Kubernetes in the cloud, having the data backed up is incredibly important.
Building on the same concept as DR, Resilience Engineering is a “no assumption” approach. Much like engineers need to perform backups and test those backups, Resilience Engineering means testing how well your Kubernetes environment performs under extreme circumstances.
The circumstance could be anything from a cluster going down to a containerized application under heavy load.
In Resilience Engineering, engineers need the ability to test these circumstances and vet the process of fixing the issue in case it occurs in production.
Chaos Engineering
If you haven’t heard of Resilience Engineering, you may have heard of Chaos Engineering. In terms of how they differ, they don’t. It’s the same method of practice, just different phrasing. If you hear Resilience Engineering or Chaos Engineering, chances are it’s the same concept.
Tools To Help With Chaos And Resilience
At the time of writing, three commonly used tools in the Resilience and Chaos Engineering realm are:
- Gremlin
- Chaos Mesh
- LitmusChaos
💡 Each tool has something called “experiments”, and these experiments are the tests that you run on your cluster to see what breaks.
Gremlin has been around in the enterprise the longest of these three tools. It’s a paid tool rather than a free one, so it may be what you’re looking for if you want a supported, non-open-source option.
Chaos Mesh is designed for Kubernetes. There are two components: the Chaos Operator and the Chaos Dashboard. The Operator orchestrates the tool within Kubernetes (it uses the same style of structure that you’d see in other extendable tools designed for Kubernetes). The Dashboard is a web UI for managing, designing, and monitoring experiments. Chaos Mesh is a CNCF Incubating project.
LitmusChaos is a CNCF Incubating project. LitmusChaos, much like Chaos Mesh, is designed for Kubernetes and requires a cluster of version v1.17 or later and 20GB of persistent storage (a Volume).
💡 We had a few of the maintainers from LitmusChaos on the Kubernetes Unpacked podcast. You can have a listen here: https://packetpushers.net/podcasts/kubernetes-unpacked/ku035-chaos-engineering-in-kubernetes-and-the-litmus-project/
All said and done, each of these tools does more or less the same thing. Some are paid, some focus outside of Kubernetes, and some are designed specifically for Kubernetes.
Let’s see how Chaos Mesh and LitmusChaos work.
Installing Chaos Mesh
Now that you’ve gone through some theory about Resilience and Chaos Engineering, let’s see how to get a few of the tools up and running in your environment.
You’ll start off with Chaos Mesh.
First, add the Helm Repo.
helm repo add chaos-mesh https://charts.chaos-mesh.org
When installing Chaos Mesh, you can choose between different configurations based on the underlying platform and container runtime.
Go to this link and choose the runtime that works best for you.
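If you’re not sure which container runtime your nodes use, the cluster can tell you. The following is a minimal sketch, assuming kubectl is already pointed at the target cluster; list_runtimes is a hypothetical helper name, not part of Chaos Mesh.

```shell
# Print each node's name and the container runtime it reports,
# for example "cri-o://..." or "containerd://...".
# list_runtimes is a hypothetical helper name used for illustration.
list_runtimes() {
  kubectl get nodes \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.nodeInfo.containerRuntimeVersion}{"\n"}{end}'
}
```

Running list_runtimes should print one line per node. The prefix before :// is the runtime to match against the installation options.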
For example, if your cluster is running CRI-O, the following installation will be for you.
helm install chaos-mesh chaos-mesh/chaos-mesh \
-n=chaos-mesh --set chaosDaemon.runtime=crio \
--set chaosDaemon.socketPath=/var/run/crio/crio.sock \
--create-namespace
Verify the installation.
kubectl get all -n chaos-mesh
You should see the Chaos Mesh components, including the controller manager, the daemon, and the dashboard, with their Pods in a Running state.
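With Chaos Mesh running, an experiment is defined as a Kubernetes resource. As a minimal sketch (it is not part of the original walkthrough), the snippet below writes a PodChaos manifest that kills one Pod. The default namespace, the app: web label, the experiment name, and the file name pod-kill.yaml are all hypothetical placeholders; point them at a test workload you can afford to lose.

```shell
# Write a minimal Chaos Mesh PodChaos experiment to a file.
# Assumptions: the target namespace "default" and the label "app: web"
# are placeholders for a test workload in your own cluster.
cat > pod-kill.yaml <<'EOF'
apiVersion: chaos-mesh.org/v1alpha1
kind: PodChaos
metadata:
  name: pod-kill-example
  namespace: chaos-mesh
spec:
  action: pod-kill   # kill the selected Pod
  mode: one          # pick one matching Pod at random
  selector:
    namespaces:
      - default
    labelSelectors:
      app: web
EOF
```

Applying the file with kubectl apply -f pod-kill.yaml should start the experiment, and kubectl get podchaos -n chaos-mesh should list it. If the target Pod is managed by a Deployment, you’d expect a replacement Pod to be scheduled shortly after the kill.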
Installing LitmusChaos
Next, let’s try out LitmusChaos using Helm.
First, add the Helm Repo.
helm repo add litmuschaos https://litmuschaos.github.io/litmus-helm/
Next, install LitmusChaos.
helm install chaos litmuschaos/litmus \
--namespace=litmus \
--set portal.frontend.service.type=NodePort \
--create-namespace
Verify the installation.
kubectl get all -n litmus
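If you’d rather script the verification step than read the output by eye, kubectl wait can block until the Pods are ready. This is a sketch under a few assumptions: wait_for_pods is a hypothetical helper name, the 300s default timeout is an arbitrary choice, and every Pod in the namespace is expected to reach the Ready condition.

```shell
# Block until every Pod in a namespace reports Ready, or fail on timeout.
# wait_for_pods is a hypothetical helper name, not part of either tool.
wait_for_pods() {
  namespace="$1"
  timeout="${2:-300s}"   # default timeout is an arbitrary choice
  kubectl wait --for=condition=Ready pods --all \
    --namespace "$namespace" --timeout="$timeout"
}
```

For example, wait_for_pods litmus waits on the LitmusChaos install, and wait_for_pods chaos-mesh 120s does the same for Chaos Mesh with a shorter timeout. Pods that have already run to completion never become Ready, so this approach may time out in namespaces that contain finished Jobs.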
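To access the LitmusChaos dashboard, use the frontend service. List the Services to confirm its name (chaos-litmus-frontend-service in the command below, based on the release name chaos), then forward local port 8080 to it.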
kubectl get svc -n litmus
kubectl port-forward svc/chaos-litmus-frontend-service -n litmus 8080:9091
The default credentials are:
- Username: admin
- Password: litmus
A Mention Of Observability
Monitoring and observability aren't specific to Resilience and Chaos Engineering; they apply to all aspects of engineering. Even so, they play a crucial role here.
Understanding how your environment is performing from a tracing perspective helps you see how healthy the application stacks are, and whether and where applications are failing. Once you know which applications are having health issues, you can begin to mitigate the risk. However, you cannot mitigate risk without understanding the full extent of what will make the application unhealthy.
Resilience and Chaos Engineering practices can help you with the testing phase to see what stacks aren’t as resilient as you expected and ensure application traces are healthy in production.