Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system’s capability to withstand turbulent and unexpected conditions.
Essentially, we’re going to break things on purpose… The difference, though: chaos engineering is traditionally done to better understand a distributed system, application or service, but I’m going to break my Kubernetes cluster itself, to better understand its inner workings and features.
In an attempt to improve my Kubernetes troubleshooting skills, I’ve written a little script (super basic for now) that randomly creates a little bit (sometimes a lot) of chaos in my Kubernetes cluster; it’s then up to me to go and fix it. And so, we learn…
Why would you do this?
Well, in a production environment there’s a lot of benefit to breaking things on purpose. In a large-scale production environment, with lots of moving parts — microservices, multiple clusters, multiple regions, etc. — we can only theorise about what the user experience will be if certain components fail. But intentionally and proactively injecting failure in a controlled manner, and building in backup or failsafe mechanisms, helps ensure that we:
- better understand our applications/services and how they behave during failure, and on our terms, not during an unanticipated outage.
- encourage confidence in our systems.
- provide a better service for our customers.
But for me, why would I do this? Purely for educational purposes.
How does it work?
It’s a little command-line application written in BASH (for now). It can be added as a cron job that executes periodically and, on completion, sends a push notification. It’s then up to the administrator to log on and fix the issue. Fun, right? :/
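To give an idea of the scheduling, a crontab entry for it might look something like the below — note the script path and log location are placeholders, not the repo's actual layout:

```
# Run the chaos script every 30 minutes, appending output to a log
*/30 * * * * /opt/kube-chaos/kube-chaos.sh >> /var/log/kube-chaos.log 2>&1
```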
Ideally, in a normal Kubernetes chaos engineering script, you would exclude your kube-system namespace; this is, of course, to ensure that if your script decides to start killing pods, you don’t completely break your cluster beyond repair. In our case though, seeing as we’re breaking Kubernetes to learn how to fix it, we can decide to intentionally include the kube-system namespace. Of course, it goes without saying, but I’m going to say it anyway: this is not for production use.
What is this chaos-script breaking?
This is something that will constantly be changing as I learn more and update the script. The table below will be used for logging changes between versions.
Version Control Table:

| Version | Changes |
| --- | --- |
| v0.1 | Script only affects the specific namespace defined in the `CHAOS_SPACE` variable |
| | Introduces network failure |
| | Kills all objects (i.e. ReplicaSets, Deployments, Pods) |
| Link | Description |
| --- | --- |
| Git Repo | Code repository for the kube-chaos script |
| Docker Hub | A small demo containerised website, used as the application to wreak havoc on |
Sample Deployment Files
```yaml
---
apiVersion: v1
kind: Namespace
metadata:
  name: chaos
```
```yaml
---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    name: chaos-site
  namespace: chaos
  name: chaos-deploy
spec:
  selector:
    matchLabels:
      name: chaos-site
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: chaos-site
    spec:
      containers:
      - image: philipsmit/kubechaos-site:latest
        imagePullPolicy: Always
        name: chaos-site
        ports:
        - containerPort: 80
```
```yaml
---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: chaos-site
  name: chaos-svc
  namespace: chaos
spec:
  ports:
  - name: chaos-http
    nodePort: 32080
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    name: chaos-site
  type: NodePort
```