Kube-Chaos Project

Chaos engineering is the discipline of experimenting on a software system in production to build confidence in the system’s capability to withstand turbulent and unexpected conditions.

Essentially, we’re going to break things on purpose… There is one difference, though: chaos engineering is traditionally done to better understand a distributed system, application or service, whereas I’m going to break my Kubernetes cluster to better understand its inner workings and features.

In an attempt to improve my Kubernetes troubleshooting skills, I’ve written a little script (super basic for now) that randomly creates a little bit (sometimes a lot) of chaos in my Kubernetes cluster. It’s then up to me to go and fix it. And so, we learn…

Why would you do this?

Well, in a production environment there’s a lot of benefit to breaking things on purpose. In a large-scale production environment with lots of moving parts (microservices, multiple clusters, multiple regions, etc.), we can only theorise about what the user experience will be if certain components fail. Intentionally and proactively injecting failure in a controlled manner, and building in backup or failsafe mechanisms, helps us:

  • better understand our applications/services and how they behave during times of failure, on our own terms rather than during an unanticipated outage.
  • encourage confidence in our systems.
  • provide a better service for our customers.

But for me, why would I do this? Purely for educational purposes.

How does it work?

It’s a little command-line application written in Bash (for now). It can be added as a cron job that executes periodically, and on completion it can send a push notification. It’s then up to the administrator to log on and fix the issue. Fun, right? :/
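To make that flow a bit more concrete, here is a minimal sketch of one "chaos round": pick an action at random, then tell the administrator about it. This is not the actual kube-chaos script. The function names (pick_random, notify, chaos_round) and the action names are made up for illustration, and notify simply prints a message where a real push service would be called.

```shell
#!/usr/bin/env bash
# Sketch only: every name here is hypothetical, not taken from kube-chaos.

# Print one of the arguments, chosen at random.
pick_random() {
  [ "$#" -gt 0 ] || return 1
  # Drop a random number of leading arguments (0 to count-1);
  # whatever ends up first is the pick.
  shift $(( RANDOM % $# ))
  printf '%s\n' "$1"
}

# Placeholder for the push notification: it only prints the message.
# Swap the body for whichever notification service you use.
notify() {
  printf 'kube-chaos: %s\n' "$1"
}

# One chaos round: choose an action from the arguments, then notify.
chaos_round() {
  local action
  action="$(pick_random "$@")" || return 1
  # A real script would run the chosen action here.
  notify "ran '${action}', go fix it"
}
```

Calling chaos_round delete-services kill-pods prints a message naming whichever action was picked. To run a script like this periodically, a crontab entry along the lines of */30 * * * * /path/to/kube-chaos.sh would fire it every 30 minutes (the path is a placeholder).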

Wreaking Havoc

Ideally, in a normal Kubernetes chaos engineering script, you would exclude the kube-system namespace. This ensures that if your script decides to start killing pods, you don’t break your cluster beyond repair. In our case though, seeing as we’re breaking Kubernetes to learn how to fix it, we can intentionally include the kube-system namespace. It goes without saying, but I’m going to say it anyway: this is not for production use.
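The include/exclude decision can be sketched as a small filter over a list of namespace names. This is an illustration only: INCLUDE_SYSTEM and eligible_namespaces are hypothetical names, not something the kube-chaos script is known to define.

```shell
#!/usr/bin/env bash
# Sketch only: INCLUDE_SYSTEM is a hypothetical toggle.
# 0 (the default here) protects kube-system; 1 makes it fair game.
INCLUDE_SYSTEM="${INCLUDE_SYSTEM:-0}"

# Read namespace names on stdin, one per line, and print the ones
# a chaos script would be allowed to touch.
eligible_namespaces() {
  if [ "$INCLUDE_SYSTEM" = "1" ]; then
    cat
  else
    # -x matches whole lines only, so "kube-system-backup" would survive.
    # grep exits non-zero when nothing is left, which is not an error here.
    grep -v -x 'kube-system' || true
  fi
}

# Possible usage against a live cluster:
#   kubectl get namespaces --no-headers -o custom-columns=NAME:.metadata.name \
#     | eligible_namespaces
```

Keeping the default at "exclude" means you have to opt in to breaking kube-system, which seems like the safer way round even on a learning cluster.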

What is this chaos-script breaking?

This will change constantly as I learn more and update the script. The table below logs the changes between versions.

Version Control Table:

Version Number | Changes
v0.1           | Script only affects the specific namespace defined in the CHAOS_SPACE variable
               | Introduces network failure
               | Kills all objects (i.e. ReplicaSets, Deployments, Pods)
               | Deletes services
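As a rough idea of what two of the v0.1 actions could look like, here is a sketch built on standard kubectl delete commands. Only the CHAOS_SPACE variable name comes from the table above. The function names and the DRY_RUN switch are assumptions added for illustration, and the network-failure action is left out because the post does not say how it is done.

```shell
#!/usr/bin/env bash
# Sketch only: not the actual kube-chaos code.

# Namespace to target; defaults to the sample "chaos" namespace.
CHAOS_SPACE="${CHAOS_SPACE:-chaos}"
# Hypothetical safety switch: 1 (the default here) prints commands
# instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '[dry-run] %s\n' "$*"
  else
    "$@"
  fi
}

# "Kills all objects": delete Deployments, ReplicaSets and Pods.
kill_all_objects() {
  run kubectl delete deployments,replicasets,pods --all --namespace "$CHAOS_SPACE"
}

# "Deletes services": delete every Service in the namespace.
delete_services() {
  run kubectl delete services --all --namespace "$CHAOS_SPACE"
}
```

With the defaults, delete_services only prints the kubectl command it would run. Setting DRY_RUN=0 makes it real, so keep CHAOS_SPACE pointed at a namespace you are happy to lose.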

Resources

Resource   | Description                                                                    | Link
Git Repo   | Code repository for the kube-chaos script                                      | Git Repo
Docker Hub | A small demo containerised website, used as the application to wreak havoc on | Docker Hub

Sample Deployment Files

Namespace

---
apiVersion: v1
kind: Namespace
metadata:
  name: chaos

Deployment

---
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    name: chaos-site
  namespace: chaos
  name: chaos-deploy
spec:
  selector:
    matchLabels:
      name: chaos-site
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      labels:
        name: chaos-site
    spec:
      containers:
      - image: philipsmit/kubechaos-site:latest
        imagePullPolicy: Always
        name: chaos-site
        ports:
        - containerPort: 80

Service

---
apiVersion: v1
kind: Service
metadata:
  labels:
    name: chaos-site
  name: chaos-svc
  namespace: chaos
spec:
  ports:
  - name: chaos-http
    nodePort: 32080
    port: 80
    protocol: TCP
    targetPort: 80
  selector:
    name: chaos-site
  type: NodePort