End Result = Automate

Nobody is a master at anything from day one.

Anything you attempt for the first time takes a number of iterations before you’re really good at it. I work in new environments, on new networks and on new systems fairly often, and I do work for companies based in countries that are always a bit behind the curve of international IT trends. When there is no existing integration between systems and no orchestration in place, I find that it always takes much longer to automate from the start. And, tying back to the first point, when you’re still figuring out how systems work together, it’s usually a lot faster to make things work by hand than to automate the functionality straight off the starting blocks.

I have a theory, and at the moment it’s just a working theory: our team’s work can be classified into two distinct ‘waves’. The first is the ‘figuring it out’ phase: the ‘cheap and nasty’, ‘just get it done’, ‘make the customer happy’ phase. The second is the ‘clean up after the first wave’, ‘document’, ‘refactor’ and ‘automate as much as possible’ phase. I don’t know about you, but when I write documentation on a specific function, it forces me to retrace my steps, and I learn a lot in the process. The goal is for everyone on the team to be on the same page regarding the specific task: it should be clean, clearly documented and automated, or at least automated as much as possible.

As a very simple example: if you’ve never written a bash script or an Ansible playbook in your life and you have to add a few users to, let’s say, 10 servers, then it might be quicker to just SSH to every server and add the users manually. But if you have to add and remove users every day, then it’ll be in your best interest to learn how to do it faster. The same goes for other, more complex tasks, like expanding your API footprint: deploying the API to a new node, adding it to the load-balancers, adding it to the monitoring systems, and finally getting it live. Once you’ve done all of this manually, once or a few times, you know all the steps and variables needed to achieve it. So the next logical step is to automate the roll-out of the API to another node. Also, once it’s fully automated, we can add logic to decide when this should occur.
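As a rough sketch of what that next step might look like for the user example, the Python below builds the same `useradd` command for every server and runs it over SSH. The host names and user names are hypothetical, and it assumes key-based SSH and passwordless sudo are already in place. It defaults to a dry run that only prints the commands, so nothing changes until you ask it to.

```python
import shlex
import subprocess

# Hypothetical inventory and user list; replace with your own.
HOSTS = [f"server{n:02d}.example.com" for n in range(1, 11)]
USERS = ["alice", "bob"]


def build_commands(hosts, users):
    """Return one ssh command (as an argument list) per host/user pair."""
    commands = []
    for host in hosts:
        for user in users:
            remote = f"sudo useradd --create-home {shlex.quote(user)}"
            commands.append(["ssh", host, remote])
    return commands


def add_users(hosts, users, dry_run=True):
    """Print each command; run it as well when dry_run is False."""
    for command in build_commands(hosts, users):
        print(" ".join(command))
        if not dry_run:
            # check=True raises CalledProcessError if ssh or useradd fails.
            subprocess.run(command, check=True)


if __name__ == "__main__":
    add_users(HOSTS, USERS)  # dry run: prints the commands, changes nothing
```

Once the printed commands look right, calling `add_users(HOSTS, USERS, dry_run=False)` runs them for real. An Ansible playbook would do the same job with less plumbing; this is only meant to show how small the jump from manual to scripted can be.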

The two-wave structure is fine if you have enough people in your team. But what about the tasks that you’re solely responsible for? Is this a feasible way to work for a single person? At first I thought there was no way; I’d never have time for the second wave because I’m always putting out fires. Then, starting small, I carved out time in my diary (30 minutes, depending on how the day is going) just to automate or streamline some of my tasks. Once you are able to continuously automate (or streamline) your tasks, it has a positive snowball effect on your day-to-day.

Of course, once you’re comfortable in your automation language of choice, these two waves can run more or less in parallel.

So let’s put some names to this and stop calling it ‘streamlining’ or ‘automating’. These are some of the methods I’ve used:

  • Cron a script to monitor key performance indicators (KPIs) on specific devices and send a daily email with the important ones.
    • This can be helpful, but people tend to stop reading emails, especially daily monitoring emails.
  • Ansible! Ansible all the way.
    • You’ll see me talk about Ansible a lot here. I think it’s the best thing since escalators. I’m a firm believer that Ansible Tower could replace direct SSH access for the majority of users in a number of organisations. Junior SysAdmins will soon, if not already, get a list of playbooks that they have access to, and that’s all that they’re allowed to do. And on the networking front, I predict that an engineer won’t have to know the difference in commands between a Cisco and a Juniper. They will just need to know Ansible commands.
  • Telegram bots, an improvement on the old monitoring script that sends an email. They are super useful.
    • They can keep you informed about your KPIs, or you can ask them to go check a system for you and report back. Ansible also offers a Telegram module, so a playbook can report back with its results when it is done, and a bot can be built to execute a playbook when needed. It just depends on how much time you spend making your bot. Look after your bots and they will look after you…
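To make the KPI and Telegram ideas a little more concrete, here is a minimal sketch that collects one KPI (disk usage), formats a short report, and builds a `sendMessage` request for the Telegram Bot API using only the standard library. The token, chat id and choice of KPI are placeholders I’ve made up for illustration; the actual send is left commented out so the script is safe to run as-is.

```python
import json
import shutil
import urllib.request

# Hypothetical values: use the token from @BotFather and your own chat id.
BOT_TOKEN = "123456:REPLACE_ME"
CHAT_ID = "REPLACE_ME"


def disk_usage_percent(path="/"):
    """Return how full the filesystem holding `path` is, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total


def format_report(kpis):
    """Turn a {name: value} mapping into one line per KPI, sorted by name."""
    return "\n".join(f"{name}: {value:.1f}" for name, value in sorted(kpis.items()))


def build_request(token, chat_id, text):
    """Build (but do not send) a Bot API sendMessage request."""
    url = f"https://api.telegram.org/bot{token}/sendMessage"
    payload = json.dumps({"chat_id": chat_id, "text": text}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    return urllib.request.Request(url, data=payload, headers=headers)


if __name__ == "__main__":
    report = format_report({"disk_used_percent": disk_usage_percent()})
    print(report)
    # Uncomment once BOT_TOKEN and CHAT_ID are real:
    # urllib.request.urlopen(build_request(BOT_TOKEN, CHAT_ID, report), timeout=10)
```

Saved under a name of your choosing and scheduled from cron, this covers the ‘daily report’ half of the idea. Having the bot respond to requests would need a loop that reads incoming messages, which is left out here.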

There are of course plenty more, and with everything evolving as quickly as it does, there are sure to be new tools competing with these.

Talking about my team specifically: if we focus on keeping the customer happy and getting the work out as quickly as possible in the first wave, we can focus on optimizing it during the second. But what I can see happening is that, with more experience in a particular customer’s network and with more integration between the customer’s systems, the efficiency of the first wave improves, and the second wave ends up being purely a documentation and knowledge-sharing wave.

I’m a firm believer in the 1% rule (aka Marginal Gains). Don’t focus on streamlining tasks overall; rather, focus on improving every little aspect of your work by 1%. Every bit helps, and soon the accumulated improvements will make a massive difference.
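For a feel of how quickly small gains add up, here is a back-of-the-envelope sketch. It assumes (my assumption, not a law) that each improvement shaves 1% off whatever time a task still takes, so the gains compound.

```python
def remaining_time(rounds, gain=0.01):
    """Fraction of the original time left after `rounds` improvements of `gain` each."""
    return (1 - gain) ** rounds


if __name__ == "__main__":
    for rounds in (10, 50, 100):
        share = remaining_time(rounds)
        print(f"{rounds:3d} improvements: {share:.1%} of the original time")
```

Under that assumption, fifty 1% improvements leave a task at roughly 60% of its original time, and a hundred leave it at a bit over a third.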