Old MacDonald had a farm… (and what that has to do with Cloud) – A story about Infrastructure as Code (IaC)

Old MacDonald had a farm…

… and on that farm he had a cat

I really love cats, especially my own cat “Pfötchen”. If she were ill, I would spend all my money and do almost anything to make her feel better. I probably have the potential of becoming an obsessive male “cat lady” living with a dozen cats.

My life as an IT person was a lot like life with my cat for many years. Whenever I noticed a hiccup in an application, I spent a lot of time and brain capacity researching what caused the behavior. When I found a solution, I implemented it, tested it and rolled it out. I basically cared for those applications as I would for an ill pet. The only problem is that as an IT person you usually have more than one “pet”.

Picture 1: An IT person caring about all their pets

… and on that farm he had some cows

Since I started working in the cloud business, my behavior has changed. I recognized that the principle of factory farming, as much as I hate it because it’s not species-appropriate, brings a lot of advantages to the IT business. Now I don’t treat my “animals” as pets but rather as cattle in a factory farm.

If a factory farmer notices an outbreak of a deadly disease among their cows, they will separate and kill the ill animals so they don’t endanger the business. Because the space in the barn is now empty, they’ll go to the market and buy a new cow. Since the farmer regularly culls ill animals, they’ve negotiated a special price in a contract with a market seller, so they can replace any breed of cow immediately.

Picture 2: The ill cow that was killed gets replaced immediately by a new, healthy one

Now imagine you have coded your server: a script installs and configures all the software you need to run your application. After the script has been tested, the production environment is set up fully automatically. The application’s data is stored, backed up and restored automatically. Now, when a server has a hiccup, or even worse an outage, the principle of the factory farmer killing their ill animals comes in handy. The goal in a failure situation is to bring the production environment back up. Analogous to the farmer killing their cow, you shut down the faulty server and rebuild it.

The server must be described as precisely as a cow in the farmer’s contract with the market seller. The more details of the server are described, the fewer surprises will occur. A good description is the installation script mentioned above, which produces the server automatically.
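As a sketch of such a “server description”, here is a minimal install script. Everything in it is invented for illustration (paths, package names, config values), and it writes into a sandbox directory instead of the real filesystem, so it is safe to run:

```shell
#!/bin/sh
# Minimal sketch of a server described as code. A real script would call
# the package manager and write to /etc instead of a sandbox directory.
set -eu

ROOT="${1:-/tmp/demo-server}"            # sandbox root, safe to delete

mkdir -p "$ROOT/etc/myapp" "$ROOT/var/lib/myapp"

# Write the application config. Running the script twice yields exactly
# the same result, so the "cow" can be replaced by an identical one.
cat > "$ROOT/etc/myapp/app.conf" <<'EOF'
listen_port = 8080
log_level   = info
EOF

echo "server built under $ROOT"
```

Because the script is the single source of truth, rebuilding the server is just running it again.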

The primary requirement for this behavior is the concept of infrastructure as code (IaC). It doesn’t matter whether it’s done with shell scripts, Python, Ansible, Salt, Terraform, etc. With infrastructure as code, it is possible to have different versions of a datacenter. The days when an IT department had to research compatible devices, then buy and configure all the components, are over! Even if a mistake has been made, it’s easy to roll back: one click, and the old version of the infrastructure is redeployed.
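Versioning a datacenter then works exactly like versioning source code. The following sketch simulates it locally in a throwaway git repository (the file name and variable are invented); rolling back is just checking out the previous revision and re-applying it:

```shell
#!/bin/sh
# Sketch: infrastructure versions live in version control, so a rollback
# is simply checking out an older revision. Simulated in a temp repo.
set -eu
REPO=/tmp/iac-versions-demo
rm -rf "$REPO" && mkdir -p "$REPO" && cd "$REPO"
git init -q
git config user.email demo@example.com   # local identity for the sketch
git config user.name  demo

echo "instance_count = 2" > main.tf
git add main.tf && git commit -qm "v1: two instances"

echo "instance_count = 5" > main.tf
git add main.tf && git commit -qm "v2: scale out"

git checkout -q HEAD~1 -- main.tf        # roll back to the previous version
cat main.tf                              # back to two instances
```

In a real setup, re-running the provisioning tool on the checked-out revision redeploys the old infrastructure.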

All of that makes it possible to set up and tear down a testing environment in minutes. Automated test cases can check whether the coded solution satisfies the expectations of an infrastructure change. Well-proven methods from software development, e.g. test-driven development, can be transferred to the buildup of a whole datacenter. Quality management also profits, because if an environment can be torn down easily, resource costs for testing can be minimized. With full automation, the lead time for scheduling infrastructure changes also shrinks significantly. That makes it possible to establish fast feedback loops for changes, and this fast feedback loop is a key feature of agile DevOps.

Consequently, if you set up anything through the cloud UI, you have already lost! Even administrative tasks should be done with tested code. Best of all is to follow the twelve-factor app principles even for your infrastructure.

All infrastructure as code tasks could theoretically be implemented with a shell script. The disadvantage is that you must develop all the structure on your own. In my opinion it is better to use predefined, well-tested building blocks that are available on the internet. And if you’re already at that point, you should also take a look at provisioning tools like Terraform or Ansible.

For the infrastructure itself, I personally prefer Terraform. It has the advantage that the state of my infrastructure is stored within the provisioning tool itself. If a server has to be redeployed, the healthy components won’t be touched; this behavior comes pre-configured! With a shared state file, the status of the installation is available on several computers, which makes it possible to roll out a deployment from different machines.
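A shared state file is usually configured through a remote backend. The sketch below writes a minimal Terraform backend block as a heredoc; the bucket, key and region are invented, and a real setup against an S3-compatible object store (such as Open Telekom Cloud OBS) would typically need additional endpoint settings:

```shell
#!/bin/sh
# Sketch of a Terraform remote-backend configuration (shared state file).
# All names are illustrative; written to /tmp to keep the sketch safe.
set -eu
cat > /tmp/backend-sketch.tf <<'EOF'
terraform {
  backend "s3" {
    bucket = "my-terraform-state"          # shared by every machine that deploys
    key    = "datacenter/terraform.tfstate"
    region = "eu-de"
  }
}
EOF
echo "wrote /tmp/backend-sketch.tf"
```

With such a backend in place, every machine running `terraform apply` sees the same state and only touches the unhealthy components.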

The resources provided by the Terraform providers help you achieve marvelous results in a short time. Terraform can be used within a toolchain such as GitHub Actions. With those features you’ll be able to codify your infrastructure and check it into a repository of a version control system. After check-in, the infrastructure is rolled out fully automatically to the cloud of your choice. Since many provider plug-ins are developed by the cloud vendors themselves, the API coverage is usually very good.

For installation and configuration of a server I prefer Ansible. Ansible provides many modules for installing, configuring and administrating servers. From my perspective, Ansible isn’t as good as Terraform at talking to cloud APIs. Don’t get me wrong, it is possible to make API calls to any cloud, but you have to ensure the calls are idempotent yourself, and for tearing down resources in an environment you are on your own as well.
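Idempotency simply means that running the same operation twice leaves the system in the same state instead of failing or piling up changes. A tiny shell sketch of the difference (the directory name is made up):

```shell
#!/bin/sh
# Idempotency sketch: the second run must not fail or change the result.
# This is the property Ansible modules give you for free.
set -eu
DIR=/tmp/idempotency-demo
rm -rf "$DIR"

mkdir "$DIR"                             # first run: fine
mkdir "$DIR" 2>/dev/null \
  && echo "unexpected: second mkdir succeeded" \
  || echo "plain mkdir is NOT idempotent: the second run fails"

mkdir -p "$DIR"                          # -p makes the operation idempotent:
mkdir -p "$DIR"                          # repeated runs converge to one state
echo "idempotent runs ok"
```

When you script raw cloud API calls yourself, you have to build exactly this “check first, then act” behavior by hand.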

Ansible uses SSH by default to deliver its modules, Python-based scripts that are executed remotely on the target hosts. When Ansible runs from outside your cloud environment, you have to find a way to reach your instances; any method that works with SSH works with Ansible too.

Ansible looks up the instances it runs against in a central inventory file. Besides a static inventory, it is also possible to use a dynamic inventory, which executes a script that downloads and groups the instance definitions from your cloud tenant. That makes it possible, for example, to run an Ansible playbook against all resources carrying the tag “webserver”.
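A dynamic inventory script is, at its core, just a program that prints JSON grouping hosts. The sketch below fakes that output with a heredoc (host names are invented, and the JSON is simplified — a real inventory script would query the cloud API and also emit a `_meta` section with host variables):

```shell
#!/bin/sh
# Sketch of the JSON a dynamic inventory script hands to Ansible:
# instances grouped by tag. A real script would query the cloud API.
set -eu
cat > /tmp/inventory-sketch.json <<'EOF'
{
  "webserver": { "hosts": ["web-01.example.com", "web-02.example.com"] },
  "database":  { "hosts": ["db-01.example.com"] }
}
EOF
cat /tmp/inventory-sketch.json
```

A playbook with `hosts: webserver` would then target exactly the two tagged web instances.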

… and on that farm he had a monkey?!?

A main task of operators is to keep environments secure and stable. Operating systems, middleware and applications must be patched regularly, and most of the time a lot of effort goes into organizing a downtime for that work. There is a method to do this autonomously. To stay with the farm analogy: take your gun, give it to a monkey and let it play in your barn. It will shoot cattle at random. Organized correctly, each shot cow is automatically replaced by a new one delivered from the market.

Netflix does something similar with their production system. They have a tool called “Chaos Monkey” that randomly terminates instances, and you can use the same approach for your own application. If your setup is done correctly, the killed instances are regenerated. This helps make applications tolerant of random instance failures. Another advantage is that it makes the environment renew itself: if your installation script runs an update or pulls a new image version of the operating system or a software package, every new instance comes up with the newest version installed.
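The core of the idea fits into a few lines. This toy sketch only picks a random “victim” and prints it — the instance names are invented, and a real chaos tool would call the cloud provider’s API to actually terminate the instance (`shuf` from GNU coreutils provides the randomness):

```shell
#!/bin/sh
# Toy "chaos monkey": pick one instance at random and pretend to kill it.
# A real tool would call the cloud API instead of just printing.
set -eu
printf '%s\n' web-01 web-02 web-03 > /tmp/instances.txt

VICTIM=$(shuf -n 1 /tmp/instances.txt)   # random pick, GNU coreutils
echo "chaos monkey terminates: $VICTIM" | tee /tmp/chaos-victim.txt
```

Run regularly, this forces every instance through the rebuild path sooner or later, which is exactly what keeps the environment fresh.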

Picture 3: Chaos Monkey Logo by Netflix

Some people might say that this is inefficient, because someone must start the infrastructure as code tool after a server has been terminated. That sounds like scary manual work! But wait, there is a solution for this too: the auto scaling group. This feature keeps the number of your instances within a pre-defined range. If an instance gets terminated, a new one is set up immediately. With an auto scaling group you can also set up or tear down instances based on load measurements, so you’ll always have the right size. So instead of panicking when an instance is down, rely on your auto scaling group, lie back and grab another coffee!
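Under the hood, an auto scaling group is a reconciliation loop: compare the desired count with what is actually running, and top the group back up. The sketch below simulates that locally (instance names are invented; nothing is launched for real):

```shell
#!/bin/sh
# Sketch of the reconciliation idea behind an auto scaling group:
# whatever gets terminated, the loop restores the desired size.
set -eu
DESIRED=3
set -- web-01 web-02                     # one instance was just terminated

while [ "$#" -lt "$DESIRED" ]; do
  NEW="web-0$(( $# + 1 ))"
  echo "launching replacement $NEW"
  set -- "$@" "$NEW"
done
echo "$*" > /tmp/asg-state.txt
echo "group healthy again: $*"
```

The cloud provider runs this loop for you continuously, which is why you can go grab that coffee.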

Picture 4: The relaxed IT person is using auto scaling groups

… and on that farm he had some chickens

The installation of an instance takes some time, and depending on the situation and use case you might not always have that time. For example, imagine you own a web shop and run a Black Friday sale. There is a big run on your shop and it’s running out of capacity. Bringing in additional servers would take a few minutes, but many customers won’t wait and will simply move on to a competitor, so you would lose money. In our factory farm, that’s comparable to egg production with caged chickens.
Because the production value of a chicken is far less than that of a cow, you don’t want to waste time bringing each one from the seller to the coop. You buy them in a cage that can be replaced as a whole, and if your chickens don’t produce enough eggs, you stack more cages on top of each other. Because we are in a factory farm, you of course have several rooms stacked with chicken cages, and a control room.

Picture 5: The stack of chicken cages in a room

A group of chickens living in the same cage is comparable to a set of containers bound to the same network. A container behaves differently from a Linux virtual machine: technically, it is just a set of isolated Linux processes. There is no kernel to boot, so startup takes only milliseconds; if your code is ready immediately, all containers start working very fast.
One of the most common tools for orchestrating containers is Kubernetes (K8s). In analogy to the cages, Kubernetes groups containers into pods and administers them.

In the Open Telekom Cloud, the Cloud Container Engine (CCE) implements a Kubernetes cluster. CCE consists of master nodes (the control room) and worker nodes (the rooms with the cages). The number of worker nodes is defined when the Kubernetes environment is built, and with the help of node pools it can be changed dynamically at runtime. Analogous to an auto scaling group, you can also set up or tear down nodes based on load metrics.

Updating individual containers is even easier than updating Linux instances. When a new code version of a container is rolled out, Kubernetes lets you choose between different strategies for exchanging the instances. You can, for example, exchange all pods at once, or in portions with the option to stop if the previous group has failed. There are many more exchange strategies for different use cases.
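The “exchange in portions” strategy is what Kubernetes calls a rolling update. The sketch below writes a minimal Deployment manifest as a heredoc to keep the example in one self-contained script; the deployment name and image are invented, and you would hand the file to `kubectl apply -f`:

```shell
#!/bin/sh
# Sketch of a Kubernetes RollingUpdate strategy. All names are invented;
# the manifest is written to /tmp so the sketch is safe to run anywhere.
set -eu
cat > /tmp/deployment-sketch.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webshop
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # exchange pods in portions, not all at once
      maxSurge: 1
  selector:
    matchLabels:
      app: webshop
  template:
    metadata:
      labels:
        app: webshop
    spec:
      containers:
      - name: webshop
        image: example/webshop:2.0   # bumping this tag triggers the rollout
EOF
echo "wrote /tmp/deployment-sketch.yaml"
```

Because at most one pod is unavailable at any moment, the shop keeps serving customers throughout the update.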

Implementing the cattle principle brings enormous advantages in operations, quality management, delivery time and, in the end, economics. As a consequence, governing a company’s infrastructure with a set of static rules is no longer sufficient. This principle requires close coordination between the infrastructure and the software architecture, and introducing a DevOps organization will achieve exactly that.

Greetings, Thomas & Pfötchen
