Kubernetes upgrades — the boring way

An upgrade a month keeps the problems far away

Oleksandr Slynko
5 min read · Mar 22, 2020

This article is based on my presentation for K8s day in Amsterdam in 2018.

I have worked on Cloud Foundry Container Runtime for almost two years. The main goal of the project is to provide enterprise-ready Kubernetes. We have been upgrading Kubernetes since the first version; the upgrade pipeline was probably the first one we implemented. As a result, I can honestly say that I have upgraded thousands of clusters.

Upgrades of Kubernetes clusters are not that hard. To be honest, they are quite boring and simple; the only thing required is to follow the process. Unfortunately, most operators do not trust their process and do not upgrade their clusters often enough. That leads to problems during upgrades and slows them down. But let’s get to the point.

How to upgrade a Kubernetes cluster?

First of all, what does a typical cluster consist of? Etcd for data storage, several master nodes with the control plane (kube-apiserver, kube-scheduler, kube-controller-manager) and lots of worker nodes.
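If your installation runs the control plane as pods (some installations, such as those built with kubeadm, run them as static pods; others run the components as plain host processes), a quick way to see them together with the cluster add-ons is:

kubectl get pods --namespace kube-system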

The upgrade is as simple as restarting all the components with new versions of the packages. But before doing the restart you have to take a backup. I have done it with etcdctl snapshot and Velero. Both worked fine; the only important thing is to practice restoring the backup and to verify its consistency.
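As a minimal sketch, an etcd snapshot plus a consistency check could look like this (the endpoint and certificate paths are placeholders for your own setup):

ETCDCTL_API=3 etcdctl --endpoints https://127.0.0.1:2379 \
  --cacert /etc/etcd/ca.crt --cert /etc/etcd/client.crt --key /etc/etcd/client.key \
  snapshot save /var/backups/etcd-backup.db
ETCDCTL_API=3 etcdctl snapshot status /var/backups/etcd-backup.db --write-out table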

And one more thing that you want to do before the upgrade is to verify that the cluster works fine.

First, on some installations, you can verify that the master components are running successfully with the kubectl get componentstatus command. Unfortunately, there is a bug with newer CLI and API server versions that might prevent you from doing that. Also, at some point AKS responded to this request with a failure; I don’t have an AKS cluster to verify whether this is still the case.

The second verification step is to check that all nodes are fine. That is easy to do: just run kubectl get nodes and check that all nodes are Ready.

The third step is to check that all pods are running:

kubectl get pods --all-namespaces --field-selector 'status.phase!=Running,status.phase!=Succeeded'

This command will show all the pods that are not running or, for jobs, have not finished successfully. We haven’t added an automatic check because it is up to the operator to decide whether crashing pods are acceptable.
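Put together, a minimal pre-upgrade check can be a small script along these lines (a sketch only; whether a non-empty pod list should block the upgrade is your call):

#!/bin/sh
set -e
# control plane health (may not work on every installation, see above)
kubectl get componentstatuses || true
# every node should report Ready
kubectl get nodes
# pods that are neither Running nor Succeeded need a human decision
kubectl get pods --all-namespaces --field-selector 'status.phase!=Running,status.phase!=Succeeded'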

After verifying that the cluster is in the upgradable state, we can start the actual upgrade procedure.

Upgrading Etcd

Etcd has quite good documentation on how to upgrade, and Kubernetes also has a documentation page about upgrading etcd.

Etcd and Kubernetes have different upgrade cycles and can be upgraded separately. In my experience, people usually collocate etcd with the master components. The benefit is a reduced footprint; the disadvantage is increased coupling. I will assume that etcd is running separately, but it doesn’t change much.

The way I would do it is to delete the old etcd node and add a new one. There are several possibilities. You can attach the disk from the previous node and make the new node use the same name. This is how Cloud Foundry Container Runtime does it.

Alternatively, you can delete the old etcd node, remove it from the cluster and add a new one, then wait some time until it receives all the data. A long-time Kubernetes user told me that their company does it this way.
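A hedged sketch of that second approach with etcdctl (the member ID, name and peer URL below are placeholders):

# find the ID of the member that is about to be replaced
ETCDCTL_API=3 etcdctl member list
# remove it from the cluster, then destroy the old VM
ETCDCTL_API=3 etcdctl member remove 8e9e05c52164694d
# register the replacement before starting etcd on the new VM
ETCDCTL_API=3 etcdctl member add etcd-2 --peer-urls=https://10.0.0.12:2380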

Upgrading masters

The master nodes usually have three components running: kube-apiserver, kube-scheduler and kube-controller-manager.

The API server is stateless and serves the data from the etcd database. The scheduler and controller manager connect to the API server and do the work. Only one scheduler and one controller manager can be active at any given moment; re-election happens periodically.
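If you are curious which replica currently holds leadership, recent clusters expose it through Lease objects (the lease names below assume the default leader-election settings; older versions recorded the leader in an Endpoints annotation instead):

kubectl get lease kube-scheduler --namespace kube-system --output yaml
kubectl get lease kube-controller-manager --namespace kube-system --output yaml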

To upgrade a master, simply delete the master node and start a master with the new version. You might experience a tiny downtime in the API, and scheduling will be interrupted for several seconds.
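One way to see how long that downtime really is, is to poll the health endpoint in a loop while the master is being recreated (a rough sketch, one request per second):

while true; do
  date
  kubectl get --raw /healthz || echo "API server unavailable"
  sleep 1
done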

It is important to know that during a minor version upgrade the schema might change and some roles might get removed. It is safer to check the release notes and then add the missing pieces separately.

After upgrading the masters you will have a working cluster with a version skew.

Upgrading workers

The next step is to upgrade the workers. It is a little bit harder than the master upgrade because you need your applications to stay running during the process.

It is possible, especially for patch upgrades, to just restart the binaries on the nodes. This keeps your applications running without restarts; it is even possible to restart some container runtimes without the containers going down. This, however, introduces complexity, and I would suggest recreating the node anyway. You can either create a new node in advance or delete the old one first. This will cause pods to be restarted on your nodes. So let’s get prepared.

Firstly, the applications have to be configured properly. There are multiple things to do; for example, you can check the Cloud Foundry guidelines that I wrote. The main idea is that at any given time at least one replica of the application should be ready to serve traffic.
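One concrete way to enforce that (not the only one; the names here are placeholders) is a PodDisruptionBudget, which drain respects:

kubectl create poddisruptionbudget my-app --selector app=my-app --min-available=1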

Before upgrading the node, you first have to drain it. This is the main time-consuming task for upgrades. If your applications are configured incorrectly, your upgrade might be stuck forever (this issue in Cloud Foundry Container Runtime describes the problem). This is the reason why some installations don’t drain at all (I have seen cases where draining the node in advance prevented upgrade issues) or force a timeout during the drain (e.g. GKE has a 1-hour timeout). Drain also cordons the node (prevents it from scheduling more applications). Draining is a concept of kubectl and not part of the Kubernetes API as of now, so to do it programmatically you have to call the kubectl drain command.
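A typical invocation looks roughly like this (the node name is a placeholder, and the flag for cleaning up pods with local data was renamed around v1.20, so check kubectl drain --help on your version):

kubectl drain worker-node-1 --ignore-daemonsets --delete-local-data --timeout=3600s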

Now, there is one tricky step that is very relevant if you use volumes and on-prem solutions. You have to wait for all disks to detach from the virtual machine. When pods with persistent storage are deleted, the disks stay attached to the VM. This is an optimisation for a faster restart, but in the upgrade case the pod has to start on a different VM. When this happens, the cloud provider requests the disk to be detached from the current VM and attached to the new one. This takes time, and if the node is unavailable at that moment, the cloud provider might wait for it. I have experienced a 30-minute wait for one of the volumes, although the waiting usually takes less than a minute.
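You can watch the detach progress through the VolumeAttachment objects (these exist for CSI-provisioned volumes; for in-tree volume plugins you may have to look at the cloud provider console instead):

kubectl get volumeattachments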

Now you can delete the node, as shown below, and repeat the procedure for every worker until the process is finished.
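The per-node finale is then simply (again, the node name is a placeholder; on IaaS-backed installations you also destroy the underlying VM):

kubectl delete node worker-node-1
kubectl get pods --all-namespaces --output wide

The second command lets you confirm that the evicted pods came back on other nodes before you move on.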

As you can see, the process is relatively simple, but it takes time and requires periodic testing. As I mentioned, most Kubernetes providers do that for you, and you just have to relax and repeat the process at least once every three months.
