Response to Building Large Kubernetes Clusters

Oleksandr Slynko
Dec 17, 2019

The latest CNCF mailing list includes an article about large clusters, and there are several points in it that I want to comment on.

I spent almost two years working on a cluster installer: creating clusters, improving their availability, testing scalability, and so on.

First: the strange cluster topology.

More etcd nodes only make the cluster slower. A write succeeds only once a majority of etcd members have persisted it, so the more members there are, the more replication every write involves before the API server gets a successful response. Also, an even number of etcd members is a strange choice: it does not add any availability.
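To make the quorum point concrete, here is a small plain-Python sketch (my own illustration, not from the article) of how member count relates to fault tolerance:

```python
# Quorum and fault tolerance for an etcd cluster of n members.
# quorum = majority of members; fault tolerance = members you can lose
# while still keeping a majority.
def quorum(n):
    return n // 2 + 1

def fault_tolerance(n):
    return n - quorum(n)

for n in range(1, 8):
    print(f"{n} members: quorum={quorum(n)}, can lose {fault_tolerance(n)}")

# 4 members tolerate the same single failure as 3, and 6 tolerate the same
# two failures as 5: an even member count adds write latency, not availability.
```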

The cluster also has 20 master nodes with 20 schedulers and controller managers. But at any given point only a single scheduler and a single controller manager are active, so the remaining 19 schedulers just participate in leader election and do no actual work. It would be better to have two groups: one bigger group with just API servers, and one much smaller group with API servers and schedulers. (The question that pops up in my head is whether all of those API servers were needed at all.)
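To illustrate why the extra schedulers sit idle: leader election boils down to all candidates competing for one lock, and only the holder does any work. This is a conceptual Python sketch of that idea, not the real client-go leaderelection code:

```python
import time

# Minimal sketch of lease-based leader election (conceptual only; the real
# mechanism lives in client-go's leaderelection package).
class Lease:
    def __init__(self):
        self.holder = None
        self.renew_time = 0.0

LEASE_DURATION = 15.0  # seconds, roughly the kube-scheduler default

def try_acquire_or_renew(lease, candidate, now):
    expired = now - lease.renew_time > LEASE_DURATION
    if lease.holder in (None, candidate) or expired:
        lease.holder = candidate
        lease.renew_time = now
        return True
    return False  # someone else holds the lease; stay idle

lease = Lease()
candidates = [f"scheduler-{i}" for i in range(20)]
for tick in range(3):
    now = time.time()
    active = [c for c in candidates if try_acquire_or_renew(lease, c, now)]
    print(f"tick {tick}: active = {active}, idle = {len(candidates) - len(active)}")
    time.sleep(1)
```

Every tick, one scheduler renews the lease and 19 do nothing but check it.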

Cluster creation is important, but creation happens only once. There are more important aspects: operability, usability, durability. They deployed a 1000-VM cluster but ran only a single deployment. They haven’t tried to deploy multiple services, and they haven’t tried to deploy the usual additional components such as FluentBit or Prometheus or Ingress. (I suppose they will deploy additional components eventually, since Kubernetes is just a part of a system.) There was no investigation into DNS, into how pods would be spread, or into the number of namespaces/CRDs and the time it takes to upgrade a pod.
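Checking how pods are spread, for instance, is only a few lines with the official Python client. A sketch, assuming a working kubeconfig against the test cluster; nothing like this appears in the article:

```python
from collections import Counter
from kubernetes import client, config

# Count scheduled pods per node to see how evenly the scheduler spread them.
config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_pod_for_all_namespaces(watch=False).items
per_node = Counter(p.spec.node_name for p in pods if p.spec.node_name)

for node, count in per_node.most_common():
    print(f"{node}: {count} pods")
if per_node:
    print(f"min={min(per_node.values())}, max={max(per_node.values())}")
```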

Then there is day 2. What happens during a deployment upgrade: how long does it take to upgrade several deployments in parallel, are there any obvious problems, what is the load on the API servers, how are pods spread across the nodes, is there a way to autoscale the cluster, and how long do a full backup and a restore take?
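Even a crude measurement of one rolling update would have been informative. A hedged sketch with the Python client; the deployment name, namespace, container name and image below are placeholders I made up, not anything from the article:

```python
import time
from kubernetes import client, config

# Trigger a rolling update by changing the image, then time how long it
# takes until all replicas are updated and ready.
config.load_kube_config()
apps = client.AppsV1Api()
name, namespace = "my-app", "default"  # hypothetical deployment

patch = {"spec": {"template": {"spec": {"containers": [
    {"name": "my-app", "image": "nginx:1.17.6"}]}}}}
apps.patch_namespaced_deployment(name, namespace, patch)

start = time.time()
while True:
    d = apps.read_namespaced_deployment_status(name, namespace)
    s = d.status
    if ((s.updated_replicas or 0) == d.spec.replicas
            and (s.ready_replicas or 0) == d.spec.replicas
            and (s.observed_generation or 0) >= d.metadata.generation):
        break
    time.sleep(2)
print(f"rollout finished in {time.time() - start:.0f}s")
```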

Then there is disaster recovery. How long does it take for the cluster to notice that one of the nodes has gone away? Will the node get recreated? When do the pods get recreated? What happens when one of the AZs goes down? What happens when there is a networking problem between AZs in such a huge cluster?
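The first of those questions is easy to measure: watch node conditions and note the moment a node stops being Ready. A sketch with the Python client, again assuming a working kubeconfig:

```python
import time
from kubernetes import client, config, watch

# Watch node events and log when a node's Ready condition stops being True,
# i.e. when the control plane has "found out" the node went away.
config.load_kube_config()
v1 = client.CoreV1Api()
w = watch.Watch()

for event in w.stream(v1.list_node, timeout_seconds=600):
    node = event["object"]
    ready = next((c for c in node.status.conditions if c.type == "Ready"), None)
    if ready is not None and ready.status != "True":
        print(f"{time.strftime('%H:%M:%S')} {node.metadata.name} "
              f"is {ready.status} (reason: {ready.reason})")
```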

And then my favourite: the upgrade. How it goes depends on the type of installer. For Cloud Foundry Container Runtime, it takes about five minutes to recreate a virtual machine, plus some time to properly drain the node. Draining is unpredictable and depends on the number of pods on the node and on the attached volumes; I have seen it take 10 minutes, and I have seen it take more than an hour. So how long will it take to upgrade a 1000-VM cluster, and how should the upgrade be organized? It depends on how much extra computing power is available. If there is enough spare capacity for, let’s say, 50 VMs at a time, then the upgrade would take from 5 hours to a full day. So there should be a way to partially upgrade the cluster. That is the saddest part: operability was not investigated at all in this expensive and time-consuming experiment.
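The arithmetic behind that estimate, as a plain back-of-the-envelope sketch; the per-node numbers are my CFCR observations from the paragraph above, not measurements from the article:

```python
# Rough upgrade time for a 1000-VM cluster, rolled in batches of 50,
# using the per-node recreate and drain times mentioned above.
nodes = 1000
batch_size = 50                    # spare capacity available at a time
recreate_minutes = 5               # recreate one VM
drain_best, drain_worst = 10, 60   # drain time varies wildly

batches = nodes / batch_size       # 20 batches
best = batches * (recreate_minutes + drain_best) / 60
worst = batches * (recreate_minutes + drain_worst) / 60
print(f"{batches:.0f} batches: ~{best:.0f}h best case, ~{worst:.0f}h worst case")
# => 20 batches: ~5h best case, ~22h worst case (and drains can run even longer)
```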

And this leads me to the main question: is one huge cluster even needed? What is the problem with multiple small clusters? It is much simpler to operate multiple smaller clusters, and it is much easier to upgrade a small cluster. Small clusters are also much better tested and much better documented. But an article about using multiple clusters wouldn’t get discussed.
