Experiences with upgrading using Helm

https://upload.wikimedia.org/wikipedia/commons/thumb/0/07/A_Late_Iron_Age_Helmet_reused_as_a_Cremation_Vessel_%28FindID_526999%29.jpg/800px-A_Late_Iron_Age_Helmet_reused_as_a_Cremation_Vessel_%28FindID_526999%29.jpg

I spent more than two years on projects that deploy Kubernetes (CFCR and PKS). Kubernetes is quite a complicated system, and our focus was on production-ready deployments. One of the key requirements for such deployments is that they are reproducible and updatable.

Three months ago I switched to the Eirini project, which allows scheduling Cloud Foundry applications on top of Kubernetes. We also use the Quarks project, which allows us to deploy the whole of Cloud Foundry on Kubernetes. Cloud Foundry is a very complicated system with many moving parts. The Quarks team chose Helm to deploy it. We decided to add one more dependent Helm chart that is deployed along with it.

Project Eirini was in a bootstrap state for a long time, but recently IBM and SUSE added alpha Eirini support to their installations. It is close to production, so I suggested that we keep a long-lived application deployed on our long-lived acceptance environment. To raise the stakes, it needed to be something that we actually care about. So we deployed Postfacto, the application that we use for our retros.

Initial deployment with Helm was easy. It just worked.

Last week we upgraded the version of Cloud Foundry configured by the Quarks team. The upgrade failed in our CI environment.

ROLLING BACK
Error: timed out waiting for the condition

The sad part was that the upgrade hadn't actually failed. Eventually, all pods started and the system was usable. We checked the documentation and saw the timeout flags that control how long Helm waits for hooks. As a result, we decided to increase the timeout. (Now I see that the job we run after the update should not even be a hook.)

Error: UPGRADE FAILED: watch closed before UntilWithoutRetry timeout

I hate this error with all of my heart. I saw it almost every day in CI when I worked on CFCR, until we stopped using watches and switched to manual rechecks. It has been fixed in Kubernetes (I would like to believe).
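For the curious, "manual rechecks" boils down to a plain polling loop instead of a long-lived watch. Below is a minimal sketch in Python, not the actual CFCR code. The name `wait_until` and all of its parameters are made up for illustration, and `check` stands for whatever query you would run against the cluster.

```python
import time


def wait_until(check, timeout=300.0, interval=5.0,
               clock=time.monotonic, sleep=time.sleep):
    """Poll check() until it returns a truthy value or the timeout expires.

    check    -- hypothetical zero-argument callable, e.g. "are all pods ready?"
    timeout  -- total time budget in seconds
    interval -- pause between two rechecks in seconds
    Returns True if check() succeeded in time, False otherwise.
    """
    deadline = clock() + timeout
    while True:
        # Every iteration asks for the current state from scratch, so a
        # dropped connection only costs one retry instead of killing the wait.
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)
```

It is less elegant than a watch, but there is no stream that can be closed underneath you.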

But maybe it is just a similar error. Anyway, I got used to proper deployments in my CFCR days. If something flakes, retrigger it. So we triggered the upgrade once more and saw this error.

ROLLING BACK
Error: no PodSecurityPolicy with the name "scf-psp-nonprivileged" found

OK, let’s try to recreate this pod security policy. Let’s first check what we have on the cluster. Maybe we can clone an existing policy and then modify it.

Really?

Maybe we can delete it and try again?

Oh. I see the pattern. Five minutes later I have this lazy script.
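The original script is not reproduced here, so what follows is only a guess at the shape of such a loop, written in Python. It assumes the pattern was: Helm complains about one resource, you delete that resource, retry, and Helm complains about the next one. The regular expression is modelled on the error message above; the function names, the `namespace` argument, and the upgrade command you pass in are all hypothetical.

```python
import re
import subprocess

# Modelled on the error quoted above:
#   Error: no PodSecurityPolicy with the name "scf-psp-nonprivileged" found
MISSING = re.compile(r'no (?P<kind>\w+) with the name "(?P<name>[^"]+)" found')


def parse_missing(output):
    """Return (kind, name) of the resource Helm complains about, or None."""
    match = MISSING.search(output)
    return (match.group("kind"), match.group("name")) if match else None


def run(cmd):
    """Run a command and return (exit code, combined stdout and stderr)."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr


def upgrade_until_clean(upgrade_cmd, namespace, run=run, max_rounds=50):
    """Retry upgrade_cmd, deleting whichever resource Helm reports each round.

    Assumption: deleting the reported resource is what unblocks the next try.
    Returns the list of (kind, name) pairs that were deleted.
    """
    deleted = []
    for _ in range(max_rounds):
        code, output = run(upgrade_cmd)
        if code == 0:
            return deleted
        missing = parse_missing(output)
        if missing is None:
            # A different kind of failure: stop and show it to a human.
            raise RuntimeError(output)
        kind, name = missing
        run(["kubectl", "delete", kind, name, "--namespace", namespace])
        deleted.append(missing)
    raise RuntimeError("gave up after %d rounds" % max_rounds)
```

A loop like this deletes live resources without asking, so it is only something to point at a throwaway CI environment.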

But now it complains about the actual workload.

But guess what? It is there! It seems that Helm can’t upgrade and tries to roll back.

This annoyed us so much that we decided to take a break. Fifteen minutes later, when we came back, we decided to google the error. We saw that a similar issue had been fixed in the next Helm release, 2.14, so we tried upgrading to it.

Whatever, let's try it. What could go wrong?

Oh, this is very wrong. There was a tiny issue in the YAML, because formatting multi-document YAML is hard. I totally get the issue, and I fixed it immediately. But there was no way to repair the existing release. I tried to manually modify the YAML spec and update the dependency, but it didn’t work. I tried to manually modify the existing deployment and remove the flag, but it didn’t help. A PR with the fix has been merged into Helm.

What is even worse, the behaviour changed between 2.13 and 2.14. Now a force upgrade deletes all the pods before the upgrade, so our application was down, and the system still could not be updated. Redeploying the application takes only several minutes, so we decided to reinstall the whole system. Unfortunately, the initial deployment takes too long and fails.

And the next upgrade fails as well.

Error: a release named scf is in use, cannot re-use a name that is still in use

We spent several more hours fixing this issue and finished just in time for the team retrospective, where I got the action item to write an article about our Helm adventures.

Reading code for a long time, writing code for even longer.