I spent more than two years on projects that deploy Kubernetes (CFCR and PKS). Kubernetes is quite a complicated system, and our focus was production-ready deployment. One of the key attributes of such deployments is that they are reproducible and updatable.
Three months ago I switched to the Eirini project, which allows scheduling Cloud Foundry applications on top of Kubernetes. We also use the Quarks project, which allows us to deploy the whole of Cloud Foundry on Kubernetes. Cloud Foundry is a very complicated system with multiple moving parts. The Quarks team chose to use Helm to deploy it. We decided to add one more dependent Helm chart to be deployed with it.
Project Eirini was in a bootstrap state for a long time, but recently IBM and SUSE added alpha Eirini support to their installations. It is close to production, and I suggested that we should have a long-lived application deployed on our long-lived acceptance environment. To raise the stakes, it had to be something that we actually care about. So we deployed Postfacto, the application that we use for retros.
Initial deployment with Helm was easy. It just worked.
Last week we upgraded the version of Cloud Foundry configured by the Quarks team. It failed in our CI environment.
UPGRADE FAILED
ROLLING BACK
Error: timed out waiting for the condition
The sad part was that the upgrade hadn’t actually failed. Eventually, all pods started and the system was usable. We checked the documentation and saw timeout flags that control how long Helm waits for the hooks. As a result, we decided to increase the timeout. (Now I see that the job that we run after the update should not even be a hook.)
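For reference, raising the timeout looks roughly like the sketch below. It assumes Helm 2, where `--timeout` is given in seconds (default 300). The release name `scf` is the one from this article; the chart path and the value 1800 are placeholders, not what we actually used. The script only prints the command unless `HELM=helm` is set.

```shell
#!/bin/sh
# Dry run by default: prints the command instead of running it.
# Set HELM=helm to run it for real.
HELM="${HELM:-echo helm}"

# "scf" is the release name from this article; the chart path is hypothetical.
RELEASE="scf"
CHART="./path/to/chart"

# Helm 2 takes --timeout in seconds (default 300).
upgrade_with_timeout() {
  $HELM upgrade "$RELEASE" "$CHART" --timeout "$1"
}

# 1800 seconds is an arbitrary example value.
upgrade_with_timeout 1800
```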
Error: watch closed before UntilWithoutRetry timeout
Error: UPGRADE FAILED: watch closed before UntilWithoutRetry timeout
I hate this error with all my heart. I saw it almost every day in CI when I worked on CFCR, until we stopped using watches and moved to manual rechecks. It has been fixed in Kubernetes (I would like to believe).
But maybe it is just a similar error. Anyway, I got used to proper deployments in my CFCR days. If something flakes, retry it. So we triggered the upgrade once more and saw this error.
UPGRADE FAILED
ROLLING BACK
Error: no PodSecurityPolicy with the name "scf-psp-nonprivileged" found
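As an aside, the "retry it when it flakes" habit can be sketched as a small shell helper. This is only an illustration, not the script we ran in CI: the `retry` function and the `RETRY_DELAY` variable are made up for this example.

```shell
#!/bin/sh
# retry MAX_ATTEMPTS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# RETRY_DELAY (hypothetical knob) is the pause between attempts in seconds.
retry() {
  retry_max="$1"
  shift
  retry_attempt=1
  while ! "$@"; do
    if [ "$retry_attempt" -ge "$retry_max" ]; then
      echo "giving up after $retry_attempt attempts: $*" >&2
      return 1
    fi
    retry_attempt=$((retry_attempt + 1))
    sleep "${RETRY_DELAY:-5}"
  done
}

# Example (the chart path is hypothetical):
#   retry 3 helm upgrade scf ./path/to/chart
```

Of course, as the rest of this story shows, retrying only helps when the failure really is a flake.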
OK, let’s try to recreate this pod security policy. Let’s first check what we have on the cluster. Maybe we can clone some existing policy and then modify it.
kubectl get psp
NAME PRIV CAPS SELINUX RUNASUSER FSGROUP SUPGROUP READONLYROOTFS VOLUMES
ibm-anyuid-hostaccess-psp false SETPCAP,AUDIT_WRITE,CHOWN,NET_RAW,DAC_OVERRIDE,FOWNER,FSETID,KILL,SETUID,SETGID,NET_BIND_SERVICE,SYS_CHROOT,SETFCAP RunAsAny RunAsAny RunAsAny RunAsAny false *
scf-psp-nonprivileged false RunAsAny RunAsAny RunAsAny RunAsAny false configMap,secret,emptyDir,downwardAPI,projected,persistentVolumeClaim,nfs
scf-psp-privileged true * RunAsAny RunAsAny RunAsAny RunAsAny false configMap,secret,emptyDir,downwardAPI,projected,persistentVolumeClaim,nfs
scf-psp-withsysresource false SYS_RESOURCE RunAsAny RunAsAny RunAsAny RunAsAny false configMap,secret,emptyDir,downwardAPI,projected,persistentVolumeClaim,nfs
Maybe we can delete it and try again?
Error: no PodSecurityPolicy with the name "scf-psp-privileged" found
Oh. I see the pattern. Five minutes later I have this lazy script.
kubectl delete secrets -n scf deployment-manifest
kubectl delete psp scf-psp-withsysresource
kubectl delete psp scf-psp-nonprivileged
kubectl delete psp scf-psp-privileged
kubectl delete -n scf serviceaccounts --all
kubectl delete clusterroles scf-cluster-role-node-reader-role
kubectl delete clusterroles scf-cluster-role-nonprivileged
kubectl delete clusterroles scf-cluster-role-privileged
kubectl delete clusterroles scf-cluster-role-withsysresource
kubectl delete clusterrolebindings scf-default-nonprivileged-cluster-binding
kubectl delete clusterrolebindings scf-privileged-privileged-cluster-binding
kubectl delete clusterrolebindings scf-default-privileged-nonprivileged-cluster-binding
kubectl delete clusterrolebindings scf-secret-generator-nonprivileged-cluster-binding
kubectl delete clusterrolebindings scf-default-privileged-privileged-cluster-binding
kubectl delete clusterrolebindings scf-withsysresource-privileged-privileged-cluster-binding
kubectl delete clusterrolebindings scf-garden-runc-node-reader-role-cluster-binding
kubectl delete clusterrolebindings scf-withsysresource-privileged-withsysresource-cluster-binding
kubectl delete clusterrolebindings scf-garden-runc-privileged-cluster-binding
kubectl delete clusterrolebindings scf-withsysresource-withsysresource-cluster-binding
kubectl -n scf delete rolebinding node-reader-configgin-role-binding
kubectl -n scf delete rolebinding default-privileged-configgin-role-binding
kubectl -n scf delete rolebinding privileged-configgin-role-binding
kubectl -n scf delete rolebinding withsysresource-configgin-role-binding
kubectl -n scf delete rolebinding garden-runc-configgin-role-binding
kubectl -n scf delete rolebinding secret-generator-configgin-role-binding
kubectl -n scf delete rolebinding withsysresource-privileged-configgin-role-binding
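A slightly less lazy version of part of the script above could use loops. The sketch below covers only the pod security policies, the cluster roles, and the namespaced role bindings, with the names taken from the script above. It prints the commands by default; the `KUBECTL` variable is an invention of this example, and setting `KUBECTL=kubectl` makes it actually delete things.

```shell
#!/bin/sh
# Dry run by default: prints the commands instead of running them.
# Set KUBECTL=kubectl to actually delete the resources.
KUBECTL="${KUBECTL:-echo kubectl}"

cleanup_scf_rbac() {
  # The three pod security policies.
  for suffix in withsysresource nonprivileged privileged; do
    $KUBECTL delete psp "scf-psp-$suffix"
  done
  # The four cluster roles.
  for role in node-reader-role nonprivileged privileged withsysresource; do
    $KUBECTL delete clusterroles "scf-cluster-role-$role"
  done
  # The seven configgin role bindings in the scf namespace.
  for prefix in node-reader default-privileged privileged withsysresource \
                garden-runc secret-generator withsysresource-privileged; do
    $KUBECTL -n scf delete rolebinding "$prefix-configgin-role-binding"
  done
}

cleanup_scf_rbac
```

The cluster role bindings do not follow a single naming pattern, so they are left as an explicit list.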
But now it complains about the actual workload.
Error: no StatefulSet with the name "cf-usb-group" found
But guess what? It is there! It seems like Helm can’t upgrade and tries to roll back.
This annoyed us so much that we decided to take a break. Fifteen minutes later, when we came back, we decided to google the error. We saw that a similar issue had been fixed in the next release of Helm, 2.14, so we tried to upgrade to it.
Whatever, let's try this. What could go wrong?
Error: failed decoding reader into objects: error validating "": error validating data: ValidationError(Deployment.spec.template.spec.initContainers): unknown field "restartPolicy" in io.k8s.api.core.v1.Container
Oh, this is very wrong. There was a tiny issue in the YAML, because formatting multi-document YAML is hard. I totally get the issue, and I fixed it immediately. But there was no way to apply the fix to the running system. I tried to manually modify the YAML spec and update the dependency, but it didn’t work. I tried to manually modify the existing deployment and remove the flag, but it didn’t help. A PR with the fix has been merged to Helm.
What is even worse, the behaviour changed between 2.13 and 2.14. Now a forced upgrade deletes all the pods before the upgrade, so our application was down and the system was still not updatable. Redeploying the application takes only several minutes, so we decided to reinstall the whole system. Unfortunately, the initial deployment takes too much time and fails.
And the next upgrade fails as well.
UPGRADE FAILED
Error: a release named scf is in use, cannot re-use a name that is still in use
We spent several more hours fixing this issue and finished just in time for the team retrospective, where I got the action item to write an article about our Helm adventures.