Kubernetes can often be a complex beast, with behaviors that are difficult to fully understand. This post is about one of those behaviors.
The story started when I was deploying Concourse (a very interesting CI/CD tool, with a lot of pros and cons that will be the subject of a later post; check it out if you haven’t) on Kubernetes using Helm, with a Postgres chart bundled alongside the deployment.
Concourse initially went up just fine and dandy. A couple of days later, the Concourse web pod crashed, with the culprit seemingly being the Postgres pod, which for some reason was stuck in Pending status. Both the Concourse and Postgres pods use a PersistentVolumeClaim to acquire a volume from Kubernetes and GCP. At first I suspected that this might be the cause, because we had run into similar problems before with attaching and detaching volumes, but that did not seem to be the case here. The pod was healthy and the logs were clean.
After shelling into the Postgres pod with kubectl exec, everything looked just fine. It responded to requests over a port-forward, and the command issued by the liveness/readiness probes worked perfectly when run by hand. Still, Kubernetes refused to mark the pod as Ready. Something weird was going on.
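For this kind of investigation, the pod’s conditions, its events, and the state of its PersistentVolumeClaim usually tell you more than the container logs. Below is a minimal sketch using the official Python kubernetes client (the namespace and resource names are hypothetical placeholders, not the ones from our cluster); kubectl describe pod and kubectl get events surface the same information.

```python
# Minimal sketch: inspect why a pod is stuck, using the official Python
# "kubernetes" client. The namespace and resource names are hypothetical.
from kubernetes import client, config

config.load_kube_config()  # uses your local kubeconfig
v1 = client.CoreV1Api()

namespace, pod_name = "concourse", "concourse-postgresql-0"

pod = v1.read_namespaced_pod(name=pod_name, namespace=namespace)
print("phase:", pod.status.phase)
for cond in pod.status.conditions or []:
    # The "Ready" condition and its reason/message usually explain
    # why Kubernetes refuses to mark the pod as Ready.
    print(cond.type, cond.status, cond.reason, cond.message)

# Scheduling and volume-attach failures show up as events, not in pod logs.
events = v1.list_namespaced_event(
    namespace=namespace,
    field_selector=f"involvedObject.name={pod_name}",
)
for ev in events.items:
    print(ev.reason, ev.message)

# Check whether the PersistentVolumeClaim actually got bound to a volume.
for pvc in v1.list_namespaced_persistent_volume_claim(namespace).items:
    print(pvc.metadata.name, pvc.status.phase, pvc.spec.volume_name)
```

Events in particular are where scheduling failures and volume attach/detach errors show up, and those never appear in the pod’s own logs.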
I was hesitant to kill the pod. The last time something like this happened, I force-killed the pod and all hell broke loose: the GCP disk got stuck in an attached state and wouldn’t detach for the life of me. I eventually recreated the Concourse deployment and hoped nothing like it would happen again. Until today.
Some side story first: for some reason, we had been hesitant to upgrade our Kubernetes nodes to the latest version, and our cluster, which previously had 12 machines running, was now down to 11 and was complaining about an unsupported node version.
We dismissed the warning and figured we would migrate to the new version when we had the chance. The version we were running was no longer supported by Google Kubernetes Engine, but upgrading didn’t feel like a big deal, so we moved it down the priority list.
I had a hunch that something was going on with the nodes, but wasn’t entirely sure what. Since we couldn’t get the 12th machine to boot up properly, I figured it would be fine to resize the cluster down to the 11 machines we already had, hoping something might change.
Lo and behold! Everything was stable again.
From what I could understand, the Postgres pod crashed, Kubernetes tried to reschedule it, the cluster didn’t have enough resources, and GCP triggered an autoscale to add a new node.
GCP, on the other hand, was no longer able to provision a machine with that particular node version, so it provisioned a more recent one instead, which still accepted some requests from the Kubernetes master but was unable to properly join the cluster and register its liveness/readiness probes.
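If you suspect a similar version mismatch, listing each node’s kubelet version next to its Ready condition makes the odd one out easy to spot. Here is a rough sketch with the same Python client, roughly equivalent to what kubectl get nodes shows:

```python
# Rough sketch: list each node's kubelet version alongside its Ready
# condition, to spot a freshly autoscaled node that never joined properly.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    ready = next(
        (c.status for c in node.status.conditions if c.type == "Ready"),
        "Unknown",
    )
    print(node.metadata.name, node.status.node_info.kubelet_version, ready)
```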
I’m hoping to eventually figure out the exact cause of this crash. I’m not sure if this is intended behavior, or even if it’s documented anywhere, but it was definitely interesting to debug (albeit slightly frustrating, since in the end the “solution” seemed odd).

Conclusion
Kubernetes and GCP have a lot of moving parts, and sometimes it’s useful to step back and look at the system as a whole to understand what the hell is going on. Always upgrade your Kubernetes version (responsibly, of course) as often as you can.
Update: after properly upgrading the nodes and the master to the latest version (1.10.2 as of today), everything ran smoothly and we got our 12 machines back.