If you make a mistake and do not correct it, this is called a mistake. [Confucius]
Today we faced a problem with our Docker Swarm, caused by a permanently restarting service. The service in question was Prometheus, which we use to monitor our Docker environment.
The story starts in the middle of last week, when the new Prometheus (version 2) was released. In our docker-swarm.yml, which we use for the Prometheus service, we stupidly still used prometheus:latest. Did you notice the latest? We had been warned (at DockerCon) not to use it. Yes, there are a lot of examples on the internet which do exactly this, but it is a very bad idea. latest literally means unknown, because you will not know which image the latest tag references. latest is only a convention, not a guarantee! Therefore, pin the exact version of the image you really want to use, e.g. prometheus:1.1.1.
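To catch this kind of slip before a deployment, a small check over the stack file can help. The following is only a minimal sketch: the function names and the sample stack file are made up for illustration, the check looks at plain `image:` lines with a regular expression instead of parsing the YAML properly, and the registry host in the sample is hypothetical.

```python
import re
from typing import List

# Matches lines such as "    image: prometheus:latest" in a stack file.
# Assumption: one image reference per line, optionally quoted.
IMAGE_LINE = re.compile(r"^\s*image:\s*['\"]?([^'\"\s#]+)")


def is_pinned(image: str) -> bool:
    """Return True if the reference has a digest or an explicit tag other than 'latest'."""
    if "@" in image:
        # Digest reference (name@sha256:...) always points at one exact image.
        return True
    # Only look at the last path segment, so a registry port
    # (host:5000/name) is not mistaken for a tag.
    last_segment = image.rsplit("/", 1)[-1]
    if ":" not in last_segment:
        # No tag at all: Docker falls back to "latest".
        return False
    tag = last_segment.rsplit(":", 1)[1]
    return tag not in ("", "latest")


def unpinned_images(stack_text: str) -> List[str]:
    """Collect every image reference in the text that is not pinned."""
    found = []
    for line in stack_text.splitlines():
        match = IMAGE_LINE.match(line)
        if match and not is_pinned(match.group(1)):
            found.append(match.group(1))
    return found


# Hypothetical stack file content, for illustration only.
SAMPLE = """\
services:
  monitoring:
    image: prometheus:latest
  pinned:
    image: prometheus:1.1.1
  untagged:
    image: registry.example.com:5000/prometheus
"""

if __name__ == "__main__":
    for image in unpinned_images(SAMPLE):
        print("unpinned image:", image)
```

Run against the sample, this flags prometheus:latest and the untagged reference, and leaves prometheus:1.1.1 alone. A check like this could run in a CI step, but it is no replacement for actually reviewing which version you deploy.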
In our case, an unplanned service update freshly pulled the Prometheus image (you know, latest), and the new version corrupted the Prometheus database. Furthermore, the Prometheus configuration changed between the versions, which in turn caused a permanent restart of the service. That happened on the weekend, which wouldn’t have been bad in itself, but it put the container engine under stress.
This is documented in this GitHub issue. The result of this bug is that the syslog gets spammed with a lot of pointless messages, which will fill up your log partition after some time (maybe hours, but it will fill up).
At this point it gets icky. In a default Ubuntu setup, for example, the /var partition contains the log directory and, of course, the lib/docker directory too. If the /var partition is full, Docker can no longer write its Raft data, and your Docker Swarm will be nearly dead.
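A full partition like this is easy to detect early. As a rough sketch (not what we ran at the time), a periodic check could look like the following; the 90% threshold and the function names are assumptions chosen for illustration, and `shutil.disk_usage` is from the Python standard library.

```python
import shutil


def used_fraction(total: int, used: int) -> float:
    """Fraction of the partition that is in use (0.0 for an empty or unknown size)."""
    return used / total if total > 0 else 0.0


def is_nearly_full(path: str, threshold: float = 0.9) -> bool:
    """Return True if the filesystem holding `path` is at or above the threshold.

    The default threshold of 0.9 is an arbitrary example value.
    """
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return used_fraction(usage.total, usage.used) >= threshold


if __name__ == "__main__":
    # /var holds both the logs and lib/docker in the setup described above.
    if is_nearly_full("/var"):
        print("WARNING: /var is nearly full")
```

Hooked into cron or your monitoring, a warning like this leaves time to clean up the logs before the Swarm managers are affected.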
In our case we also had a configuration mistake: we used four Docker Swarm manager nodes, not three and not five. Four managers need three of them for a majority, so they tolerate only one failure, exactly like three managers. Now we come to the ugly level. Bad luck: the full /var partition killed two of our Docker Swarm managers. The containers continued to work, but the cluster orchestration was messed up, because two out of four nodes were dead. No quorum anymore, no consensus, no manageability.
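The arithmetic behind the lost quorum is simple: Raft needs a strict majority of the managers to agree. A minimal sketch (the function names are ours, purely for illustration):

```python
def quorum(managers: int) -> int:
    """Smallest strict majority of the manager set."""
    return managers // 2 + 1


def tolerated_failures(managers: int) -> int:
    """How many managers can fail while a majority is still alive."""
    return managers - quorum(managers)


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} managers: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
    # 3 managers: quorum 2, tolerates 1 failure(s)
    # 4 managers: quorum 3, tolerates 1 failure(s)
    # 5 managers: quorum 3, tolerates 2 failure(s)
```

So the fourth manager bought us nothing: with two of four managers down, only two were left where three were needed.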
But don't panic, there are ways to bring all services back online with some Linux voodoo (truncating syslog files, …). To sum it up, what are the lessons learned?
That's all for today.
-M
(Image by Walter Grassroot Wikipedia)