Reviving a near-death Docker Swarm cluster

or why a storage update can hurt your cluster badly.

Today, shortly before the end of our working day, one of our Docker Swarm clusters, the test environment cluster, nearly died. This wasn't a Docker Swarm fault, but a coincidence of several different causes. We would like to share how we handle such situations. So here comes the timeline!


A developer contacted us because he had a problem deploying his application in our Docker Swarm test cluster. The application container hadn't started correctly and he didn't know why.


So we looked at the service the developer mentioned with the docker service ps command to get some additional information.

ID             NAME            IMAGE   NODE                        DESIRED STATE   CURRENT STATE            ERROR                         PORTS
0cnafdmxdrvf   wmc_wmc.1       ...     xyz123                      Running         Assigned 8 minutes ago
qd2muj3pdy1d    \_ wmc_wmc.1   ...     abc123                      Remove          Running ... ago
me85ue4xii3f    \_ wmc_wmc.1   ...     3pp5zmsfe3jz2n5o54azylgtf   Shutdown        ... ago                  "task: non-zero exit (143)"
ssbthaef0093    \_ wmc_wmc.1   ...     smqghxgmbkyxi5dn9odd9r39v   Shutdown        ... ago

The service task hung on remove! That's never a good sign, as removing a container should be done pretty fast. If something like this happens, strange things are going on.
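
A check like this can also be scripted. The following is a minimal sketch, assuming the Docker CLI is available on a manager node and that `docker service ps --format '{{json .}}'` prints one JSON object per task with the keys `ID`, `DesiredState` and `CurrentState`. The helper names and the sample input are ours for illustration; the sample is modelled on the listing above, with elided values left elided.

```python
import json
import subprocess


def parse_tasks(json_lines):
    """Parse one-JSON-object-per-line output into a list of dicts."""
    return [json.loads(line) for line in json_lines.splitlines() if line.strip()]


def stuck_removals(tasks):
    """Tasks that Swarm wants removed but that still report a running container."""
    return [
        t for t in tasks
        if t.get("DesiredState") == "Remove"
        and t.get("CurrentState", "").startswith("Running")
    ]


def fetch_tasks(service):
    """Ask the local Docker CLI for all tasks of a service (run on a manager)."""
    result = subprocess.run(
        ["docker", "service", "ps", "--no-trunc", "--format", "{{json .}}", service],
        check=True, capture_output=True, text=True,
    )
    return parse_tasks(result.stdout)


# Illustrative input only, not captured output.
SAMPLE = "\n".join([
    '{"ID": "0cnafdmxdrvf", "Node": "xyz123", "DesiredState": "Running", "CurrentState": "Assigned 8 minutes ago"}',
    '{"ID": "qd2muj3pdy1d", "Node": "abc123", "DesiredState": "Remove", "CurrentState": "Running ... ago"}',
    '{"ID": "me85ue4xii3f", "Node": "3pp5zmsfe3jz2n5o54azylgtf", "DesiredState": "Shutdown", "CurrentState": "... ago"}',
])

if __name__ == "__main__":
    for task in stuck_removals(parse_tasks(SAMPLE)):
        print(task["ID"], "on", task["Node"], "is stuck in remove")
```

On a real manager, `stuck_removals(fetch_tasks("wmc_wmc"))` would run the same filter against live data.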


A look into journalctl --since "2 hours ago" -u docker.service showed that around 16:22 the Docker Swarm raft had broken. Now the question was: why? In the logs we saw that a Docker Swarm deployment was running at that point in time, which is OK, since we use GitLab as our GitOps/CI tool.


On the host where the remove hadn't finished, we found additional information in the journalctl --since "2 hours ago" -u docker.service log.

Jun 03 16:22:14 abc123 dockerd[1400]: time="2019-06-03T16:22:14.014844062+02:00" level=error msg="heartbeat to manager {} failed" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" method="(*session).heartbeat" module=node/agent sessionID=cgpntgnefhtqqj5f5yrci4023

This log message says that the Docker host couldn't reach the Docker Swarm cluster manager, which is not good, but it pointed us to the next system: DNS.
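
When there are many such lines, it helps to split them into fields before searching. The following is a rough sketch, assuming the dockerd messages keep the key=value layout shown above. The regular expression and the function names are our own illustration, not part of Docker.

```python
import re

# key=value pairs where the value is either a quoted string or a bare token
PAIR = re.compile(r'(\w+)=("(?:[^"\\]|\\.)*"|\S+)')


def parse_fields(line):
    """Return the key=value fields of one dockerd log line as a dict."""
    fields = {}
    for key, value in PAIR.findall(line):
        if value.startswith('"'):
            value = value[1:-1]  # drop the surrounding quotes
        fields[key] = value
    return fields


def is_heartbeat_failure(fields):
    """True for error-level messages about a failed manager heartbeat."""
    return (fields.get("level") == "error"
            and "heartbeat to manager" in fields.get("msg", ""))


LINE = ('Jun 03 16:22:14 abc123 dockerd[1400]: '
        'time="2019-06-03T16:22:14.014844062+02:00" level=error '
        'msg="heartbeat to manager {} failed" '
        'error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" '
        'method="(*session).heartbeat" module=node/agent '
        'sessionID=cgpntgnefhtqqj5f5yrci4023')

if __name__ == "__main__":
    fields = parse_fields(LINE)
    if is_heartbeat_failure(fields):
        print(fields["time"], fields["error"])
```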


A look at the logs of our CoreDNS showed many messages like the following.

2019-06-03T16:22:18.168+02:00 [ERROR] plugin/errors: 0 NS: context deadline exceeded
2019-06-03T16:22:23.347+02:00 [ERROR] plugin/errors: 0 A: context deadline exceeded
2019-06-03T16:22:23.410+02:00 [ERROR] plugin/errors: 0 A: context deadline exceeded 

So DNS wasn't working correctly. The database behind our CoreDNS is our etcd cluster…
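
To get a feeling for how widespread such errors are, the messages can be counted per query type. This is only a sketch, assuming the log lines have exactly the shape shown above. The pattern and the function name are hypothetical helpers of ours.

```python
import re
from collections import Counter

# Matches e.g. "[ERROR] plugin/errors: 0 NS: context deadline exceeded"
ERROR_LINE = re.compile(r"\[ERROR\] plugin/errors: \d+ (\w+): (.+)")


def count_errors_by_type(log_text):
    """Count error lines per DNS query type (A, NS, ...)."""
    counts = Counter()
    for line in log_text.splitlines():
        match = ERROR_LINE.search(line.strip())
        if match:
            counts[match.group(1)] += 1
    return counts


LOG = """\
2019-06-03T16:22:18.168+02:00 [ERROR] plugin/errors: 0 NS: context deadline exceeded
2019-06-03T16:22:23.347+02:00 [ERROR] plugin/errors: 0 A: context deadline exceeded
2019-06-03T16:22:23.410+02:00 [ERROR] plugin/errors: 0 A: context deadline exceeded
"""

if __name__ == "__main__":
    for qtype, n in count_errors_by_type(LOG).most_common():
        print(qtype, n)
```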


The etcd cluster logs showed the following.

2019-06-03 16:22:08.156352 W | etcdserver: read-only range request "key:\"/dns-internal/com/example/sm/\" range_end:\"/dns-internal/com/example/sm0\" " took too long (3.243964451s) to execute
2019-06-03 16:22:17.059472 E | etcdserver/api/v2http: got unexpected response error (etcdserver: request timed out)

Our etcd cluster couldn't read its data from the disk/storage in time. A quick phone call to our storage colleagues told us that they were carrying out planned firmware upgrades, for which the storage controllers had to perform planned “failovers”.
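
Slow requests like the one above can be pulled out of the etcd log with a few lines of code. The following is a minimal sketch, assuming the duration appears in parentheses after "took too long" and is given in seconds or milliseconds. Other duration formats are not handled here, and the names and the threshold are our own.

```python
import re

# Duration in parentheses, e.g. "(3.243964451s)" or "(150.5ms)"
SLOW = re.compile(r"took too long \(([\d.]+)(ms|s)\)")


def slow_request_seconds(line):
    """Return the reported duration in seconds, or None if the line has none."""
    match = SLOW.search(line)
    if not match:
        return None
    value = float(match.group(1))
    return value / 1000.0 if match.group(2) == "ms" else value


LINE = (r'2019-06-03 16:22:08.156352 W | etcdserver: read-only range request '
        r'"key:\"/dns-internal/com/example/sm/\" '
        r'range_end:\"/dns-internal/com/example/sm0\" " '
        r'took too long (3.243964451s) to execute')

if __name__ == "__main__":
    seconds = slow_request_seconds(LINE)
    # 1.0 is an arbitrary illustrative threshold
    if seconds is not None and seconds > 1.0:
        print("slow etcd read: %.2f s" % seconds)
```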


We use a DNS name in our Ansible scripts to join the Docker worker hosts to the Docker manager hosts, and this DNS information is stored somewhere in the Docker raft database. We also don't use a DNS caching mechanism in our CoreDNS installation, which can cause really bad outages in a situation like this, because the DNS name isn't resolvable.

Our Docker Swarm test cluster was in an inconsistent state at this point, and we had to restart our 3 managers one after another so that they were able to rebuild/validate themselves with the cluster information of the running stacks/services/tasks. With Docker 18.09.3 this works pretty well, and we had control of our cluster back in less than half an hour (including the analysis). No running services were affected except the one with the deploy problem at the beginning.
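
To illustrate why a missing DNS cache hurts in such a moment, here is a small sketch of the general idea: remember the last address that resolved successfully and fall back to it when the lookup fails. This is not what our CoreDNS or Ansible setup does. The function, the host name and the address are made up for the example.

```python
import socket


def resolve_with_fallback(hostname, cache, resolver=socket.gethostbyname):
    """Resolve hostname; on lookup failure return the last known address."""
    try:
        address = resolver(hostname)
    except OSError:  # socket.gaierror is a subclass of OSError
        if hostname in cache:
            return cache[hostname]
        raise
    cache[hostname] = address
    return address


if __name__ == "__main__":
    known = {}
    try:
        # Hypothetical manager name; replace it with a real one to try it out.
        print(resolve_with_fallback("swarm-manager.example.com", known))
    except OSError as err:
        print("lookup failed and nothing cached:", err)
```

A stale address is of course only useful while the managers keep their IPs, so this is a trade-off rather than a fix.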

Posted on: Mon, 03 Jun 2019 15:39:06 +0200 by Mario Kleinsasser, Bernhard Rausch