This is a follow up post to Reviving a near-death Docker Swarm cluster where we showed, that a Docker Swarm cluster can be hurt badly if DNS does not work (because of a storage hiccup). Therefore it was obvious that we had to enable caching in our coreDNS servers.

A short recap about the situation: We use coreDNS with ETCD as storage backend for the DNS records. This is a common use case - it is the same as in Kubernetes. We use the same concept as Kubernetes does but for slightly different purposes because the ETCD has an easy to implement API. We started to use coreDNS way before it came to Kubernetes as a DNS service. We also helped to implement the APEX records and we also did some bug triage in the past.

Enabling caching in coreDNS is simple, just add the cache statement to the Corefile as documented in the plugin.

The problem

So yes, we enabled caching and some minutes later our monitoring system showed several systems which were not able to do the Puppet agent run anymore. This happened on Tuesday afternoon around 3pm. After the monitoring alerted the problem, we already guessed, that it has something to do with the shortly before enabled cache in our coreDNS instances. A rollback would be possible without any problems, because we use the coreDNS inside containers, their image is build via GitLab CI/CD and the Docker run is issued by Puppet on the given hosts. So a rollback is pretty easy! But we didn’t roll back because only some of our hosts had an DNS resolve error, the rest (hundreds) were running fine!

Analyzing the problem

We suspended the rollback to a previous Corefile (coreDNS config) and took a closer look at the affected hosts. Shortly after we knew that only older Linux OS’s were hit by this problem, Bernhard started to search on the internet about this specific problem because we also got the following log output from a rsync run (and a similar one from the Puppet agent):

rsync: getaddrinfo: rsync.example.com 873: No address associated with hostname
rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.9]

We found two GitHub entries, this one in coreDNS / this one in Kubernetes. And in addition a Stack Overflow post article.

Ok, there was something strange going on with some old clients. We decided to share our information inside the issues above. You can read the issues and the pull requests if you like to read the full details. In short, I mailed with Miek Gieben who is one of the coreDNS maintainers privately after we chatted on GitHub to share some tcpdumps with him. DNS is really something you won’t mess around that deep. It’s ugly and I am feeling great respect for those who are working in this field, like Miek does! Kudos to all of you!

The result

After chatting via e-mail we shortly came to the point, that the switching of the authoritative/non-authoritative flag - that is one(!) single bit in the header of the UDP datagram of a DNS query response - confuses older clients, because at first they get an authoritative answer and on the next query (within the TTL of the record) they get a non-authoritative answer. Some older DNS client code is screwed up at this point.

Miek provided a PR for this, I opened up an issue and on Thursday morning I did a manual build of the coreDNS including this PR and everything worked fine. As mitigation in between, Bernhard rolled out a hosts entry for our Puppet master domain on all affected hosts! Thanks! But some hosts with quite old software were still affected. Therefore this PR works much better.

Thank you!

We would like to say “Thank you!” to all who are working in Open Source and in this case especially to Chris O’Haver, Miek Gieben and Stefan Budeanu! This shows why Open Source is unbeatable when it comes to problems. Of course you have to have know how to do work together like we did in this case, but you have the opportunity to do it! Don’t be afraid and try it! Getting a fix for a problem within 48 hours is absolutely impressive and stunning! I am sure that this is not possible with closed source.

Why Open Source is great - Fix an issue in under 48 hours

Table of contents

The problem

Analyzing the problem

The result

Thank you!