Docker Swarm Network – Down the Rabbit Hole

Last week we tracked down a recurring problem with our Docker Swarm, more precisely with the Docker overlay network. To anticipate the outcome: there is a merged pull request that might fix it, but not for Docker-CE 18.03. The fix is also not included in Docker-CE 18.06.1, but it is already merged into Moby and is part of Docker-CE 18.09.0-ce-tp5, which means it should be available with Docker-CE 18.09.

Description of the problem

If you start a container yourself, or if Docker Swarm starts containers for you, you might see that containers cannot start on specific hosts. If you look into the log files, you will find lines like this:

fails: error="starting container failed: subnet sandbox join failed for \"10.0.0.0/24\": error creating VXLAN interface: file exists" module="node/agent/taskmanager" task.id=lns5iscwnbvh0yjeutq6tsj4q

This means that a VXLAN network interface for a new container that wants to join an overlay network already exists.
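To see which tasks and subnets are affected, the log lines can be filtered. The following is a minimal sketch that assumes the log format shown above; the function name vxlan_failures is made up for illustration, and where the log comes from (for example journalctl -u docker.service on a systemd host) is an assumption about your setup.

```shell
# Read daemon log lines on stdin and print "<task.id> <subnet>" for every
# "error creating VXLAN interface: file exists" failure.
# Assumption: the lines look like the example in this post.
vxlan_failures() {
    grep 'error creating VXLAN interface: file exists' |
        sed -n 's|.*subnet sandbox join failed for [\\"]*\([0-9./]*\).*task\.id=\([A-Za-z0-9]*\).*|\2 \1|p'
}
```

It could be used as journalctl -u docker.service | vxlan_failures, if dockerd logs to the journal on your hosts.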

Explanation

The following is not deeply scientific; it is a summary of information from multiple sources and of our own experience. As I understand it, the startup sequence of a container (driven by dockerd) that uses an overlay network is as follows:

1) Create a VXLAN interface which uses the VXLAN id of the associated Docker network (docker network create --driver=overlay ...) - at this point the VXLAN interface is visible on the host (ip -d link show)
2) Then dockerd puts the VXLAN interface into the namespace of the container - at this point the VXLAN interface is not visible anymore on the host
3) When the container stops, the device is given back to the host
4) The device is deleted by the dockerd

Between 3) and 4) a race condition happens and the network device is not deleted.

The important hint to find out more was given by the user gitbensons on GitHub. Kudos to him! He pointed out that it is possible to find the already existing VXLAN device by running strace against the dockerd process. Here is the strace command to use just before starting an affected container.

THIS COMMAND IS DANGEROUS! IF YOU RUN IT FOR TOO LONG, YOU WILL PROBABLY KILL THE DOCKERD!!!

strace -f -p 52990 2>&1 | grep vx-
13887 sendto(13, "\240\0\0\0\20\0\5\6k\305\22\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\24\0\3\0vx-00106c-clblt\0\10\0\4\0\252\5\0\0\10\0\r\0\0\0\0\0\\\0\22\0\t\0\1\0vxlan\0\0\0L\0\2\0\10\0\1\0l\20\0\0\5\0\5\0\0\0\0\0\5\0\6\0\0\0\0\0\5\0\7\0\1\0\0\0\5\0\v\0\1\0\0\0\5\0\f\0\0\0\0\0\5\0\r\0\1\0\0\0\5\0\16\0\1\0\0\0\6\0\17\0\22\265\0\0", 160, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, 12 <unfinished ...>
13887 recvfrom(13, "\264\0\0\0\2\0\0\0k\305\22\0\205V\0\0\357\377\377\377\240\0\0\0\20\0\5\6k\305\22\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\24\0\3\0vx-00106c-clblt\0\10\0\4\0\252\5\0\0\10\0\r\0\0\0\0\0\\\0\22\0\t\0\1\0vxlan\0\0\0L\0\2\0\10\0\1\0l\20\0\0\5\0\5\0\0\0\0\0\5\0\6\0\0\0\0\0\5\0\7\0\1\0\0\0\5\0\v\0\1\0\0\0\5\0\f\0\0\0\0\0\5\0\r\0\1\0\0\0\5\0\16\0\1\0\0\0\6\0\17\0\22\265\0\0", 4096, 0, {sa_family=AF_NETLINK, pid=0, groups=00000000}, [12]) = 180
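Because a long strace run is risky, it may help to bound it in time and to print only the device names. The sketch below is one way to do that, not a tested recipe for every host: the function names are made up, the name pattern vx-<6 hex digits>-<5 characters> is inferred from the device names in this post, and restricting the trace to sendto and recvfrom assumes the netlink calls look like the ones shown above.

```shell
# Read strace output on stdin and print each distinct VXLAN device name once.
# Assumption: names follow the vx-<6 hex digits>-<5 chars> pattern seen here.
extract_vx_names() {
    grep -o 'vx-[0-9a-f]\{6\}-[a-z0-9]\{5\}' | sort -u
}

# Hypothetical helper: attach to the given PID for a limited time only.
# timeout(1) sends SIGTERM to strace when the time is up, so strace detaches.
# strace writes to stderr, hence the 2>&1 before the pipe.
trace_vx_names() {
    pid=$1
    seconds=${2:-10}
    timeout "$seconds" strace -f -e trace=sendto,recvfrom -p "$pid" 2>&1 |
        extract_vx_names
}
```

For example, trace_vx_names 52990 5 would trace the dockerd PID used above for five seconds while you start an affected container in another terminal.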

In the strace output, you can see that the affected device has the name vx-00106c-clblt. The last five characters of the device name, in this example clblt, are the first characters of the affected overlay network's ID (the short form). Log in to a Docker manager, run docker network ls | grep clblt, and you will find the name of the affected overlay network.

docker network ls | grep clblt
clbltptv9x08        blug-stg_net                     overlay             swarm
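The device name can also be taken apart in a small helper. This is a sketch based on an observation, not on documentation: in the ip -d link show listing in this post, the middle field of each name matches the vxlan id in hexadecimal (vx-001026-clblt has vxlan id 4134, which is 0x1026). The function name decode_vx_name is made up for illustration.

```shell
# Split a device name such as vx-00106c-clblt into its two parts.
# Assumption: the middle field is the VXLAN id in hex and the last field
# is the beginning of the overlay network ID.
decode_vx_name() {
    name=$1
    rest=${name#vx-}      # e.g. "00106c-clblt"
    hex=${rest%%-*}       # e.g. "00106c"
    prefix=${rest#*-}     # e.g. "clblt"
    echo "vni=$((0x$hex)) network_id_prefix=$prefix"
}
```

Calling decode_vx_name vx-00106c-clblt prints vni=4204 network_id_prefix=clblt. On a manager, the prefix can then be looked up with docker network ls --filter id=clblt, which matches on part of a network ID.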

At this point we know which VXLAN device is still there but shouldn’t be. In the next step, list all vx-* devices on the affected host:

ip -d link show | grep vx
302: vx-001026-clblt: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4134 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
5958: vx-00103a-rtu2q: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4154 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
1869: vx-001048-tzm7a: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4168 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
7014: vx-00100d-cajmu: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4109 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
124: vx-00101d-muh5b: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4125 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
177: vx-001011-dr80h: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4113 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
1230: vx-001012-qyjw6: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4114 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
9438: vx-001057-r4tv3: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4183 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
1252: vx-00100f-drzik: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4111 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
9447: vx-00106c-qsonr: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4204 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 
245: vx-00100b-307v6: <BROADCAST,MULTICAST> mtu 1450 qdisc noop state DOWN mode DEFAULT group default 
    vxlan id 4107 srcport 0 0 dstport 4789 proxy l2miss l3miss ageing 300 addrgenmode eui64 

Oops. Now we have a problem. All of these devices are dead (state DOWN) but were not deleted! This means that on this Docker host it will not be possible to start containers that want to join one of the affected overlay networks (look at the IDs).
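For monitoring, the listing can be reduced to just the names of the leftover devices. The following is a minimal sketch that assumes the ip -d link show output format shown above; the function name list_down_vx is made up. Note that a vx-* device may also be visible on the host for a short moment while a container is starting, so a single hit is not necessarily a leftover.

```shell
# Read "ip -d link show" output on stdin and print the names of all vx-*
# devices that are in state DOWN, one per line.
list_down_vx() {
    awk '$2 ~ /^vx-/ && /state DOWN/ { sub(/:$/, "", $2); print $2 }'
}
```

It could be used as ip -d link show | list_down_vx, for example from a periodic check that alerts when the output is not empty.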

Solution

After finding the problematic device, you can delete it, for example with ip link delete vx-00100f-drzik. It is probably good practice to delete all such devices and to monitor your hosts for them, as their presence is an indicator that something is happening which will prevent further containers from starting on the affected networks.
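If you want to clean up several devices at once, a cautious approach is to print the delete commands first and only run them on request. The sketch below does that; the function name delete_vx_devices and the --apply switch are made up for illustration, and the function assumes that every name it receives really is a leftover device.

```shell
# Read device names on stdin, one per line. By default only print the
# commands (dry run); run them when the first argument is "--apply".
delete_vx_devices() {
    mode=${1:-}
    while IFS= read -r dev; do
        case $dev in
            vx-*) ;;          # only ever touch vx-* names
            *) continue ;;    # skip anything else
        esac
        if [ "$mode" = "--apply" ]; then
            ip link delete "$dev"
        else
            echo "ip link delete $dev"
        fi
    done
}
```

For example, echo vx-00100f-drzik | delete_vx_devices prints the command, and adding --apply (as root) executes it. Reviewing the dry-run output first seems wise, since a device that belongs to a container that is just starting should not be removed.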

Summary

From the Urban dictionary: Rabbit Hole: Metaphor for the conceptual path which is thought to lead to the true nature of reality. Infinitesimally deep and complex, venturing too far down is probably not that great of an idea.

It is hard to accept that the error message does not say which file already exists. I know the cause lies in the Go code, because if you only print err, you get no information about which file already exists. Reporting which interface already exists would be nice, and deleting it automatically on container start would be even nicer :-) But I won’t dig deeper, as there is already a merge … don’t forget, Rabbit Holes are dangerous ;-)

Posted on: Tue, 04 Sep 2018 08:54:49 +0200 by Mario Kleinsasser