Docker, block I/O and the black hole

Last week, Bernhard and I had to investigate high disk I/O load reported by one of our storage colleagues on our NFS server, which serves the data for our Docker containers. We still have high disk load because we are running lots of containers, so this post will not resolve a load issue, but it will give you some deep insights into the strange behaviors and technical details we discovered during our I/O deep dive.

The question: Which container (or project) generates the I/O load?

First of all, the high I/O load is not a problem per se. We have plenty of reserves in our storage and we are not investigating any performance issues. But the question asked by our storage colleagues was simple: which container (or project) generates the I/O load?

Short answer: We do not know and we are not able to track it down. Not now and not with the current setup. Read on to understand why.

Finding 1: docker stats does not track all block I/O

No, really, it doesn’t track all block I/O. This took us some time to understand, but let’s go through it step by step. The first thing you will think about when triaging block I/O load is to run docker stats, which is absolutely correct. And that’s where you reach the end of the world, because Docker, or to be more precise the Linux kernel, does not see block I/O which is served over an NFS mount! You don’t believe it? Just look at the following example.

First, create and mount a file system over a loop device. Mount an NFS share onto a folder inside this mount and monitor the block I/O on this device to see what happens, or rather, what you cannot see.
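A sketch of such a setup (device names, paths and the NFS server address are placeholders, so adapt them to your environment):

    # create a 1 GB file, attach it to a loop device and create a file system on it
    dd if=/dev/zero of=/tmp/loopfile bs=1M count=1024
    losetup /dev/loop0 /tmp/loopfile
    mkfs.ext4 /dev/loop0

    # mount the loop device and mount an NFS share into a folder inside it
    mkdir -p /mnt/testmountpoint
    mount /dev/loop0 /mnt/testmountpoint
    mkdir -p /mnt/testmountpoint/nfsmount
    mount -t nfs nfsserver.example.com:/export/test /mnt/testmountpoint/nfsmount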

At this point, open a second console. In the first console enter a dd command to write a file into /mnt/testmountpoint/nfsmount and in the second console, start the iostat command to monitor the block I/O on your loop device.
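For example (the file size is arbitrary):

    # console 1: write 100 MB into the NFS share inside the loop device mount
    dd if=/dev/zero of=/mnt/testmountpoint/nfsmount/testfile bs=1M count=100

    # console 2: watch the block I/O on the loop device
    iostat -x 1 loop0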

If you run this test, iostat does not register any block I/O, because the I/O never touches the underlying disk. If you do the same test without using the mounted NFS share, you will see the block I/O in the iostat output as usual.

The same is true if you are using Docker volume NFS mounts! The block I/O is not tracked, and this is perfectly logical, because this block I/O never touches a local disk. Bad luck. The same holds for any other mount type that does not write to local disks, like Gluster (FUSE) and many more.
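Such a volume is created like this (server address and export path are placeholders):

    docker volume create --driver local \
        --opt type=nfs \
        --opt o=addr=nfsserver.example.com,rw \
        --opt device=:/export/containerdata \
        nfsvolume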

Finding 2: docker stats does not track block I/O fully correctly

We think we will open an issue for this, because docker stats counts the block I/O incorrectly. You can test this by starting a container, running a deterministic dd command and watching the docker stats output of the container in parallel, for example like this:
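(Image, names and sizes are placeholders; run the dd command twice to match the numbers below.)

    # console 1: start a test container and write 100 MB inside it
    docker run -ti --rm --name blkio-test ubuntu bash
    dd if=/dev/zero of=/testfile bs=1M count=100

    # console 2: watch the BLOCK I/O column for this container
    docker stats blkio-test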

In our test run, the first dd write was completely unseen by the docker stats command. This might be OK, because there are several buffers involved in write operations. But when the dd command was issued a second time, to write an additional 100 megabytes, the docker stats command showed a summary of 0B / 393MB, roughly 400 megabytes. The test wrote 200 megabytes, but docker stats showed double the amount of data written. Strange, but why does this happen?

At this point, more information is needed. Therefore it is recommended to query the Docker API to retrieve more detailed information about the container stats. This can be done with a recent version of curl (7.40 or newer, which supports --unix-socket).
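For example (one-shot stats for our test container, pretty-printed with Python):

    curl -s --unix-socket /var/run/docker.sock \
        "http://localhost/containers/blkio-test/stats?stream=false" | python -m json.tool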

Now, search for io_service_bytes_recursive in the JSON output. There will be something like this:
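(A sketch only – the minor numbers and values are illustrative placeholders; each device additionally gets Read/Sync/Async/Total entries.)

    "io_service_bytes_recursive": [
        {"major": 8, "minor": 0, "op": "Write", "value": ...},
        {"major": 253, "minor": 1, "op": "Write", "value": ...},
        {"major": 253, "minor": 3, "op": "Write", "value": ...},
        ...
    ]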

Oops, there are three block devices here. Where are they coming from? If the totals are summed up, we get the 393 megabytes we have seen before. The major and minor numbers identify the device type. The documentation of the Linux kernel includes the complete list of device major and minor numbers. Major number 8 identifies a block device as a SCSI disk device, and this is correct, as the server uses sd* for the local devices. Major number 253 refers to RESERVED FOR DYNAMIC ASSIGNMENT, which is also correct, because the container gets a local mount for the write layer. Therefore there are multiple devices: the real device sd* and the dynamic device for the writable image layer, which writes the data to the local disk. That’s why the block I/O is counted multiple times!

But we can dig even deeper and inspect the cgroup information used by the Linux kernel to isolate the resources for the container. This information can be found under /sys/fs/cgroup/blkio/docker/<container id>, e.g. /sys/fs/cgroup/blkio/docker/195fd970ab95d06b0ca1199ad19ca281d8da626ce6a6de3d29e3646ea1b2d033. The file blkio.throttle.io_service_bytes contains the information about what data was really transferred to the block devices.
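You can read the file directly. A sketch of its shape (values elided here; one line per device and operation, plus an overall total):

    cat /sys/fs/cgroup/blkio/docker/<container id>/blkio.throttle.io_service_bytes
    8:0 Read ...
    8:0 Write ...
    8:0 Total ...
    253:1 Write ...
    ...
    Total ...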

There we have the correct numbers. In the Total line we have roughly 250 megabytes: 200 megabytes were written by the dd commands, and the rest would be logging and other I/O. This is the correct number. You can verify this yourself by running a dd command and watching the blkio.throttle.io_service_bytes content.

Conclusion

The docker stats command is really helpful to get an overview of your block device I/O, but it does not show the full truth. It is useful to monitor containers that are writing local data, which may indicate that something is not correctly configured regarding data persistence. Furthermore, if you use network shares to allow the containers to persist data, you cannot measure the block I/O on the Docker host the container is running on. The ugly part: if you are using one physical device (a large LVM volume, for example) on your network share server, you will only get one big block I/O number, and you will not be able to assign the I/O to a container, a group of containers or a project.

Facts:
– If you use NFS (or other network shares) backed by a single block device on the NFS server, you can only get the sum of all block I/O and you cannot assign this block I/O to a concrete project or container
– Use separate block devices for your shares
– Even if you use Gluster, you will have exactly the same problem – FUSE mounts are also not seen by the Linux kernel

Follow up

We are currently evaluating a combination of thinly provisioned LVM devices and Gluster to report the block I/O via iostat (its JSON output) to Elasticsearch. Stay tuned for more about this soon!

-M


Writing a Docker Volume Plugin for CephFS

Currently we are evaluating Ceph for our on-premises Docker/Kubernetes cluster for persistent volume storage. Kubernetes officially supports Ceph RBD and CephFS as storage volume drivers. Docker currently does not offer a Docker Volume plugin for CephFS.

But there are some plugins available online. A Google search comes up with a handful of plugins that support the CephFS protocol, but the results are quite old (> 2 years) and outdated, or they pull in too many dependencies, like direct Ceph cluster communication.

This blog post will be a little longer, as it is necessary to provide some basic facts about Ceph, and because there are some odd pitfalls during plugin creation. Without the great Docker Volume Plugin for SSHFS written by Victor Vieux, it would not have been possible for me to understand the Docker Volume Plugin structure! Thank you for your work!

Source code of the Docker Volume Plugin for CephFS can be found here.

About Ceph

Basically, Ceph is a storage platform that provides three types of storage: RBD (RADOS Block Device), CephFS (shared file system) and object storage (S3-compatible protocol). Besides this, Ceph offers some API interfaces to operate the Ceph storage remotely. Usually, mounting RBD and CephFS is enabled by installing the Ceph client part on your Linux machine via APT, YUM or whatever is available. This client-side software installs a Linux kernel module which can be used for a classic mount command like mount -t ceph .... Alternatively, the use of FUSE is also possible. The usage of the client-side bindings can be tricky when different versions of the Ceph cluster (e.g. the Mimic release) and the Ceph client (e.g. Luminous) are in use. This may lead to the situation where someone creates an RBD device with a newer feature set than the client supports, which may result in a non-mountable file system.
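A classic kernel mount of CephFS looks like this (monitor address, path and credentials are placeholders):

    mount -t ceph mon1.example.com:6789:/ /mnt/cephfs \
        -o name=admin,secretfile=/etc/ceph/admin.secret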

RBD devices are meant to be exclusively mounted by exactly one end system, like a container, which is pretty clear, as you would also never share a physical device between two end systems. RBD block devices therefore cannot be shared between multiple containers. Most of the RBD volume plugins are able to create such a device during the creation of a volume if it does not exist. This means that the plugin must be able to communicate with the Ceph cluster, either via the installed Ceph client software on the server or via the implementation of one of the Ceph API libraries.

CephFS is a shared filesystem which is backed by the Ceph cluster and which can be shared between multiple end systems like any other shared file system you may know. It has some nice features like file system paths which can be authorised separately.

The Kubernetes Persistent Volume documentation contains a matrix about the different file systems and which modes (ReadWriteOnce, ReadOnlyMany, ReadWriteMany) they support.

Docker Volume Plugin Anatomy

Due to the great work of Victor Vieux, I was able to get familiar with the anatomy of a Docker Volume plugin, as the official Docker documentation is a little bit, uhm, short. I’m not a programmer, but especially the Docker GitHub go-plugins-helpers repository contains a lot of useful stuff, and in sum I was able to copy/paste/change the plugin within a day.

The api.go file of the plugin helpers contains the interface methods which need to be implemented by a plugin.
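For reference, the interface looks roughly like this (taken from the volume package of go-plugins-helpers; check the version you vendor for the exact request/response types):

    // volume.Driver from github.com/docker/go-plugins-helpers/volume
    type Driver interface {
        Create(*CreateRequest) error
        List() (*ListResponse, error)
        Get(*GetRequest) (*GetResponse, error)
        Remove(*RemoveRequest) error
        Path(*PathRequest) (*PathResponse, error)
        Mount(*MountRequest) (*MountResponse, error)
        Unmount(*UnmountRequest) error
        Capabilities() *CapabilitiesResponse
    }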

Some words about the interface:

Get and List are used to retrieve the information about a volume and to list the volumes powered by the volume plugin when someone executes docker volume ls.

Create creates the volume within the volume plugin, but it does not call the mount command at this time. The volume is only created and nothing more.

Mount is called when a container which uses the created volume starts.

Path is used to track the mount paths for the container.

Unmount is called when the container stops.

Remove is called when the deletion of the volume is requested.

Capabilities is used to describe the capabilities of the volume driver, for example whether the volumes it provides are local or global in scope.

Besides this, every plugin contains a config.json file which describes the configuration (and capabilities) of the plugin.
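A stripped-down sketch of such a config.json for a volume plugin (the values here are assumptions for illustration, not the exact file of this plugin):

    {
        "description": "CephFS volume plugin for Docker",
        "entrypoint": ["/usr/bin/docker-volume-cephfs"],
        "interface": {
            "types": ["docker.volumedriver/1.0"],
            "socket": "cephfs.sock"
        },
        "network": {"type": "host"},
        "linux": {
            "capabilities": ["CAP_SYS_ADMIN"],
            "devices": [{"path": "/dev/fuse"}]
        }
    }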

The plugin itself must be packaged in a special file structure: a directory containing the config.json and a rootfs directory holding the plugin’s root file system!

How to write the plugin

OK, I admit, I just copied the Docker Volume SSHFS plugin 🙂 and after that I did the following (besides learning the structure):

1) I changed the config.json of the plugin and removed all the things that my plugin does not need
2) I changed the functions mentioned above to reflect the needs of my plugin
3) I packed everything together, tested it, and uploaded it.

Points 1) and 2) are just programming and configuring. But 3) is more interesting, because that is where the pitfalls are, and these pitfalls are described in the following section.

The pitfalls

Pitfall 1 Vendors

The first thing I did during development was to refresh the vendors. And this was also my first problem, as it was not possible to get the plugin up and running. There is a little bug in the api.go of the helper: the CreatedAt field cannot be JSON encoded if it is empty. There is already a GitHub PR for it, which simply adds the needed annotations to the struct. You can use the PR or you can just add the needed annotations to the struct like this:
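The annotated struct looks roughly like this (shortened; see the PR for the exact change):

    // in vendor/github.com/docker/go-plugins-helpers/volume/api.go
    type Volume struct {
        Name       string
        Mountpoint string
        CreatedAt  string `json:",omitempty"` // the added annotation
        Status     map[string]interface{}
    }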

Pitfall 2 Make

The SSHFS Docker Volume plugin is great! Make your life easier and use the provided Makefile! You can create the plugin rootfs with it (make rootfs) and you can easily create the plugin with it (make create)!

Pitfall 3 Push

After I had done all the work, I uploaded the source code to GitLab and created a pipeline to push the resulting Docker container image to Docker Hub so everyone can use it. But this won’t work. After fiddling around for an hour, I had the eye opener: the docker plugin command has a separate push function. So you have to use docker plugin push to push a Docker plugin to Docker Hub!

Be aware: The Docker Hub repository must not exist before your first push! If you create a repository manually or push a container image into it, it will be flagged as a container repository and you can never ever push a plugin to it! The error message will be denied: requested access to the resource is denied.

To be able to push the plugin, it must be installed (or at least created) in your local Docker engine. Otherwise you cannot push it!
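So the working flow is roughly this (the repository name is a placeholder; the plugin directory contains the config.json and the rootfs/ directory):

    docker plugin create myaccount/docker-volume-cephfs ./plugin
    docker plugin push myaccount/docker-volume-cephfs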

Pitfall 4 Wrong Docker image

Be aware that you use the correct Docker base image when you are writing a plugin. If you build your binary on Ubuntu, you might not be able to run it inside your final Docker Volume Plugin container, because the image you use is based on Alpine (or the other way around).

Pitfall 5 Unresolved dependencies

Be sure to include all your dependencies in your Docker image build process. For example: if you need the gluster-client, you have to install it in your Dockerfile, so the dependencies are in place when the Docker Volume Plugin image is loaded by the container engine.

Pitfall 6 Linux capabilities

Inside the Docker plugin configuration, you have to specify all Linux capabilities your plugin needs. If you miss a capability, the plugin will not do what you want it to do. E.g.:
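For example, a FUSE-based plugin needs something like this in its config.json (a sketch; list whatever your plugin really needs):

    "linux": {
        "capabilities": ["CAP_SYS_ADMIN"],
        "devices": [{"path": "/dev/fuse"}]
    }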

Debug

A word about debugging a Docker Volume Plugin. Besides the information you get from the Docker site (debug via the Docker socket), I found it helpful to just run the resulting Docker Volume image as a normal container via docker run. This gives you the ability to test whether the Docker image includes all the stuff you need, so you can do what you want with your plugin later. If you go this way, you have to use the correct docker run options with all the capabilities, devices and the privileged flag. Yes, Docker Volume Plugins run privileged! Here is an example command: docker run -ti --rm --privileged --net=host --cap-add SYS_ADMIN --device /dev/fuse myrootfsimage bash. After this, test if all features are working.

That’s all! If you have questions, just contact me via the various channels.
-M


Docker Swarm Network – Down the Rabbit Hole

Last week we tracked down a recurring problem with our Docker Swarm, more precisely with the Docker overlay network. To anticipate the result: there is a merge which might fix this, but not for Docker-CE 18.03. The pull request mentioned is also not included in Docker-CE 18.06.1, but it is already merged into Moby and part of Docker-CE 18.09.0-ce-tp5, which means that the fix should be available with Docker-CE 18.09.

Description of the problem

If you try to start a container, or if you have a Docker Swarm which starts containers for you, you might see that containers cannot start on specific hosts. If you take a look into the log files, you will find lines like this:
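The relevant part of the message reports that the VXLAN interface already exists; from memory it looks roughly like this (paraphrased, the exact wording differs between versions):

    error creating vxlan interface: file exists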

This means that a VXLAN network interface for a new container which would like to join an overlay network already exists.

Explanation

The next sentences are not deeply scientific; they are more a summary of several pieces of information and experience. As I understand it, the startup sequence of a container (driven by dockerd) which uses an overlay network is as follows:

1) A VXLAN interface is created which uses the VXLAN id of the associated Docker network (docker network create --driver=overlay ...) – at this point the VXLAN interface is visible on the host (ip -d link show)
2) Then dockerd puts the VXLAN interface into the namespace of the container – at this point the VXLAN interface is no longer visible on the host
3) When the container stops, the device is given back to the host
4) The device is deleted by dockerd

Between 3) and 4) a race condition happens and the network device is not deleted.

The important hint to find out more was given by the user gitbensons on GitHub – kudos to him! He pointed out that it is possible to find the already existing VXLAN device by running strace against the dockerd process. Here is the strace command to use just before starting an affected container.
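A sketch of such an invocation (an assumption on my part: the interface is created via netlink, so the device name shows up in the traced sendmsg payloads; -s raises the string limit so the name is not truncated):

    strace -f -s 256 -e trace=sendmsg -p $(pidof dockerd)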

THIS COMMAND IS DANGEROUS! IF YOU RUN IT FOR TOO LONG, YOU WILL PROBABLY KILL THE DOCKERD!!!

In the strace output, you can see that the affected device has the name vx-00106c-clblt. The last five characters of the device name, in this example clblt, specify the short id of the affected overlay network. Log in to a Docker manager, run docker network ls | grep clblt, and you will find the name of the affected overlay network.

At this point we know which VXLAN device is still there but shouldn’t be. In the next step, just list all vx-* devices on the affected host:
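    ip -d link show | grep 'vx-'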

Oops. Now we have a problem. All of these devices are dead (state DOWN) but were not deleted! This means that on this Docker host it will not be possible to start containers which would like to join one of the affected overlay networks (look at the ids).

Solution

After finding the problematic device, you can delete it with, for example, ip link delete vx-00100f-drzik. It is probably good practice to delete all leftover devices and to monitor your hosts for such devices, as they are an indicator that something is happening which will prevent further containers from starting for the affected networks.
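If you want to clean up all leftover devices at once, a small loop does it (a sketch; double-check that the devices are really stale before deleting them):

    # delete every remaining vx-* device on this host
    for dev in $(ip -o link show | awk -F': ' '/: vx-/{print $2}'); do
        ip link delete "$dev"
    done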

Summary

From the Urban dictionary: Rabbit Hole: Metaphor for the conceptual path which is thought to lead to the true nature of reality. Infinitesimally deep and complex, venturing too far down is probably not that great of an idea.

It is hard to accept that the error message does not state which file already exists. I know the cause is in the Go code: if you only print err, you will not get any information about which file already exists. Printing which interface already exists would be nice, or deleting it automatically on container start would be even nicer 🙂 But I won’t dig deeper, as there is already a merge … don’t forget, Rabbit Holes are dangerous 😉


Testing Remote TCP/IP Connectivity – IBM i (AS400)

Preface

I’m used to running telnet to do a quick check whether a remote server is reachable and listening on a specific port. When trying this on i5/OS with the telnet CMD, you may get a headache!
After some research I ended up with openssl in PASE to accomplish my task on IBM i (AS400).

telnet vs openssl syntax

On a 5250 Telnet command line, you first have to enter PASE using
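    CALL QP2TERM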

Then run
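    openssl s_client -connect <remotehost>:<port>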

instead of
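    telnet <remotehost> <port>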

as telnet is not installed in PASE.

Examples

The transcripts below are sketches with placeholder host names, ports and addresses; the exact messages vary by platform and client version.

Success, with server using SSL

with openssl
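    $ openssl s_client -connect remotehost.example.com:992
    CONNECTED(00000003)
    depth=0 CN = remotehost.example.com
    verify return:1
    ...
    -----BEGIN CERTIFICATE-----
    ...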

with telnet
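    $ telnet remotehost.example.com 992
    Trying 192.0.2.10...
    Connected to remotehost.example.com.
    Escape character is '^]'.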

Failure

with openssl
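    $ openssl s_client -connect remotehost.example.com:4711
    connect: Connection refused
    connect:errno=...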

with telnet
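    $ telnet remotehost.example.com 4711
    Trying 192.0.2.10...
    telnet: connect to address 192.0.2.10: Connection refused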

Success, with server not using SSL

with openssl
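    $ openssl s_client -connect remotehost.example.com:23
    CONNECTED(00000003)
    ...:SSL routines:...:unknown protocol:...
    ---
    no peer certificate available

The TCP connection succeeds (CONNECTED), only the SSL handshake fails – which is exactly the connectivity information we wanted.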

with telnet
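    $ telnet remotehost.example.com 23
    Trying 192.0.2.10...
    Connected to remotehost.example.com.
    Escape character is '^]'.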


Docker South Austria meets DevOps & Security Meetup in Vienna

Just to let you know, I will give a talk at the DevOps & Security Meetup Vienna. The topic of my talk will be GitOps: DevOps in Real Life.

Here is the featured text of the talk (translated from German):

Introducing container technologies in an on-premises environment brings many changes with it. Especially the evolution of collaboration between the teams, as well as the constant change and further development of the technologies, lead to ever new challenges, but also to ever new, creative and innovative solutions. Since the beginning of these changes two years ago, we have continuously improved and have introduced the GitOps method. Many of these developments often take place in public clouds, but they are just as available on-premises. In my talk I will show that using these technologies on-premises is possible and works excellently. Special guests: GitLab, Puppet, Prometheus, Elastic, Docker, CoreDNS, Kubernetes and many more 🙂

If you like, please join us in Vienna on the 3rd of July!

-M


Kubernetes the roguelike way

It has been a while since the last post, but we have had a busy time. Together with my colleagues, I have spent the last few weeks writing an extensive piece of documentation called Kubernetes the roguelike way.

The documentation is about our on-premises Kubernetes setup. We will continue the documentation as we make progress. Have a lot of fun, and if you have suggestions, please open an issue in the GitLab project.

