Docker, block I/O and the black hole

Last week, Bernhard and I had to investigate high disk I/O load reported by one of our storage colleagues on our NFS server which serves the data for our Docker containers. We still have high disk load, because we are running lots of containers and therefore this post is will not resolve a load issue but it will give you some deep insights about some strange behaviors and technical details which we discovered during our I/O deep dive.

The question: Which container (or project) generates the I/O load?

First of all, the high I/O load is not the problem per se. We have plenty reserves in our storage and we are not investigating any performance issues. But the question asked by our storage colleagues was as simple as to ask which container (or project) generates the I/O load?

Short answer: We do not know and we are not able to track it down. Not now and not with the current setup. Read ahead to get used to the “why”?

Finding 1: docker stats does not track all block I/O

No, really, it doesn’t track all block I/0. This took us some time to understand, but lets do it step by step. The first thing you will think about when triaging block I/O loads is to run docker stats which is absolutely correct. And that’s where you reach the end of the world because Docker and to be more precise, the Linux Kernel, does not see block I/0 which is served over a NFS mount! You don’t believe it? Just look at the following example.

First, create and mount a file system over a loop device. Mount a NFS share onto a folder inside this mount and monitor the block I/O on this device to see what happens, respectively what you cannot see.

1
2
3
4
5
6


dd if=/dev/zero of=testdisk bs=50M count=100
mkfs.ext4 testdisk 
mkdir /mnt/testmountpoint
mount -o loop=/dev/loop0 testdisk /mnt/testmountpoint
mkdir /mnt/testmountpoint/nfsmount
mount -t nfs <mynfsserver>:/myshare /mnt/testmountpoint/nfsmount

At this point, open a second console. In the first console enter a dd command to write a file into /mnt/testmountpoint/nfsmount and in the second console, start the iostat command to monitor the block I/O on your loop device.

1
2


# First console
dd if=/dev/zero of=/mnt/testmountpoint/nfsmount/testfile bs=40M count=100

1
2


# Second console
iostat -dx /dev/loop0 1

Here is an output from this run and as you can see, iostat does not recognize any block I/O because the I/O never touches the underlying disk. If you do the same test without using the mounted NFS share, you will see the block I/O in the iostat command as usual.

The same is true, if you are using docker volume NFS mounts! The block I/O is not tracked and it’s fully logical because this block I/O never touches a local disk. Bad luck. Even it is true with any other mount type that will not be written to local disks like Gluster (FUSE) and many more.

Finding 2: docker stats tracks block I/O not fully correct

We think we will open an issue for this, becausedocker stats counts the block I/O wrong. You can test this by starting a container, run a deterministic dd command and watch the docker stats output of the container in parallel. See the terminal recording to get an idea.

As the recording shows, the first dd write is completely unseen by the docker stats command. This might be OK, because there are several buffers for write operations involved. But, as the dd command is issued second time, to write additional 100 megabytes, the docker stats command shows a summary of 0B / 393MB megabytes, roughly 400 megabytes. The test wrote 200 megabytes, but docker stat shows the doubled amount of data written. Strange buy why does this happen?

At this point, more information is needed. Therefore it is recommended to query the docker api to retrieve more detailed information about the container stats. This can be done by using an actual version of curl which would generate the following output.

1
2
3


docker run -ti -v /var/run/docker.sock:/var/run/docker.sock nathanleclaire/curl sh
curl --unix-socket /var/run/docker.sock http:/containers/195fd970ab95/stats
{"read":"2018-09-30T07:18:13.870693255Z","preread":"0001-01-01T00:00:00Z","pids_stats":{"current":1},"blkio_stats":{"io_service_bytes_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":144179200},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":144179200},{"major":8,"minor":0,"op":"Total","value":144179200},{"major":253,"minor":0,"op":"Read","value":0},{"major":253,"minor":0,"op":"Write","value":104857600},{"major":253,"minor":0,"op":"Sync","value":0},{"major":253,"minor":0,"op":"Async","value":104857600},{"major":253,"minor":0,"op":"Total","value":104857600},{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":144179200},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":144179200},{"major":8,"minor":0,"op":"Total","value":144179200}],"io_serviced_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":100},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":100},{"major":8,"minor":0,"op":"Total","value":100},{"major":253,"minor":0,"op":"Read","value":0},{"major":253,"minor":0,"op":"Write","value":50},{"major":253,"minor":0,"op":"Sync","value":0},{"major":253,"minor":0,"op":"Async","value":50},{"major":253,"minor":0,"op":"Total","value":50},{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":100},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":100},{"major":8,"minor":0,"op":"Total","value":100}],"io_queue_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":0},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":0},{"major":8,"minor":0,"op":"Total","value":0}],"io_service_time_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":3145508615},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":3145508615},{"major":8,"minor":0,"op":"Total","value":3145508615}],"io_wait_time_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":2942389316},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":2942389316},{"major":8,"minor":0,"op":"Total","value":2942389316}],"io_merged_recursive":[{"major":8,"minor":0,"op":"Read","value":0},{"major":8,"minor":0,"op":"Write","value":0},{"major":8,"minor":0,"op":"Sync","value":0},{"major":8,"minor":0,"op":"Async","value":0},{"major":8,"minor":0,"op":"Total","value":0}],"io_time_recursive":[{"major":8,"minor":0,"op":"","value":94781197}],"sectors_recursive":[{"major":8,"minor":0,"op":"","value":281600}]},"num_procs":0,"storage_stats":{},"cpu_stats":{"cpu_usage":{"total_usage":323236434,"percpu_usage":[138755707,5333466,2227292,2923874,24996475,16587971,129569530,2842119],"usage_in_kernelmode":280000000,"usage_in_usermode":20000000},"system_cpu_usage":10647668950000000,"online_cpus":8,"throttling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"precpu_stats":{"cpu_usage":{"total_usage":0,"usage_in_kernelmode":0,"usage_in_usermode":0},"throttling_data":{"periods":0,"throttled_periods":0,"throttled_time":0}},"memory_stats":{"usage":107589632,"max_usage":118456320,"stats":{"active_anon":487424,"active_file":0,"cache":104857600,"dirty":0,"hierarchical_memory_limit":9223372036854771712,"hierarchical_memsw_limit":0,"inactive_anon":0,"inactive_file":104857600,"mapped_file":0,"pgfault":6829,"pgmajfault":0,"pgpgin":57351,"pgpgout":31632,"rss":487424,"rss_huge":0,"total_active_anon":487424,"total_active_file":0,"total_cache":104857600,"total_dirty":0,"total_inactive_anon":0,"total_inactive_file":104857600,"total_mapped_file":0,"total_pgfault":6829,"total_pgmajfault":0,"total_pgpgin":57351,"total_pgpgout":31632,"total_rss":487424,"total_rss_huge":0,"total_unevictable":0,"total_writeback":0,"unevictable":0,"writeback":0},"limit":16818872320},"name":"/testct","id":"195fd970ab95d06b0ca1199ad19ca281d8da626ce6a6de3d29e3646ea1b2d033","networks":{"eth0":{"rx_bytes":49518,"rx_packets":245,"rx_errors":0,"rx_dropped":0,"tx_bytes":0,"tx_packets":0,"tx_errors":0,"tx_dropped":0}}}

Now, search for io_service_bytes_recursive in the json output. There will be something like this:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24


{
  "blkio_stats": {
    "io_service_bytes_recursive": [
      {
        "major": 8,
        "minor": 0,
        "op": "Total",
        "value": 144179200
      },
      {
        "major": 253,
        "minor": 0,
        "op": "Total",
        "value": 104857600
      },
      {
        "major": 8,
        "minor": 0,
        "op": "Total",
        "value": 144179200
      }
    ]
  }
}

Ups, there are three block devices here. Where are they coming from? If the totals are summed up, we get the 393 megabytes we have seen before. The major and minor numbers identify the device type. The documentation of the Linux kernel includes the complete list of the device major and minor numbers. The major number 8 identifies a block device as SCSI disk device and this is correct, as the server uses sd* for the local devices. The major numner 253 refers to RESERVED FOR DYNAMIC ASSIGNMENT which is also correct, because the container get a local mount for the write layer. Therefore there are multiple devices: The real device sd* and the dynamic device for the writeable image layer, which will write the data to the local disk. That’s way the block I/O is counted multiple times!

But we can dig even deeper and we can inspect the cgroup information used by the Linux kernel to isolate the resources for the container. This information can be found under /sys/sys/fs/cgroup/blkio/docker/<container id> eg /sys/fs/cgroup/blkio/docker/195fd970ab95d06b0ca1199ad19ca281d8da626ce6a6de3d29e3646ea1b2d033. The file blkio.throttle.io_service_bytes contains the information what data was really transferred to the block devices. For this test container the output will be:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11


8:0 Read 0
8:0 Write 144179200
8:0 Sync 0
8:0 Async 144179200
8:0 Total 144179200
253:0 Read 0
253:0 Write 104857600
253:0 Sync 0
253:0 Async 104857600
253:0 Total 104857600
Total 249036800

There we have the correct output. In SUM Total we have roughly 250 megabytes. 200 megabytes were written by the dd commands and the rest would be logging and other I/O stuff. This would be the correct number. You can test this by yourself by running a dd command and watching the blkio.throttle.io_service_bytes content.

Conclusion

The docker stats command is really helpful to get an overview about you block device I/O, but it does not show the full truth. But it is useful to monitor containers that are writing local data, which may indicate, that something is not correctly configured regarding the data persistence. Furthermore, if you use network shares to allow the containers to persist data, you cannot measure the block I/O count on the Docker host the container is running on. The ugly part is, if you are using one physical device (large LVM for example) on your network share server, you will only get one great number of block I/O but you will not be able to assign the I/O to a container, a group of containers or a project.

Facts: - If you use NFS (or whatever shares) which are backed by a single block device on the NFS server, you can only get the sum of all block I/O and you cannot assign this block I/O to a concrete project or container - Use separate block devices for your shares - Even if you use Gluster, there will be the exactly same problem - FUSE mounts are also not seen by the Linux kernel

Follow up

We are currently evaluating a combination of thin allocated LVM devices in combination with Gluster to report the block I/O via iostat (the json out) to Elastic search. Stay tuned for more about this soon!

-M