Podman with VXLAN Overlay Network Deep Dive

Estimated reading time: 14 mins

What’s the difference between users and engineers? Mostly, engineers are not satisfied with “it just works”. Over the last couple of days I did a deep dive into the ocean of container networking. You may know that most of my posts cover Docker topics, so why am I writing about Podman today? Simply because Podman cannot build overlay networks out of the box (in the Kubernetes world this is the task of a CNI plugin), and furthermore there are no articles about this specific topic available online. Beyond this, Docker already does all of the following steps out of the box for you - without Kubernetes and CNI. But if you are running a fairly small Docker Swarm setup and would like to migrate it to Podman without the whole Kubernetes stack (including the network layer), this might be something for you (if you are a gearhead)… 🤗!

In general, I am always curious about how the different solutions are built, and with the release of Red Hat’s OpenShift 4 I was triggered to find out how things work there in detail - because most of the time everyone cooks with the same water. 😂 And then there is kube-ovn, and the question was: “How is this all stitched together?”

The featured image of this post comes from the movie “Apollo 13”, from a scene where Jim Lovell (played by Tom Hanks) does some quick manual calculations - you can re-watch this scene here. For me that captures the difference between users and engineers: engineers can go on where users run out of options, because it is their job to know what is going on in detail.

Important note!

In the following post I am using a lot of information from other people’s work and I will reference every source! All (or at least most) of the information for this post is already out there, but it is fragmented, shattered into tiny pieces and distributed across the internet.

List of the sources

To honor all the work I will start with the list of sources before the post:

[1] How to Install Podman on Ubuntu - Josphat Mutai - https://computingforgeeks.com/how-to-install-podman-on-ubuntu/
[2] Deep Dive 3 into Docker Overlay Networks - Part 3 - Laurent Bernaille - https://blog.d2si.io/2017/08/20/deep-dive-3-into-docker-overlay-networks-part-3/
[3] Linux: Network Namespace - Arie Bregman - https://devinpractice.com/2016/09/29/linux-network-namespace/
[4] OpenStack Networking: Open vSwitch and VXLAN introduction - Artem Sidorenko - https://www.sidorenko.io/post/2018/11/openstack-networking-open-vswitch-and-vxlan-introduction/
[5] VXLANs FAQ - Open vSwitch - http://docs.openvswitch.org/en/latest/faq/vxlan/
[6] Encrypt Open vSwitch Tunnels with IPsec - Open vSwitch - http://docs.openvswitch.org/en/latest/howto/ipsec/
[7] Introduction to Linux interfaces for virtual networking (VETH) - Red Hat - https://developers.redhat.com/blog/2018/10/22/introduction-to-linux-interfaces-for-virtual-networking/#veth
[8] FORWARD and NAT Rules - Red Hat - https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/4/html/Security_Guide/s1-firewall-ipt-fwd.html
[9] tcpforward - Benjamin Rokseth - https://github.com/digibib/tcpforward
[10] kube-ovn - Mengxin Liu - https://github.com/alauda/kube-ovn

Prerequisites

Here is the plan for what we would like to do in this longer post. For a better overview I’ve made a sketch which you can click on the left. The focus will be the network layer - but I have to say that I am not a networking expert 🙈 - so please be patient if I am not entirely correct on all topics. The goal is to build a VXLAN overlay network which spans two hosts, run at least two containers on this network AND get traffic from the outside to the internal containers - this is an important point! Most of the documentation out there stops at the point where the containers can communicate with the external network. IMHO that is not enough, because most of the time users need to reach the services running inside an overlay from the external network. And yes, we will do this all manually.

A minimal reasonable setup requires two virtual hosts. You could do all of this on a single machine too, but hey, we would like to build a cluster setup! I will do it with two/three virtual machines.

Install Podman on all virtual machines

I use Ubuntu 18.04 for this demo because we use Ubuntu heavily and already use Ansible for the installation, so it is easy for me to set the machines up quickly. The installation of Podman itself is easy as well - I’ve taken my information from [1]. The summary of the linked article is:

> apt update
> apt -y install software-properties-common
> add-apt-repository -y ppa:projectatomic/ppa
> apt -y install podman
> mkdir -p /etc/containers
> curl https://raw.githubusercontent.com/projectatomic/registries/master/registries.fedora -o /etc/containers/registries.conf
> curl https://raw.githubusercontent.com/containers/skopeo/master/default-policy.json -o /etc/containers/policy.json

Install Podman on all of your hosts.

Install Open vSwitch

Open vSwitch can be installed easily via the Ubuntu package management. If you would like to see a Kubernetes CNI plugin that works with this technology, take a look at [10].

> apt install openvswitch-switch

Configure Open vSwitch and create the overlay network

For this section most of the information comes from the articles referenced in [2] and [3]. These are really, really great articles and if you have the time, do yourself a favor and read them!

Creating the namespace for the Open vSwitch bridge and creating the VXLAN connection

All of the following commands will be executed on all of our virtual machines.

First of all, we will create a separate network namespace in which our Open vSwitch bridge will be placed. We will name the namespace like we will name our Open vSwitch bridge later.

> ip netns add ovsbr0

Now we create the Open vSwitch bridge inside the already created namespace:

> ip netns exec ovsbr0 ovs-vsctl add-br ovsbr0

After that, bring up the interface of our ovsbr0 Open vSwitch bridge:

> ip link set ovsbr0 up

Next we will create a VXLAN interface. I will do this with the Open vSwitch capabilities, because that is what is more commonly used today (OpenShift uses it, for example). For most of these parts, [4] was enormously helpful! OK, let’s start! The following command creates an interface called podvxlanA that connects to the remote virtual machine, using the default VXLAN port 4789 [5]. This will be the VXLAN for our inter-Pod connection. The key option is important as it identifies the VXLAN. The psk option enables encryption of the VXLAN traffic; there are different encryption methods available, and for this example I chose the simplest one. More information can be found under [6].

> ovs-vsctl add-port ovsbr0 podvxlanA -- set interface podvxlanA type=vxlan options:remote_ip=10.0.0.3 options:key=5000 options:psk=swordfish

Run the same command on the other host to connect back to the first one (using the IP address of the first host):

> ovs-vsctl add-port ovsbr0 podvxlanA -- set interface podvxlanA type=vxlan options:remote_ip=10.0.0.2 options:key=5000 options:psk=swordfish
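One thing worth keeping in mind, which the commands above do not cover: VXLAN encapsulation costs MTU. A quick back-of-the-envelope check, assuming a standard 1500-byte underlay MTU over IPv4 (the header sizes are general VXLAN facts, not something specific to this setup):

```shell
# VXLAN over IPv4 adds: outer IP (20) + UDP (8) + VXLAN (8) + inner Ethernet (14)
# bytes of overhead, so the usable MTU inside the overlay shrinks accordingly.
echo $((1500 - 20 - 8 - 8 - 14))   # -> 1450
```

This is why Linux VXLAN interfaces commonly default to an MTU of 1450; if you see fragmentation or stalls inside the overlay, check the MTU of the Pod interfaces first.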

If you run netstat after the commands above, you should see the default VXLAN port listening on each host.

> netstat -ntulp | grep 4789
udp        0      0 0.0.0.0:4789            0.0.0.0:*                           -                   
udp6       0      0 :::4789                 :::*                                -

Create a POD, add VETH interfaces and plug it into the vSwitch

First we start a Pod without network interfaces (besides the lo adapter). After the Pod is started, we have to look up the namespace of the Pod, as we will need it later.

> podman run -d --net=none alpine sh -c 'while sleep 3600; do :; done'

After we have started the Pod without a network, we look up its network namespace. This can be done with the lsns command, and we are interested in the net namespace. The number in the middle, 9444 in this example, is the PID of the process inside that namespace, which we need in the next steps to assign an interface to the Pod (nsenter targets a PID).

> lsns
4026532727 net         2  9444 root            sh -c while sleep 3600; do :; done
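If you prefer to script this lookup, the PID column can be extracted with a little awk. The here-string below just replays the sample lsns line from above (on a live host you would pipe the real `lsns -t net` output instead); the podman inspect alternative at the end is a hint, not something used in the rest of this post:

```shell
# Extract the PID (4th column) of the line belonging to our sleep loop.
# On a live host: lsns -t net | awk '/sleep 3600/ {print $4; exit}'
PID=$(awk '/sleep 3600/ {print $4; exit}' <<'EOF'
4026532727 net         2  9444 root            sh -c while sleep 3600; do :; done
EOF
)
echo "$PID"   # -> 9444

# Alternatively, Podman can report the PID of a container directly:
#   podman inspect --format '{{.State.Pid}}' <container-id>
```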

You can verify the current ip configuration of the Pod with the nsenter command:

> nsenter -t 9444 -n ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever

Now we create the VETH pair. A VETH pair acts like a virtual patch cable: you can plug one side into the vSwitch and the other into the namespace of the Pod [7]. If you create the VETH pair like in the following example, both ends are placed in the default namespace [3].

> ip link add dev podA-ovs type veth peer name podA-netns

And now the fun starts. First, we put the end of our virtual patch cable named podA-netns into the namespace of our running Pod.

> ip link set netns 9444 dev podA-netns

Verify it by running the nsenter command.

> nsenter -t 9444 -n ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
13: podA-netns@if14: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default qlen 1000
    link/ether 0e:19:51:5f:6a:fe brd ff:ff:ff:ff:ff:ff link-netnsid 0

Hooray, our Pod has a network device now! And now, plug the other end of our virtual patch cable into the vSwitch!

> ovs-vsctl add-port ovsbr0 podA-ovs

As you can see in the ip a output above, the VETH device ends are in state DOWN - we can change this from inside the Pod by running an additional nsenter command (nsenter is really powerful):

> nsenter -t 9444 -n ip link set podA-netns up
> ip link set podA-ovs up

Now that both ends of the VETH pair are in UP state, we can assign an IP address to the interface inside the Pod and verify that it is set.

> nsenter -t 9444 -n ip addr add 172.16.0.10/12 dev podA-netns
> nsenter -t 9444 -n ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
13: podA-netns@if14: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 0e:19:51:5f:6a:fe brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 172.16.0.10/12 scope global podA-netns
       valid_lft forever preferred_lft forever
    inet6 fe80::c19:51ff:fe5f:6afe/64 scope link 
       valid_lft forever preferred_lft forever

Next we do the same on the other virtual machine; I will only paste the commands here in summary:

> podman run -d --net=none alpine sh -c 'while sleep 3600; do :; done'
> lsns
4026532727 net         2  5239 root            sh -c while sleep 3600; do :; done
> ip link add dev podB-ovs type veth peer name podB-netns
> ip link set netns 5239 dev podB-netns
> ovs-vsctl add-port ovsbr0 podB-ovs
> nsenter -t 5239 -n ip link set podB-netns up
> ip link set podB-ovs up
> nsenter -t 5239 -n ip addr add 172.16.0.11/12 dev podB-netns
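If you have to repeat this procedure often, the per-Pod steps above can be collected into a small script. The following is a dry-run sketch of my own making, not part of the original procedure: the run() helper only prints each command, so you can review the sequence before removing the echo and executing it for real (as root). The variable values are the examples from this host.

```shell
#!/bin/sh
# Dry-run sketch of the per-Pod attach procedure shown above.
# run() only prints the commands; drop the echo to execute them for real.
run() { echo "+ $*"; }

POD_PID=5239              # PID of the Pod's process (from lsns)
POD=podB                  # name prefix for the VETH pair
POD_IP=172.16.0.11/12     # address of the Pod inside the overlay

run ip link add dev "${POD}-ovs" type veth peer name "${POD}-netns"
run ip link set netns "${POD_PID}" dev "${POD}-netns"
run ovs-vsctl add-port ovsbr0 "${POD}-ovs"
run nsenter -t "${POD_PID}" -n ip link set "${POD}-netns" up
run ip link set "${POD}-ovs" up
run nsenter -t "${POD_PID}" -n ip addr add "${POD_IP}" dev "${POD}-netns"
```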

Check the connection

Afterwards you should be able to ping the other Pods from inside your Pods.

> ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
20: podB-netns@if21: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether a6:2e:fe:d8:f4:c3 brd ff:ff:ff:ff:ff:ff
    inet 172.16.0.11/12 scope global podB-netns
       valid_lft forever preferred_lft forever
    inet6 fe80::a42e:feff:fed8:f4c3/64 scope link 
       valid_lft forever preferred_lft forever
> ping -c 1 172.16.0.10
PING 172.16.0.10 (172.16.0.10): 56 data bytes
64 bytes from 172.16.0.10: seq=0 ttl=64 time=1.070 ms

--- 172.16.0.10 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 1.070/1.070/1.070 ms

Egress and Ingress traffic

OK, our Pods are now able to communicate with each other across hosts via the VXLAN tunnel. This is great, but they cannot talk to the outside world. Therefore we have to set up an additional interface, connected to the Open vSwitch on one side and to our host on the other. Currently the host itself does not know anything about the overlay network we built via the Open vSwitch, because the host does not have an interface on the overlay network, and vice versa.

Egress

On every host we create the same interface combination to allow the host to talk to the vSwitch and to allow traffic to flow to our Pods on the overlay network!

> ip link add pod-gatewayhost type veth peer name pod-gatewayovs
> ovs-vsctl add-port ovsbr0 pod-gatewayovs
> ip link set pod-gatewayovs up
> ip link set pod-gatewayhost up
> ip addr add 172.16.0.1/12 dev pod-gatewayhost

If you check the routing information on the host afterwards, the correct entry for the overlay network will have been created automatically. Also check the vSwitch to see that the port is plugged in.

> ip r
default via 10.200.60.10 dev ens160 proto static 
...
172.16.0.0/12 dev pod-gatewayhost proto kernel scope link src 172.16.0.1
> ovs-vsctl show
0ef0dbc7-2d98-4554-801c-d4fc36f246dc
    Bridge "ovsbr0"
        Port podvxlanA
            Interface podvxlanA
                type: vxlan
                options: {key="5000", psk=swordfish, remote_ip="10.0.0.3"}
        Port pod-gatewayovs
            Interface pod-gatewayovs
        Port podB-ovs
            Interface podB-ovs
        Port "ovsbr0"
            Interface "ovsbr0"
                type: internal
    ovs_version: "2.9.2"

From inside the Pod on this host, you can now ping the gateway:

> ping -c 1 172.16.0.1
PING 172.16.0.1 (172.16.0.1): 56 data bytes
64 bytes from 172.16.0.1: seq=0 ttl=64 time=0.388 ms

--- 172.16.0.1 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.388/0.388/0.388 ms

As we now have a gateway, add a default route to the Pod:

> nsenter -t 5239 -n ip route add default via 172.16.0.1 dev podB-netns

Verify the setup from inside the Pod:

> ip r
default via 172.16.0.1 dev podB-netns 
172.16.0.0/12 dev podB-netns scope link  src 172.16.0.11

From now on it should be possible to ping your host’s IP address - the address of the host this Pod is running on!

> ping -c 1 10.0.0.2
PING 10.0.0.2 (10.0.0.2): 56 data bytes
64 bytes from 10.0.0.2: seq=0 ttl=64 time=0.528 ms

--- 10.0.0.2 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 0.528/0.528/0.528 ms

But we cannot update the Alpine image yet, because there is no routing set up on the host! Therefore enable IP forwarding on the host, if it is not already enabled. Caution: we are not using any iptables rules here, so everything can be forwarded through this host!

> sysctl -w net.ipv4.ip_forward=1
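Regarding the caution above: if you do want to restrict forwarding, rules along the following lines would limit it to traffic between the overlay gateway and the LAN interface. This is an untested sketch (the interface names are the examples from this post), again printed via a dry-run helper so nothing is applied by accident:

```shell
# Dry-run sketch: restrict FORWARD to overlay <-> LAN traffic only.
# run() only prints the commands; drop the echo to apply them (as root).
run() { echo "+ $*"; }

run iptables -P FORWARD DROP
run iptables -A FORWARD -i pod-gatewayhost -o ens160 -j ACCEPT
run iptables -A FORWARD -i ens160 -o pod-gatewayhost \
    -m state --state ESTABLISHED,RELATED -j ACCEPT
```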

Next, set a DNS server inside the Pod - if needed, use your on-premise DNS server here!

> echo 'nameserver 8.8.8.8' > /etc/resolv.conf

And last but not least, we have to enable IP masquerading [8], as otherwise communication back from the external network to the Pod is not possible:

> iptables -t nat -A POSTROUTING -o ens160 -j MASQUERADE 
> iptables -t nat -L -n -v

And finally, try an apk update inside the Pod!

> apk update
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/main/x86_64/APKINDEX.tar.gz
fetch http://dl-cdn.alpinelinux.org/alpine/v3.10/community/x86_64/APKINDEX.tar.gz
v3.10.2-80-g68e4e4a13a [http://dl-cdn.alpinelinux.org/alpine/v3.10/main]
v3.10.2-80-g68e4e4a13a [http://dl-cdn.alpinelinux.org/alpine/v3.10/community]
OK: 10336 distinct packages available

Congratulations - egress traffic is established. Now do the same on the second host for the other Pod. Remember to change the used PID in the nsenter commands!

Ingress

OK, and now the last part: how do we get traffic to the Pods? Docker uses docker-proxy for this task, Podman normally relies on its network stack for port forwarding, and Kubernetes uses kube-proxy by default. We started our Podman Pod without a network, so nothing is forwarded from the host machine to the Pod. But there is a cool little tool called tcpforward [9]. You can simply git clone the repository from GitHub and run a go build on it.

To enhance the current setup, we add a second IP address to our LAN interface. This allows us to use multiple IP addresses on one host, and therefore to reuse the SSL port multiple times later in an enhanced setup!

> ip addr add 10.0.0.5/16 dev ens160

Then we run the compiled binary to forward TCP traffic through the newly created second IP address to one of our Pods:

> ./tcpforward -l 10.0.0.5:1234 -r 172.16.0.11:9000
tcpforward: : 2019/09/27 12:15:35 Accepted connection from 10.0.0.250:51020
tcpforward: : 2019/09/27 12:15:42 Forwarding from 10.0.0.5:1234 to 172.16.0.11:9000

And there you are! Start nc -l -p 9000 inside the connected Pod and run nc 10.0.0.5 1234 somewhere on your network. Type in some characters and see them echoed inside the Pod!

YES! 🎉

Summary and outlook

As you can see, the result shown isn’t exactly what you saw in my sketch at the beginning of this post, but you can easily extend the shown solution by yourself. Just take the commands used during this simpler setup and span it across a third node. The post shows the technical background of how Docker Swarm overlay networks, and overlay networks in general, can be built. Podman is a cool solution, but with the lack of a built-in overlay network solution it is only a starter kit. With the information from this post you could build a Kubernetes-like solution on your own (if you take the time) 🤣

I would like to say thank you to all of the people out on the internet who are sharing their experiences - without them a lot of information would be still not available today.

If you like, reach out to me on Twitter!

Mario

Posted on: Wed, 02 Oct 2019 00:00:00 UTC by Mario Kleinsasser

  • Podman
  • Overlay
  • VXLAN
Mario Kleinsasser
Doing Linux since 2000 and containers since 2009. Like to hack new and interesting stuff. Containers, Python, DevOps, automation and so on. Interested in science and I like to read (if I found the time). Einstein said "Imagination is more important than knowledge. For knowledge is limited." - I say "The distance between faith and knowledge is infinite. (c) by me". Interesting contacts are always welcome - nice to meet you out there - if you like, do not hesitate and contact me!

My Paper Notebook Part 1

Estimated reading time: 3 mins

Currently I am writing a larger blog post on the topic Podman with VXLAN Overlay Network - the definitive guide. I think I will finish it during this week, but today I would like to share some experience with paper notebooks. You know, these are the kind of books with a lot of empty pages inside, where you have to write with a pen onto sheets of paper 😂

Nevertheless, I love real physical notebooks. They help me to focus on my thoughts and they help me to remember things more easily. And I love this empty white space to build out ideas and concepts.

This post is named Part 1 and maybe there will be a part two - we will see. The pictures on the left side are scans of my sketches for the upcoming blog post.

My paper notebook

I am using an XL (19x25 cm) black softcover Moleskine notebook with plain (blank) pages. The quality of the pages is really good, but you should not write on them with ink or ballpoint pens - your writing might show through on the back of the page. It is best used with simple pencils - normal or, if you like, colored pencils. Maybe I will buy a Leuchtturm notebook next time, because they offer numbered pages.

Usage Tip 1 - Don’t use it for work items

At first I also used my notebook for work items. For me, this was a bad idea. Every time I opened my notebook to write down some project thoughts, I saw my open work tasks popping up. I was immediately distracted - the thought was often gone. Therefore I stopped using it for work tasks. Work tasks are work tasks and creative thinking is creative thinking.

Usage Tip 2 - Number your pages

At some point while working with your notebook you will want to write an index so you can find your entries again. A notebook has somewhere between 120 and 180 pages and you will use it for 24 months or more, so numbered pages will help you find things. 🧐 You can find some really good tips about numbering pages here.

Usage Tip 3 - How to reference internet URLs on paper?

I came up with this one myself 😎! I am using a URL shortener - one with the shortest URLs - https://u.nu. If you look at the featured image of this post, you will notice some rectangles. Inside these rectangles are alphanumeric codes - the hashes for the URL shortener. For example, the combination wkd- stands for https://u.nu/wkd- and leads to https://blog.oddbit.com/post/2014-08-11-four-ways-to-connect-a-docker/ - simple and efficient!

Conclusion

Try it yourself and have fun sketching your ideas!

Mario

Posted on: Mon, 30 Sep 2019 00:00:00 UTC by Mario Kleinsasser

  • Paper Notebook
  • Culture

Kubernetes and Weave - a no go per default

Estimated reading time: 5 mins

The default Weave deployment for Kubernetes is not as secure as it should and could be. What does this mean? It means that if you set up a Kubernetes cluster via kubeadm and install the Weave Works overlay network via the recommended installation guide, any host that installs Weave manually can join this overlay network!

Details

We review Kubernetes every couple of months, because there is some (small) uncertainty about whether Docker Swarm will stay alive or not. It is always good to know the alternatives, like Kubernetes, and how they would fit into an existing environment. Docker Swarm is unmatched in its simplicity to manage - we run more than 2300 containers today - but it is not as hyped as Kubernetes. So we have to stay up to date on how we could use Kubernetes to replace Docker Swarm if we have to. And every single time we take a look at it, our review boils down to two main problems: networking and storage. Today we will focus on the network integration, CNI in detail.

There are a lot of Kubernetes CNI plugins out there, but only some of them are widely used: Calico, Weave and Flannel. Weave is, of course, a nice solution to get up and running with CNI capabilities inside Kubernetes easily. But there is a major drawback.

The problem

If you follow the default installation guide for Weave within Kubernetes, no password is used to protect your overlay network against suspicious network peers!

In the following output, the host atlxkube474 is not part of the Kubernetes cluster. But it can join the created Kubernetes Weave network easily by specifying one of the main peers during weave launch.

Kubernetes cluster nodes:

[19:52 atlxkube471 ~]# kubectl get nodes
NAME          STATUS   ROLES    AGE   VERSION
atlxkube471   Ready    master   12d   v1.15.3
atlxkube472   Ready    <none>   11d   v1.15.3
atlxkube473   Ready    <none>   11d   v1.15.3

Suspicious Weave host:

[19:54 atlxkube474 ~]# kubeclt

Command 'kubeclt' not found, did you mean:

  command 'kubectl' from snap kubectl (1.15.3)

See 'snap info <snapname>' for additional versions.

[19:55 atlxkube474 ~][127]# weave launch 10.x.x.1
... truncated output ...
INFO: 2019/09/16 17:55:22.483698 sleeve ->[10.x.x.2:6783|c6:cf:11:33:a1:ca(atlxkube472)]: Effective MTU verified at 1438
[19:56 atlxkube474 ~]# eval $(weave env)
[19:56 atlxkube474 ~]# weave status

        Version: 2.5.2 (up to date; next check at 2019/09/17 01:10:02)

        Service: router
       Protocol: weave 1..2
           Name: da:09:b8:ee:77:3e(atlxkube474)
     Encryption: disabled
  PeerDiscovery: enabled
        Targets: 1
    Connections: 3 (3 established)
          Peers: 4 (with 12 established connections)
 TrustedSubnets: none

        Service: ipam
         Status: ready
          Range: 10.32.0.0/12
  DefaultSubnet: 10.32.0.0/12

        Service: dns
         Domain: weave.local.
       Upstream: 10.x.x.50, 10.x.x.51, 10.x.x.52
            TTL: 1
        Entries: 0

        Service: proxy
        Address: unix:///var/run/weave/weave.sock

        Service: plugin (legacy)
     DriverName: weave

And now anyone can run a container that joins the single Weave overlay network and do whatever they want:

[20:00 atlxkube471 ~]# kubectl get pods -n deployment-v1 -o wide
NAME                                READY   STATUS    RESTARTS   AGE    IP          NODE          NOMINATED NODE   READINESS GATES
nginx-deployment-5754944d6c-z486x   1/1     Running   0          6d9h   10.44.0.1   atlxkube472   <none>           <none>

[19:59 atlxkube474 ~]# docker run --name a1 -ti weaveworks/ubuntu
root@a1:/# ping 10.44.0.1
PING 10.44.0.1 (10.44.0.1) 56(84) bytes of data.
64 bytes from 10.44.0.1: icmp_seq=1 ttl=64 time=2.00 ms
64 bytes from 10.44.0.1: icmp_seq=2 ttl=64 time=0.672 ms
^C
--- 10.44.0.1 ping statistics ---
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
rtt min/avg/max/mdev = 0.672/1.337/2.002/0.665 ms
root@a1:/#

This example shows how really, really easily any host can join the Weave overlay network! This is not a secure-by-default design!

Fix it

I know that it is possible to set a password for Weave, which is used to encrypt the network traffic and to prevent unknown hosts from joining the Weave overlay network created by Kubernetes. This is described here.

Let’s do this for our Kubernetes installation right now. Thanks to summerswallow-whi for opening a kOps issue which already addresses this. The issue is still open (May 2018) 🙁, but it provides a lot of information on how you can harden your Weave setup.

I tried it on my own and the following steps are enough to bring some kind of protection to your Weave overlay.

First, create a password for your Weave overlay and save it to a file:

# < /dev/urandom tr -dc A-Za-z0-9 | head -c16 > weave-password

Now create a Kubernetes secret:

# kubectl create secret -n kube-system generic weave-password --from-file=./weave-password

Add this setting to the Weave Kubernetes daemonset by editing it (under the weave-net container spec):

kubectl edit daemonset weave-net -n kube-system
...
  template:
    metadata:
      creationTimestamp: null
      labels:
        name: weave-net
    spec:
      containers:
      - command:
        - /home/weave/launch.sh
        env:
        - name: HOSTNAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: spec.nodeName
        - name: WEAVE_PASSWORD
          valueFrom:
            secretKeyRef:
              key: weave-password
              name: weave-password
        image: docker.io/weaveworks/weave-kube:2.5.2

If you now try to join the Weave overlay network, you will see the following failure:

[20:19 atlxkube474 ~]# weave launch 10.x.x.1
24ef2192c30e3c5d9372469eb1fb456e348cdd9efe4b8cda27c3ba1e756ba73c
[20:19 atlxkube474 ~]# docker logs weave
...
INFO: 2019/09/16 18:19:50.677943 ->[10.x.x.1:6783] connection shutting down due to error during handshake: no password specificed, but peer requested an encrypted connection

Yes! This is what we want! And nothing less!

Conclusion

Weave uses, in my opinion, an insecure default setup. This violates GDPR article 25, at the least. Encryption is a default today! This is one of the points where Docker Swarm is much, much better: Docker Swarm creates a VXLAN overlay network for each service by default (not just one single overlay for everything like Weave does)! 😎 Furthermore, you cannot(!) join a Docker Swarm without knowing the join token, and therefore you cannot infiltrate an existing Docker Swarm overlay network! It is secure by default! No additional screws needed!

It is hard to believe, but even Weave Works itself does not provide documentation on how to harden your Kubernetes Weave setup (Kubernetes Secrets, Kubernetes Daemonset, …). Even the current edition of the book Kubernetes: Up and Running, Second Edition does not mention anything like this.

I don’t want to know how many insecure setups are out there - hopefully everyone trusts his or her own network. 🙄

Sometimes we think that the whole industry is just fueling the Kubernetes hype to sell more and more consulting services…

But, let’s see…

Posted on: Mon, 16 Sep 2019 01:39:06 +0200 by Mario Kleinsasser

  • Kubernetes

Multi-Project Pipelines with GitLab-CE

Estimated reading time: 9 mins

This year at the DevOps Gathering 2019 conference in Bochum, Alexander and I met Peter Leitzen, a backend engineer at GitLab, and together we chatted about our on-premises GitLab-CE environment and how we run GitLab multi-project pipelines without GitLab-EE (GitLab Enterprise Edition). We promised him that we would write a blog post about our setup, but as so often, it took some time until we were able to visualize and describe it - sorry! But now, here we go.

We will not share our concrete implementation here, as it makes little sense - everyone has a different setup, different knowledge or uses a different programming language. Nevertheless we will describe what we are doing (the idea) rather than how we are doing it - you can use whatever programming language you like to communicate with the GitLab API (because it doesn’t matter).

Some background story

We have a lot of projects in our private GitLab and of course a lot of users - at the time of writing approximately 1500 projects and around 400 users, as we have been using GitLab for more than five years. It is not only developers: colleagues who just want to version their configuration files use it too, and much more. With GitLab-EE it is possible to run multi-project pipelines - but this is a premium feature (Silver level) which costs $19 per GitLab user per month. Only some of our 400 GitLab users need multi-project pipelines, but sadly there is no way to subscribe just some users to GitLab Premium. 🙄
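Just as an illustration of the cost argument (my back-of-the-envelope numbers, not an official quote):

```shell
# Licensing all 400 users at 19 USD per user per month:
echo "$((400 * 19)) USD per month"   # -> 7600 USD per month
```

That is a lot of money when only a handful of teams actually need the feature.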

Back then we were sure (and it is still true today) that we would need multi-project pipelines for our new Docker Swarm environments (we started more than two years ago). Together with the developers we decided that we would like to have classic source code Git repositories and separate deploy Git repositories, for various reasons.

You can imagine the latter as the marriage of a car’s body and its chassis - after the pipeline run you will have the sum of the source code and the deploy repository: the container image.

The idea

Let’s show the idea (click on the picture to the left). We have numbered the individual steps to explain them in more detail. Let’s start with some additional information regarding the environment in general.

Overview

On the left side of the picture is our GitLab server installation, which is responsible for all the repositories. We have been fully committed to a 100% GitOps approach since we started our container journey. Nothing happens to our applications within our Docker Swarm environments without a change in the corresponding Git repository. Later you will see that we have built something with some similarities to Kubernetes Helm, but much simpler. 😂 Let’s dive into the details.

One

Point one shows a public(!) repository which we use as the source for the GitLab CI/CD templates. GitLab introduced CI/CD templates with the release of GitLab v11.7 and we use them. Inside the templates Alex has built up a general structure for the different CI/CD stages. For example, the .gitlab-ci.yml file from one of our real-world projects just looks like this:

include: 'https://<our-gitlab-server>/public-gitops-ci-templates/gitops-ci-templates/raw/master/maven-api-template.yml'

image: <our-gitlab-registry>/devops-deploy/maven:3.6-jdk-12-se3

Yes, you read that right! That’s all! That’s the whole .gitlab-ci.yml of a productive Git project. All the magic is centrally managed by public GitLab templates. In this case we are looking at one of our developers’ projects. The nice thing about GitLab templates is that values can be overwritten, like the image variable in this example! Simple but not complex - and you can always take a look into the template to know what is going on, because it is not hidden abstraction magic. By the way: there is absolutely no sensitive information in the templates! Everything that is project specific is managed by the affected project itself. That’s the link to point number two.
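To make the override mechanism concrete, an included template like maven-api-template.yml could look roughly like the following sketch. This is heavily simplified and made up for illustration - stage names, the default image and the script lines are not our real template:

```yaml
# maven-api-template.yml - simplified sketch, not the real template.
# Projects that include this file can override top-level keys such as `image`.
image: <our-gitlab-registry>/devops-deploy/maven:default  # default, overridable per project

stages:
  - build
  - deploy

build:
  stage: build
  script:
    - mvn package

.deploy-template: &deploy-template
  script:
    - echo "central deploy logic lives here"

deploy:
  stage: deploy
  <<: *deploy-template
```

A project’s two-line .gitlab-ci.yml (as shown above) pulls all of this in via `include:` and only redefines `image`.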

Two

As shown, the .gitlab-ci.yml in the source code Git repository imports one of the global templates which are centrally maintained. And as said before, all relevant parameters for the templates are provided by the source code repository. There are a lot of secret variables there, and therefore Alex made a special pipeline which bootstraps a new Git project if needed. This is not part of this post because it is very specific and depends on your needs. But it is important to know (and for the explanation) that during the creation of a new source code project, two GitLab project variables called SCI_DEPLOY_PRIVATE_KEY and SCI_DEPLOY_PRIVATE_KEY_ID are created and initialized.

Now, the GitLab CI/CD pipeline of the source repository is transmitted to the GitLab Runner and the job is started. The job itself uses a self-made service which kicks off the GitLab Multi-Project Pipelines functionality. Therefore, we have to head over to number three - the self-written GitLab Helper Rest Service.

Three (Part 1)

In the step before, the GitLab CI/CD pipeline of the source repository was triggered, and at this point we imagine that the deploy job from the source repository is called. The deploy stage is part of the .gitlab-ci.yml template mentioned before. Now, in the template the following happens (this is the hardest part of all):

First of all, a SCI_JWT_TOKEN is generated to ensure a solid level of trust. The communication between the pipeline runner and our GitLab Helper Rest Service is TLS encrypted, but we would like to be sure that only valid calls are processed. Take a look at the script line below. This is the point where the SCI_DEPLOY_PRIVATE_KEY comes back again. The little tool /usr/local/bin/ci-tools is part of the Docker image that is used during the GitLab deploy pipeline run. It is self-written and does nothing more than generate and sign a JWT token. It is written in Go - simple and efficient. The tool needs some important parameters.

In summary, we have a signed JWT token which includes the number of the currently running job and the ID of the specific project.

...
.deploy-template: &deploy-template
  script:
    ...
    - export SCI_JWT_TOKEN=$(echo "$SCI_DEPLOY_PRIVATE_KEY" | /usr/local/bin/ci-tools crypto jwt --claims "jobId=$CI_JOB_ID" --claims "projectId=$CI_PROJECT_ID" --key-stdin)
...
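The ci-tools binary is not public, but what it does can be sketched in a few lines of Python. Note that this is only an illustrative sketch: it signs with an HMAC secret (HS256), while the real tool reads the RSA deploy key from stdin, and the claim values here are made up:

```python
import base64
import hashlib
import hmac
import json


def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as the JWT spec requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def sign_jwt(claims: dict, secret: bytes) -> str:
    """Build a signed JWT (HS256 for illustration; ci-tools signs with the deploy key)."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    signature = b64url(hmac.new(secret, signing_input, hashlib.sha256).digest())
    return f"{header}.{payload}.{signature}"


# Same claims as in the pipeline snippet above (values are placeholders)
token = sign_jwt({"jobId": 4711, "projectId": 42}, b"not-the-real-deploy-key")
```

The service on the other end can verify the signature and then trust the jobId/projectId claims embedded in the token.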

After the SCI_JWT_TOKEN is generated, it is used to call the GitLab Helper Rest Service via curl (TLS encrypted). Please note the deployKey REST method in the REST call. It is also important to see that a variable called SCI_DOCKER_PROJECT_ID is used as a REST parameter. The variable SCI_DOCKER_PROJECT_ID references the Docker deploy project - this is number four in our overview! The GitLab Helper Rest Service now creates and enables a GitLab deploy key if it isn’t already enabled there. That’s the trick to enable GitLab Multi-Project Pipelines automatically! The GitLab Helper Rest Service verifies that the jobId=$CI_JOB_ID transmitted with the JWT token is valid by looking up the CI_JOB_ID via the GitLab API.

...
    - curl -i -H "Authorization:Bearer $SCI_JWT_TOKEN" -H "Consumer-Token:$SCI_INIT_RESOURCE_REPO_TOKEN" -X POST "${SCI_GITLAB_SERVICE_URL}/deployKey/enable/${SCI_DOCKER_PROJECT_ID}/${SCI_DEPLOY_PRIVATE_KEY_ID}" --fail --silent --connect-timeout 600
...
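On the service side, enabling the key boils down to one call against the official GitLab API (POST /projects/:id/deploy_keys/:key_id/enable). A minimal Python sketch of that part - the JWT verification is omitted, and the host name and token are placeholders, not our real setup:

```python
import urllib.request

GITLAB_URL = "https://gitlab.example.com"  # placeholder host


def enable_deploy_key_request(project_id: int, key_id: int,
                              admin_token: str) -> urllib.request.Request:
    """Build the GitLab API request that enables an existing deploy key on a project."""
    url = f"{GITLAB_URL}/api/v4/projects/{project_id}/deploy_keys/{key_id}/enable"
    return urllib.request.Request(
        url,
        method="POST",
        # Admin permissions are needed here - the price mentioned below
        headers={"PRIVATE-TOKEN": admin_token},
    )


req = enable_deploy_key_request(project_id=42, key_id=7, admin_token="secret")
# urllib.request.urlopen(req) would send it; skipped here
```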

Sadly, the GitLab Helper Rest Service has to have administrative global permissions inside the GitLab installation to handle these tasks - that’s the price to pay.

But we are not finished yet - here comes more cool stuff!

Three (Part 2)

Now the source code repository pipeline has done the setup for the deploy pipeline. Based on the template variable configuration, the .gitlab-ci.yml for the deploy repository pipeline is automatically generated! You can imagine this as some kind of Kubernetes Helm, just for GitLab and simply integrated into the GitLab repositories - GitOps! Everything is stored inside the repositories! The last thing to do is to push the generated config into the deploy repository. This is easy, because the GitLab deploy key was set up just before. 😇

And now some additional cool magic. Due to the commit into the deploy repository in the last step, the GitLab Helper Rest Service would be able to just run the automatically created pipeline, but then we would lose the information about who triggered it. Therefore, an additional GitLab Helper Rest Service REST call is issued (see below). This one reads out the user who created the jobId=$CI_JOB_ID in the source code repository. After this is done, the GitLab Helper Rest Service impersonates exactly this user, creates the deploy pipeline in the deploy repository as this user and runs it 😎 - the deploy pipeline runs as the same user as the pipeline in the source code repository. Nice!

...
    - curl -i -H "Authorization:Bearer $SCI_JWT_TOKEN" -H "Consumer-Token:$SCI_INIT_RESOURCE_REPO_TOKEN" -X POST "$SCI_GITLAB_SERVICE_URL/pipeline/create/$SCI_DOCKER_PROJECT_ID" --data "branch=$SCI_DOCKER_SERVICE_BRANCH_NAME" --fail 
...
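The impersonation itself is standard GitLab API functionality: an administrator token may add a Sudo header (or sudo parameter) to act on behalf of another user. A hedged Python sketch of the pipeline-creation call - host, token and user name are placeholders:

```python
import urllib.parse
import urllib.request

GITLAB_URL = "https://gitlab.example.com"  # placeholder host


def create_pipeline_request(project_id: int, ref: str, admin_token: str,
                            run_as: str) -> urllib.request.Request:
    """Build POST /projects/:id/pipeline, impersonating `run_as` via the Sudo header."""
    query = urllib.parse.urlencode({"ref": ref})
    url = f"{GITLAB_URL}/api/v4/projects/{project_id}/pipeline?{query}"
    return urllib.request.Request(
        url,
        method="POST",
        headers={
            "PRIVATE-TOKEN": admin_token,  # admin token of the helper service
            "Sudo": run_as,                # pipeline is created as this user
        },
    )


req = create_pipeline_request(42, "master", "secret", run_as="alice")
# urllib.request.urlopen(req) would send it; skipped here
```

Because the Sudo call fails if the impersonated user has no access to the target project, the membership check described below comes for free.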

In addition, this is also a security feature, because the user who runs the source code pipeline must also be a member of the deploy repository. Otherwise the pipeline cannot be created by the GitLab Helper Rest Service. This enables us to have developers who are able to push to the source code repository but are not able to run deploys - simply because they are not members of the deploy repository.

Four

This is the deploy Git repository. It is used to run the deploys and is part of our way of running GitLab Multi-Project Pipelines.

Five

The deploy pipeline takes the Docker Swarm config, which was generated by the source code pipeline run and pushed to the deploy repository by our GitLab Helper Rest Service (like Kubernetes Helm), and updates the Docker Swarm stack (and Docker services).

Conclusion

This blog post gives you an idea of how to create Multi-Project Pipeline functionality with only GitLab-CE (on-premises). The idea enables you to have multi-project pipelines without an Enterprise subscription.

The cost of this is that you have to create an external service which has administrative GitLab permissions. Warning: do not write such a service if you are unfamiliar with how to do it in a secure manner!

Hopefully GitLab will enable Multi-Project Pipelines for free for GitLab-CE users in the future!

If you have questions or would like to say thank you, please contact us! If you like this blog post, please share it! Our bio-pages and contact information are linked below!

Alex, Mario

Posted on: Sun, 15 Sep 2019 01:39:06 +0200 by Alexander Ortner , Mario Kleinsasser

  • GitLab
Alexander Ortner
Alexander O. Ortner is the team leader of a software development team within the IT department of the STRABAG BRVZ construction company. Before joining STRABAG SE he obtained a DI from the Department of Applied Informatics, Klagenfurt, in 2011 and another DI from the Department of Mathematics, Klagenfurt, in 2008. He has been a software engineer for more than 10 years and, besides the daily business, is mainly responsible for introducing new, secure, cloud-ready application architecture technologies. He is furthermore a contributor to the introduction of a fully automated DevOps environment for a high diversity of different applications.
Mario Kleinsasser

Why Open Source is great - Fix an issue in under 48 hours

Estimated reading time: 4 mins

This is a follow-up post to Reviving a near-death Docker Swarm cluster, where we showed that a Docker Swarm cluster can be hurt badly if DNS does not work (because of a storage hiccup). Therefore it was obvious that we had to enable caching in our coreDNS servers.

A short recap of the situation: we use coreDNS with ETCD as the storage backend for the DNS records. This is a common use case - it is the same as in Kubernetes. We use the same concept as Kubernetes does, but for slightly different purposes, because ETCD has an easy-to-use API. We started to use coreDNS way before it came to Kubernetes as a DNS service. We also helped to implement the APEX records and did some bug triage in the past.

Enabling caching in coreDNS is simple, just add the cache statement to the Corefile as documented in the plugin.
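For reference, a minimal Corefile sketch with caching enabled - the zone name and the ETCD endpoint are placeholders, not our real configuration:

```
example.org {
    etcd {
        path /skydns
        endpoint http://127.0.0.1:2379
    }
    cache 30
}
```

The `cache 30` line tells coreDNS to cache answers for this zone for up to 30 seconds.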

The problem

So yes, we enabled caching, and some minutes later our monitoring system showed several systems which were no longer able to complete their Puppet agent run. This happened on Tuesday afternoon around 3pm. After the monitoring alerted us to the problem, we already guessed that it had something to do with the cache we had enabled in our coreDNS instances shortly before. A rollback would have been possible without any problems, because we run coreDNS inside containers: the image is built via GitLab CI/CD and the Docker run is issued by Puppet on the given hosts. So a rollback is pretty easy! But we didn’t roll back, because only some of our hosts had a DNS resolve error - the rest (hundreds) were running fine!

Analyzing the problem

We suspended the rollback to a previous Corefile (coreDNS config) and took a closer look at the affected hosts. Shortly after, we knew that only older Linux OSes were hit by this problem. Bernhard started to search the internet for this specific problem, because we also got the following log output from an rsync run (and a similar one from the Puppet agent):

rsync: getaddrinfo: rsync.example.com 873: No address associated with hostname
rsync error: error in socket IO (code 10) at clientserver.c(122) [sender=3.0.9]

We found two GitHub issues - this one in coreDNS and this one in Kubernetes - and in addition a Stack Overflow post.

OK, there was something strange going on with some old clients. We decided to share our information inside the issues above. You can read the issues and the pull requests if you would like the full details. In short, after we chatted on GitHub, I mailed privately with Miek Gieben, one of the coreDNS maintainers, to share some tcpdumps with him. DNS is really something you don’t want to mess around with that deeply. It’s ugly, and I feel great respect for those who work in this field, like Miek does! Kudos to all of you!

The result

After chatting via e-mail we quickly came to the conclusion that the switching of the authoritative/non-authoritative flag - that is one(!) single bit in the header of the UDP datagram of a DNS query response - confuses older clients, because at first they get an authoritative answer and on the next query (within the TTL of the record) they get a non-authoritative answer. Some older DNS client code trips up at this point.
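The bit in question is the AA (Authoritative Answer) flag, bit 10 of the 16-bit flags field in the DNS header (mask 0x0400, per RFC 1035). A small Python sketch with hand-built headers shows how little separates the two responses that confused the old clients:

```python
import struct

AA_MASK = 0x0400  # Authoritative Answer bit in the DNS header flags field


def is_authoritative(header: bytes) -> bool:
    """Check the AA bit in a raw 12-byte DNS header (6 big-endian 16-bit words)."""
    _id, flags, _qd, _an, _ns, _ar = struct.unpack("!6H", header[:12])
    return bool(flags & AA_MASK)


# QR=1 (response) with AA set -> flags 0x8400; the cached answer clears AA -> 0x8000
authoritative_hdr = struct.pack("!6H", 1, 0x8400, 1, 1, 0, 0)
cached_hdr        = struct.pack("!6H", 1, 0x8000, 1, 1, 0, 0)
```

Both headers are otherwise identical - exactly the one-bit difference described above.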

Miek provided a PR for this; I opened up an issue, and on Thursday morning I did a manual build of coreDNS including this PR, and everything worked fine. As a mitigation in between, Bernhard rolled out a hosts entry for our Puppet master domain on all affected hosts - thanks! But some hosts with quite old software were still affected; therefore the PR works much better.

Thank you!

We would like to say “Thank you!” to all who work on Open Source, and in this case especially to Chris O’Haver, Miek Gieben and Stefan Budeanu! This shows why Open Source is unbeatable when it comes to problems. Of course you have to know how to work together, like we did in this case, but you have the opportunity to do it! Don’t be afraid - try it! Getting a fix for a problem within 48 hours is absolutely impressive and stunning! I am sure that this would not be possible with closed source.

Posted on: Fri, 14 Jun 2019 04:39:06 +0200 by Mario Kleinsasser

  • Culture
Mario Kleinsasser