Site Reliability Engineering
The two-week holiday vacation is over, we are back at work, and the Docker Swarms ran fully unattended without a single outage during this period! Reliability may mean something different to every IT engineer, because it depends on the goals you have to achieve with your team.
At the same time last year we ran roughly 450 containers in our Docker Swarms; this year we already had more than 1500 containers. That is more than three times as many as last year.
For me, the Yin and Yang symbol is not the worst symbol to reflect the idea of Site Reliability Engineering, because there is always a trade-off you have to accept between the ultimately reliable system and the infinite time it would take to implement such a system. Different and often completely contrary needs therefore have to work together to create a system that still fulfills all of them, like Yin and Yang.
The monitoring and alerting system observed the Docker Swarms autonomously, and today we reviewed the data it tracked. The only thing that happened was the failure of a single Docker worker node, which rebooted. The Docker Swarm automatically started the missing containers on the remaining Docker workers, and nothing more happened.
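As a rough illustration of the kind of check such a review could include (the actual monitoring stack is not described here), the following minimal sketch flags services whose running replica count is below the desired count. It assumes the input is the text printed by `docker service ls --format "{{.Name}} {{.Replicas}}"`, with lines such as `web 3/3`; the function name `under_replicated` and the sample service names are hypothetical.

```python
def under_replicated(service_ls_output: str) -> list[str]:
    """Return the names of services running fewer replicas than desired.

    Assumes each line looks like "<name> <running>/<desired>", optionally
    followed by extra text, which is ignored.
    """
    degraded = []
    for line in service_ls_output.splitlines():
        parts = line.split()
        if len(parts) < 2 or "/" not in parts[1]:
            continue  # skip blank or unexpected lines
        name = parts[0]
        running, desired = parts[1].split("/", 1)
        if not (running.isdigit() and desired.isdigit()):
            continue  # skip lines whose counts are not plain integers
        if int(running) < int(desired):
            degraded.append(name)
    return degraded


if __name__ == "__main__":
    # Hypothetical sample output, not data from the real Swarms.
    sample = "web 3/3\napi 1/2\nworker 5/5\n"
    print(under_replicated(sample))
```

After a worker node fails and the Swarm has rescheduled its containers, a check like this would be expected to return an empty list again.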
I think that we did a great job, because we had two full vacation weeks without any stress. The Docker Swarm did not break, all services were up and running the whole time, and the system handled a failing Docker worker as expected.
Times like these are always exciting because they prove whether the systems keep working even when nobody is watching them. Happy new year and happy hacking!