
Planned Maintenance

Simon Woodhead

26th February 2026

As well as the “annual price increase” emails, my inbox is busy at the moment with competitors planning maintenance. We monitor this to enhance our support (so we have some insight into why when ported numbers stop working!) and also to ensure we don’t plan anything disruptive that coincides – we recognise customers use multiple carriers and we don’t want to be responsible for breaking that planned resilience.

Rather like so-called “RFOs”, I find these notifications really insightful, as they demonstrate quite a few things about other people’s priorities and actual resilience – one can infer from them how things are set up.

Firstly, there is the operator who does maintenance very rarely. In one case, they only seem to do it when an SSL certificate needs updating. That is OK, but if that is all they do maintenance for, how are systems being kept patched and up to date? How are bugs being fixed? How is basic compliance with the Telecoms Security Regulations being achieved? If they schedule an outage for something as trivial as updating an SSL certificate (something long since automated here), then I assume CI/CD (continuous integration/continuous deployment) is as foreign a term as “uptime”. That is scary.

Then there is the operator who schedules maintenance to deploy patches and new features. This is more positive than a once-a-year SSL update being the only change, but it is still odd. These notifications tend to cover multiple services, suggesting that the whole idea of microservices and discrete units of interoperable code is a foreign concept. More to the point, why do these changes cause an outage at all, planned or otherwise? That suggests these services, as well as being critically co-dependent, are a massive single point of failure (SPoF). We have dozens of instances of every micro-service all around the network, and multiple Availability Zones (AZs), each of which can provide full service. We routinely roll updates around these, and we don’t need to plan an outage because traffic migrates to working instances as others are updated. The idea of having just one, and not being able to migrate load away from it even manually, is pretty terrifying.
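The rolling-update approach can be sketched in a few lines – purely illustrative, with made-up instance names rather than anything from our network – to show why no outage needs scheduling when every service has multiple instances:

```python
# Sketch of a rolling deploy: with several instances per service, take one
# out of rotation, update it, health-check it, return it to service, repeat.
# At every step traffic has somewhere healthy to go, so there is no outage
# to plan. Instance names and version strings are invented for illustration.

def rolling_update(pool: dict[str, str], new_version: str):
    """Update each instance in turn; yield who is serving traffic at each step."""
    for name in list(pool):
        serving = [i for i in pool if i != name]   # drain this instance
        assert serving, "never update the last healthy instance"
        yield name, serving
        pool[name] = new_version                   # redeploy + health-check

pool = {"az1-a": "v1", "az1-b": "v1", "az2-a": "v1"}
for updating, serving in rolling_update(pool, "v2"):
    print(f"updating {updating}, traffic on {serving}")
print(pool)  # every instance now on v2, with service available throughout
```

The invariant in the `assert` is the whole point: an operator with one instance per service fails it immediately, which is exactly why they end up emailing customers about planned outages instead.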

This is affirmed by notifications regarding “server cable patching”. Are they seriously running multiple services on a single server with no active standby? Even this blog has a few dozen standby instances, let alone our APIs and production voice stack. Granted, “the server” might not be a single instance – it could be a blade chassis running a virtualisation cluster – but even so, the fact that so many services depend on one cable in one location, and that even if a standby exists, using it is so hard and so rarely practised that scheduling an outage is preferable, is pretty terrifying to me.

We know a number of competitors rely on virtualisation, often VMware. This was great in the early noughties because it offered easy features such as migration of load between physical tin and, at the higher end, fault tolerance which could restart VMs on new tin if the running load failed. We made good use of all these features, but it was stressful because there was so much shared fate, be it the shared storage (something I have PTSD from!), the shared vCenter controller or simply the shared physical location. The biggest issue, though, was that one was still managing a “server”, virtual or otherwise, with all the issues that brings, such as patching and ensuring config is consistent. I often tell the story of the customer whose entire business ran on a single VM with us in one AZ and who, after much persuasion, ordered another one in another AZ. Our intention and wish was that he use the second as a hot standby, but instead he simply used the new one for new customers, creating two SPoFs rather than eliminating one, and creating two servers he couldn’t patch or update without planning an outage for his customers. That is scarily easy to do in a virtualised stack without significant discipline. I’d suggest anyone using virtualisation in 2026 for their platform – or worse, aspiring to use it more – is not disciplined, and there is a train-wreck of stacked SPoFs and shared fate waiting to happen.

We abandoned virtualisation in about 2015 in favour of containerisation. It meant we could run load directly on tin (improving performance), and the container images were small and discrete, with automated build and testing. Config mismatches were impossible because config could not be changed in production – the image was changed, tested and redeployed. This imposed discipline, and that control was far and away the biggest win of the project. SPoFs and shared fate were eliminated by us having literally dozens of every element deployed to every AZ. Security was improved not only by continuous patching of the image but by each image having the ability to push its own firewall rules to the edge of the network (and beyond, if/where supported), those rules themselves being held in version control. Shared services make heavy use of anycast, so internal consumers hit the closest instance but fall back to dozens of others automatically. We expect containers to fail, and we routinely destroy them to redeploy updated versions – we do not need to plan an outage. We expect host “servers” to fail as well and, again, services are deployed on many others, even in the same datacentre, so it is no drama when they do.
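The anycast fallback behaviour described above can be illustrated with a toy sketch – the hostnames are invented, and real anycast does this in the routing layer rather than in application code, but the failure semantics are the same:

```python
# Sketch of anycast-style fallback: consumers reach the "closest" instance
# first and fall back through the rest, so losing any one instance (or a
# whole AZ) is a non-event. Hostnames are made up for illustration.

def call_with_fallback(instances: list[str], healthy: set[str]) -> str:
    """Return the first reachable instance, nearest first."""
    for inst in instances:
        if inst in healthy:
            return inst
    raise RuntimeError("no instance reachable in any AZ")

ranked = ["az1.svc.internal", "az2.svc.internal", "az3.svc.internal"]
print(call_with_fallback(ranked, {"az2.svc.internal", "az3.svc.internal"}))
# nearest (az1) is down, so az2 serves the request
```

With dozens of instances in the `healthy` set at any time, destroying one to redeploy an updated image changes nothing for consumers – which is why no maintenance window is ever announced for it.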

Nobody likes to crow about uptime, but we last had a service-affecting outage in 2018. That’s why we can offer a 25x 100% SLA, and why you will never hear us planning an outage to update an SSL certificate or patch a cable! It is also why we can recreate our entire stack in a new AZ in about 30 minutes, for example when threatened with Russian DDoS. Try doing that when you need to build a VMware cluster, then install virtual servers, then deploy your code to them, then configure them by hand, manage firewalls etc. etc. – it is just as hard as it was when that was the way to do things 25 years ago.

Perhaps it is time for a carrier who invests in your reputation rather than plays roulette with it?
