Maintenance windows tell you a lot!

By Simon Woodhead

Approaches to maintenance vary across our industry, and they are very revealing, if not concerning, about an operator’s architecture and investment. Ultimately this translates into risk to the lives of your customers and the existence of your business: Ofcom fines can be up to 10% of relevant turnover. It therefore matters that you understand how the operators you depend on work and the priorities they exhibit, and that you jump ship if there’s a mismatch.

If they run magic boxes (a.k.a. SBCs), with customers allied to specific pairs, then it stands to reason that a given pair is going to need updating periodically and that this will affect the customers allied there. This is why operators who believe in magic boxes ally customers with more than one pair and don’t do maintenance on them all at the same time. It’s an approach necessitated by a choice we wouldn’t make, but a valid approach nonetheless, and the one taken by the largest operators in our space. It is also an approach the regulatory machine understands and is comfortable with.
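To make the "more than one pair, never all at once" rule concrete, here is a minimal sketch of the check such an operator would want to run before announcing a window. The customer names, pair names and the function itself are hypothetical, purely for illustration, and not a description of any operator's tooling.

```python
def customers_left_without_service(allocations, window):
    """Return customers whose every allied pair is inside the maintenance window.

    allocations: dict mapping a customer name to the list of SBC pairs it is allied to.
    window: iterable of pairs due to be taken down at the same time.
    """
    in_window = set(window)
    return sorted(
        customer
        for customer, pairs in allocations.items()
        # A customer is stranded only if it has pairs and all of them are down.
        if pairs and set(pairs) <= in_window
    )


# Hypothetical allocations: customer-c has been allied to a single pair only.
allocations = {
    "customer-a": ["pair-1", "pair-2"],
    "customer-b": ["pair-2", "pair-3"],
    "customer-c": ["pair-1"],
}

print(customers_left_without_service(allocations, ["pair-1"]))
print(customers_left_without_service(allocations, ["pair-3"]))
```

An empty result means the window can go ahead without taking any customer fully out of service; anything else means the schedule, or the allocation, needs another look.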

If you don’t run magic boxes and, like us, your entire network is one highly available assembly of smaller parts, these types of maintenance windows are far rarer. Ours tend to relate to network maintenance, where they serve as an ‘at risk’ notice for a given site while traffic and service are silently and seamlessly moved over to another. We haven’t had ‘virtual machines’ since 2016 or so: our entire stack is containerised, with somewhere between dozens and hundreds of instances of any one container running, and thousands of containers running across the network. This requires management and process, and for us updating and deployment is a continuous process that doesn’t need a pre-assigned window. It is happening all day, every day! Even when we did have virtual machines (pre-2016), we had several of each, in several clusters across multiple availability zones. If we needed to perform maintenance on one, we’d still fail service over to another so there was no interruption. That feels like something of a minimum!
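For anyone unfamiliar with why many replicas remove the need for a window, the idea can be sketched as a rolling update: take one replica out, replace it, put it back, and refuse to start if doing so would leave too few serving. This is an illustrative sketch only; the `Replica` class, `rolling_update` function and version strings are assumptions for the example and not our deployment tooling, which a container orchestrator would normally handle.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    version: str
    in_service: bool = True


def rolling_update(replicas, new_version, min_in_service):
    """Upgrade replicas one at a time, never dropping below min_in_service.

    Returns the number of replicas still serving during each step.
    """
    serving_during_step = []
    for replica in replicas:
        serving = sum(1 for r in replicas if r.in_service)
        if serving - 1 < min_in_service:
            raise RuntimeError("not enough replicas to update without interruption")
        replica.in_service = False        # drain: stop sending traffic here
        serving_during_step.append(serving - 1)
        replica.version = new_version     # replace the running version
        replica.in_service = True         # return to service before touching the next
    return serving_during_step


fleet = [Replica(f"api-{i}", "v1") for i in range(3)]
print(rolling_update(fleet, "v2", min_in_service=2))
print([r.version for r in fleet])
```

With a single replica and a minimum of one, the function raises rather than proceeding, which is exactly the situation that forces a notifiable maintenance window.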

Then there’s a third class of operators who are still running virtual machines or bare metal installs, and likely a combination of both. It is astounding to see them scheduling maintenance windows to upgrade an SSL certificate on “the” instance. Astounding not just that there is only one, but that if this is the level of maintenance that is disruptive and notifiable, what else is not being maintained? If the virtual machine has to go down for an SSL cert upgrade, when does the entire host platform go down for a version upgrade? It is more astonishing still to see that the virtual machine isn’t running just the website or a single service, but several. It is an approach which takes me back to the late 90s, with n-tier architectures of fat services, running many ‘sites’ in the case of the web server tier. Even then, though, we managed to put load-balancers in front of them and assure high availability because, you know, stuff breaks or otherwise needs turning off! This also explains why that third class of operators issues RFOs (Reason for Outage reports) essentially saying “we don’t know what went wrong, but we turned it off and back on again and now it is ok”. That isn’t an RFO, but it exposes why the Government felt it needed to introduce the Telecommunications (Security) Act and nanny-state security requirements.
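The late-90s load-balancer trick mentioned above is not complicated. A minimal sketch, assuming hypothetical backend names and a toy `LoadBalancer` class rather than any real product, is round-robin selection that simply skips whatever has been marked down:

```python
class LoadBalancer:
    """Toy round-robin balancer that routes around backends marked down."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0

    def mark_down(self, backend):
        # Called when a health check fails or maintenance begins.
        self.healthy.discard(backend)

    def mark_up(self, backend):
        # Called when the backend passes its health check again.
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        # Try each backend at most once per request, in round-robin order.
        for _ in range(len(self.backends)):
            backend = self.backends[self._next % len(self.backends)]
            self._next += 1
            if backend in self.healthy:
                return backend
        raise RuntimeError("no healthy backends")


lb = LoadBalancer(["web-1", "web-2"])
lb.mark_down("web-1")              # e.g. while its certificate is replaced
print([lb.pick() for _ in range(3)])
lb.mark_up("web-1")
```

With two instances behind it, replacing a certificate on one is a non-event for customers. With only “the” instance, `pick` has nothing to return, and that is the outage.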

In 2024, it is remarkable that such operators can be in any way compliant, and I trust that, after further recent service-affecting (and notifiable) outages, Ofcom is taking a keen interest. On that note, if you are a customer of a wholesale operator such as ourselves or others, we’d remind you of your obligation to report outages to Ofcom within 3 days (or 3 hours, if severe enough).

If you feel uncomfortable being wholly dependent on 1990s architecture, with the risks to life, fines and so on that go with it, give Amy a call.