The Slough incident that wasn’t

Some of you will have noticed an entry on our status page on 15 June: “Reduced network redundancy in Slough availability zone.” What you won’t have noticed – because there was nothing to notice – is any impact on your service. This post is about why that is, because I think it matters.

At 10:17 UTC, our monitoring flagged that the optical link from our Slough AZ back to central London was down. Not degraded, hard down. Traffic rerouted in well under a second via alternative paths while we worked with our fibre provider to identify the fault. Engineers were on site at the affected intermediate location by 13:30, found multiple fibre pairs damaged at the optical distribution frame (ODF) there, replaced them, and by 15:17 the link was restored and all alarms were clear. Five hours, start to finish.

Zero calls were dropped. Zero customers were affected.

This isn’t the first time we’ve written about an incident that had no customer impact whatsoever. In May 2024 we had a double fibre break on our ring through Volta – both pairs cut, taking out both directions at once – which should by any reasonable measure have been catastrophic for the London AZ. It wasn’t, because we had enough alternative paths that it registered as barely a blip. The Slough situation is the same story with a different postcode.

I keep writing about these non-events because the gap between “link down” and “outage” is exactly what separates carriers who take this seriously from those who don’t, and most of the time that gap is invisible. When everything works, you never know how close you came to it not working. The only way to show our work is to point at moments like this and say: here, this could have been bad, and it wasn’t, because of decisions made years ago.

The decisions in question are expensive and unglamorous. Multiple diverse fibre routes between sites. Optical rings with genuine spare capacity, not theoretical spare capacity. An architecture where entire availability zones can fail and the network continues – we could lose both US AZs and two of the three UK AZs and still carry full network load on the single remaining AZ. None of this was cheap. None of it shows up on a product comparison sheet – and neither does the fact that our last actual outage was in 2018. And when it works, it looks like nothing happened – which is exactly the point.

There’s a tendency in this industry to treat resilience as a cost centre until something goes wrong, at which point it becomes a crisis. We’ve never treated it that way, and I hope this is a useful reminder of what the alternative looks like in practice. Not a proud announcement of uptime statistics, but a concrete example: fibre damaged, traffic rerouted, nobody noticed, fault repaired. Exactly as designed.

If your current carrier can say the same, great. If you’re not sure they can – worth asking.