By Simon Woodhead
I made a post on LinkedIn last night, which is usually an event! This time the wokerati were nowhere to be seen, sensible people didn’t need to pretend they hadn’t seen it and there were no lawyers’ emails. It is hard being me a lot of the time, but this wasn’t one of those times! I was touched by the lovely reactions from the great and good and the kind words from long-standing customers. This was the post:
As you can see, we had a massive outage yesterday, or could have done. Losing the fibres going in both directions on a ring is a big deal; normally you’d expect to lose one side and have traffic flow the other way, but it has nowhere to go when both sides are cut. In a ring as significant as ours between London datacentres, and for an affected site which houses all our compute for the London Availability Zone, that could be catastrophic. Thankfully, we have multiple other levels of connectivity in both our national ring and direct fibres between datacentres, meaning that in this case, with the loss of two paths, we still had two remaining. As a result, it was a bit of a non-event.
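For the network-minded, here is a minimal sketch of that reasoning. The topology and names below are entirely hypothetical, not our actual layout: model each physical path out of the affected site as a named link, cut both ring fibres, and check whether the site can still reach the rest of the network.

```python
from collections import defaultdict, deque

# (link name, endpoint A, endpoint B): four notional ways out of the affected
# site, i.e. the two ring directions, a national-ring path and a direct fibre.
# All names are made up for illustration.
links = [
    ("ring_east",     "affected_site", "dc_east"),
    ("ring_west",     "affected_site", "dc_west"),
    ("national_ring", "affected_site", "dc_north"),
    ("direct_fibre",  "affected_site", "dc_east"),
]

def still_connected(links, cut, start, target):
    """Breadth-first search over whichever links survive the cut."""
    graph = defaultdict(list)
    for name, a, b in links:
        if name in cut:
            continue
        graph[a].append(b)
        graph[b].append(a)
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == target:
            return True
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False

# Yesterday's event: both ring directions cut at once, yet the site still has
# a way out, so traffic keeps flowing.
print(still_connected(links, {"ring_east", "ring_west"},
                      "affected_site", "dc_east"))  # True
```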
Had it been an event, we have other Availability Zones. The network is designed precisely in anticipation of an entire zone failing. In fact, the network is specified such that we could lose both US AZs and two of the three UK AZs, and still carry the entire network load on the single remaining one. So an AZ failing is planned for and expected, but that doesn’t mean we should rest on our laurels.
There’s a really important lesson here for those appraising carriers and those of you building your own networks. Believe it or not, the lowest probability of an outage comes when you have a single server on someone else’s network. Really. When you add a second, you potentially reduce the impact of an outage, but you also double the likelihood of one: there are two machines now, so it is twice as likely that one of them will fail. Make it 100 and you’re roughly 100 times more likely to have one fail. The trick is ensuring that the potential reduction in impact is actualised; you’d hope that by the time you got near 100, the impact of any one machine failing would be close to zero. That is easier said than done though.
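To put rough numbers on that, here is a tiny sketch using a made-up failure rate, purely for illustration. The chance that something fails grows roughly in line with the number of machines, so the win has to come from each individual failure mattering less and less.

```python
p = 0.001  # assumed chance of any one machine failing on a given day (made up)

for n in (1, 2, 100):
    p_any = 1 - (1 - p) ** n  # chance at least one of n machines fails
                              # (roughly n times the single-machine figure)
    impact = 1 / n            # share of the service one failure takes out,
                              # *if* the remaining machines absorb the load
    print(f"n={n:>3}  P(any failure)={p_any:.4f}  impact of one failure={impact:.0%}")
```

The likelihood of an incident goes up nearly 100-fold, but the slice of service any one incident can touch shrinks towards 1%; the hard engineering is in making that second column true.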
Fun anecdote from prehistoric times when we ran on virtual machines; remember them!? A long-departed colleague had built a fax service in a single site. I remember insisting he add redundancy to the service and provisioning him a new VM in a different site. He promptly configured one for inbound fax and one for outbound fax: double the likelihood of failure, no reduction whatsoever in the impact of one. Massive face-palm, which I don’t mind sharing because it is so far from where we are today and have been for a long time.
We’ve been fully containerised for getting on for 10 years, with thousands of containers around the network and even mundane services like our portal having dozens of instances about the place. All of these run strictly controlled versions, ensuring the network functions identically world-wide, with the minimum likelihood of an outage and the minimum impact when something breaks, as it inevitably will. That is hard to do and even harder to illustrate convincingly, but it makes more sense to people when described in network terms, imagining a JCB going through a fibre duct or an event like yesterday’s.
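As a toy illustration of what strictly controlled versions means in practice (a sketch, not our actual tooling), think of it as continuously checking that every instance of a service, everywhere, reports the same pinned version, and flagging anything that drifts. Every name and version string below is made up.

```python
expected = "portal:4.2.1"  # the single version that should be running everywhere

# Versions reported by each running instance, grouped by (hypothetical) site.
running = {
    "site_a": ["portal:4.2.1"] * 12,
    "site_b": ["portal:4.2.1"] * 12,
    "site_c": ["portal:4.2.1"] * 11 + ["portal:4.2.0"],  # one straggler
}

# Any site where the set of running versions isn't exactly {expected}.
drift = {
    site: sorted(set(versions) - {expected})
    for site, versions in running.items()
    if set(versions) != {expected}
}
print(drift or "all instances identical world-wide")
# {'site_c': ['portal:4.2.0']}
```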
This is a deep and complicated topic and one of those things you never really get to the bottom of. However, along the way it involves expenses that many would consider pointless. It involves engineering that many would think is overly complex. All to mitigate a hypothetical situation that some would prefer to fob their way out of if it ever happens. If it doesn’t, think of the money they’ve saved – it would be millions in our case. I’m pleased, and proud, that we’ve always come down on the right side of that line, preferring to invest in our network and infrastructure to give maximum survivability and minimum impact to any incident.
When there are so-called carriers out there who don’t have a UK network at all, and others who haven’t spent a penny on their single-site “network” in a decade, it is really gratifying to have occasion to celebrate what we willingly do right and to thank my colleagues for caring so deeply about this stuff. Moreover, in what is a pretty thankless industry a lot of the time, we really really appreciate those who have taken the trouble to recognise the magnitude of what could have been, and frankly, what would have been elsewhere.
Thank you!