BT’s 999 outage

Simon Woodhead

1st October 2024

I was at an event in Malaysia last week and was, frankly, ashamed to be British. The UK was singled out for so many failings and for being on a fast track to third-world status. So imagine my shame and disbelief on getting back to our national regulator’s report into the failings of our flagship telco, when nearly 14,000 potentially life-dependent calls failed on June 25th 2023. It is an absolute disgrace, and one that our entire industry in this country should be ashamed of.

Before giving views on the contents of the report, I do need to say it is really easy to rip apart other people’s outages, but we all have them! What matters is the lessons learned, changes applied and the lack of hubris demonstrated. My wish here is genuinely to make things better for consumers rather than take advantage of a bad situation for BT. I hope my comments are taken in that light.

My first observation is one of scale. 13,943 calls failed over 10.5 hours, which Ofcom cites as being 23% of the total. That suggests there would otherwise have been 60,621 calls made in that window. 999 is a 24×7 service and I have no basis to assert that there are more calls during waking hours than overnight, so a simple extrapolation suggests 138,562 could be the approximate scale of calls in a 24-hour period. With an average of 39.03 seconds each, that is 90,135 minutes per day, and approximately 2.7m minutes per month. Put another way, it is 1.6 calls per second. If the Emergency Call Handling Service (ECHS) were a Simwood customer, they wouldn’t even register on our league tables, which is very, very surprising, almost incredible.
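For anyone wanting to retrace that arithmetic, here is a minimal Python sketch. It uses only the figures quoted above. The variable names, the 30-day month and the truncation to whole calls at each step are assumptions made so the output lines up with the numbers in the text. The flat 24-hour extrapolation is the same simplification described above, not a claim about real call patterns.

```python
# Figures quoted in the text above.
FAILED_CALLS = 13_943        # calls that failed during the outage
FAILED_SHARE = 0.23          # share of calls Ofcom says failed
OUTAGE_HOURS = 10.5          # length of the outage window
AVG_CALL_SECONDS = 39.03     # average call duration

# Assumption: a 30-day month, chosen only to reproduce the "2.7m" figure.
DAYS_PER_MONTH = 30

# Calls that would otherwise have been made during the outage window.
# Truncated to whole calls, which matches the 60,621 quoted in the text.
calls_in_window = int(FAILED_CALLS / FAILED_SHARE)

# Flat extrapolation to 24 hours (assumes no day/night variation).
calls_per_day = int(calls_in_window * 24 / OUTAGE_HOURS)

# Traffic expressed in minutes and as an average arrival rate.
minutes_per_day = calls_per_day * AVG_CALL_SECONDS / 60
minutes_per_month = minutes_per_day * DAYS_PER_MONTH
calls_per_second = calls_per_day / (24 * 60 * 60)

print(f"calls in outage window: {calls_in_window:,}")
print(f"calls per day:          {calls_per_day:,}")
print(f"minutes per day:        {minutes_per_day:,.0f}")
print(f"minutes per month:      {minutes_per_month / 1e6:.1f}m")
print(f"calls per second:       {calls_per_second:.1f}")
```

Without the truncation at each step the daily figure comes out a couple of calls higher, which makes no difference to the conclusion.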

That feeling grew when I read that one issue was that a maximum of 50 calls were permitted in a queue at one time, which has now been increased to 420. That is into Raspberry Pi territory, seriously, and is very much at odds with the volumes tested on us for other three-digit numbers we’ve been involved in, where we were expected to connect many thousands of calls in parallel at a very high rate. My expectation of 999 is for far more calls than the services we’ve been involved with, and I’m left wondering whether my perceptions are grossly exaggerated or the figures are wrong.

That perception of scale is undermined, though, when one looks at the money. It is interesting that Ofcom stipulates a fine based on “relevant turnover”, which in this case I presume is taken to be the revenue from operating the ECHS monopoly. Others who have been fined over access to 999 have provided the service for free, so what is relevant for them is clearly different, and presumably includes wider retail revenue. The big Sword of Damocles of an Ofcom investigation was the threat of a fine of 10% of turnover, which I’ve always taken to mean worldwide group income (£20.8bn in BT’s case), and clearly I’ve misunderstood. I think a fine of £2bn would have satisfied the meaningful deterrent test far better than the £25m proposed, before the discount for cooperation. £25m is just 0.12% of group turnover for BT, or equivalent to fining Simwood £7,200, before discount. Their auditors set materiality at £135m, so by definition this is an immaterial fine. We’ve never been fined but, were we, I wouldn’t consider £7,200 bank-breaking, yet would expect to be treated equitably. This therefore feels like a dangerous precedent.

My main query with regard to money is the scale of that “relevant turnover” against which a fine is levied. Ofcom redact the precise figure and offer no explanation of how it is arrived at. I think publishing that would be really helpful because I have an issue with the figures as they stand. Anything less than £250m of “relevant turnover” would suggest the fine is in excess of the 10% cap, so “relevant turnover” must be more than that. Extrapolating the figures for 999 calls alone, above, gives me £76m, so it must also include other things. We know other services like Relay were affected but don’t have volumes to estimate revenue. BT Wholesale’s revenue has been lost in segmental reporting in recent years, being wrapped up instead in the £8bn of the “Business” segment, which would more than cover it at a very low percentage penalty.

What is nagging me, though, is BT Retail and EE revenue: one assumes that over half the failed calls originated with their retail customers. Others who have been fined have, one assumes, had their fines calculated on the basis of retail revenue alone, as their wholesale revenue was zero. Has any of that been included? Is it even fair to do so without including everyone else in the industry, although they have no choice but to consume this monopoly service, and by definition no choice of an alternative? It just doesn’t smell right to me and I think, for the good of the industry, Ofcom should explain how that “relevant turnover” figure is worked out. Sometime soon someone else will be fined over 999 and I would hate for the precedent to be set at 0.12% of revenue, or 18.5% of materiality. At those levels it is no incentive for people to up their game, and such restrictive segmentation risks driving behaviours that are very unhelpful; e.g. I wonder whether customers serving Cumbrian farmers named Dave, who call Zambia, need to be put into their own entity to minimise “relevant turnover”.
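The ratios quoted above are easy to check. The following Python sketch uses only the figures given in the text: the £25m fine, £20.8bn group turnover, £135m audit materiality and the 10% cap. The variable names are illustrative, and the "minimum relevant turnover" is simply the fine divided by the cap, not a figure taken from Ofcom's report.

```python
# Figures quoted in the text above, in pounds.
FINE = 25_000_000                  # proposed fine, before discount
GROUP_TURNOVER = 20_800_000_000    # BT worldwide group turnover
MATERIALITY = 135_000_000          # auditors' materiality threshold
CAP_PERCENT = 10                   # maximum fine as % of relevant turnover

# Fine as a share of group turnover and of audit materiality.
fine_pct_of_turnover = FINE / GROUP_TURNOVER * 100
fine_pct_of_materiality = FINE / MATERIALITY * 100

# Smallest "relevant turnover" for which the fine stays within the cap.
min_relevant_turnover = FINE * 100 / CAP_PERCENT

# What a fine at the full cap on group turnover would have been.
fine_at_cap_on_group = GROUP_TURNOVER * CAP_PERCENT / 100

print(f"fine as % of group turnover: {fine_pct_of_turnover:.2f}%")
print(f"fine as % of materiality:    {fine_pct_of_materiality:.1f}%")
print(f"minimum relevant turnover:   £{min_relevant_turnover / 1e6:.0f}m")
print(f"10% of group turnover:       £{fine_at_cap_on_group / 1e9:.2f}bn")
```

The last line shows that 10% of group turnover is a little over £2bn, which is where the £2bn comparison above comes from.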

Turning to the technical explanation, there are a number of questions I’d love to ask and, frankly, I wouldn’t have accepted the explanations given if presented by anyone in Team Simwood. I am deeply sceptical about why BT appears to have been so “helpful” to the Ofcom investigation here and can’t help but wonder if it was to avoid some of these questions being asked. Firstly, what is a “node”? Three nodes around the country is acceptable if they’re a full service stack like our Availability Zones, but the report talks of misconfigurations and calls being sent to a non-existent “media server”. That sounds perilously like a node being a single SBC or server. Aside from being grossly insufficient in my opinion, that would make this monopoly service look like an indecent profit centre, given the relevant turnover is at least £250m.

At its root, this appears to be a misconfiguration of call routing, with the IVR being configured to play a message but pointing to a non-existent “media server” in one of the nodes. This obviously led to calls failing when they hit that node, but I cannot reconcile that with the assertion that it was causing agents to be logged out and unable to take calls via the correctly configured nodes. I do hope Ofcom examined some SIP traces for this scenario because it doesn’t chime with my experience. If a call failed because it couldn’t be routed, logically the agents would be unaware of it as it failed to route to them; if it did route but had no media, as could be more likely, then I cannot fathom what logic, or where, would cause the receiving agent to be logged out. Agents are presumably registered to a local Registrar, which receives calls from any and all nodes, or they are possibly registered to all three nodes directly. I can’t see how, in either scenario, taking an affected call would log them out unless some crazy logic exists to bodge other issues, or the underlying issue was something completely different and not mentioned.

Obviously there are massive failures in human configuration of IVRs, which is why we manually configure nothing anywhere on the network and rely so heavily on standard images and auto-deployment. There are also huge process failures in how the incident was handled but, to be honest, my sympathy goes to those on the front-line with this and I can very much relate to the whole cycle of failing over, failing back, and taking some time to understand what the root issue was under the pressure of an outage. That is all too easy to judge after the fact from behind a desk with hindsight. In my opinion, the failure here is the architecture and the management, all of which were matters for the design and procurement of what the rest of the industry has no choice but to consume.

There are other questions raised by this report, and architectural concerns, but I’ll save them. I and any resources within Simwood are at Ofcom’s disposal if they wish to peel back any curtains, because 999 of all services has to be 100% reliable. Even 99.999% is unacceptable when the 0.001% is someone trying to save a life. I’d also be really interested in contributing to an alternative because it strikes me this should not be so centralised and monopolistic. When we’re dealing with a budget of at least £250m a year and a call volume and redundancy model that could be handled on three Raspberry Pis, there is ample scope to do something world-class here. I’d favour something completely decentralised, with contributing operators serving the IVR and call distribution to agents, and with the agent call-centres decentralised and supplied by providers other than BT as well. A call on any network should have numerous routes to help, with no single third-party dependency, and genuine cooperation between operators to fall back onto each other should the worst happen. That needs more explanation but when I call 999 on my mobile, I connect to any available operator regardless of my service provider; the same can and should be achieved at the carrier and platform level, I would suggest.

I hope my thoughts are helpful because as someone who has been at both ends of 999 calls far more than most, this isn’t a place for point scoring or penny-pinching. It just needs to work, 100% of the time every time, and we should all own making it so.