

Resilience is not a magic button

Simon Woodhead

5th September 2024

I’m afraid this post might be a bit ranty. Regular readers will know they sometimes are (some love it, some hate it – that’s how principles go) and I don’t usually warn. But today I’ve read the most stupid of utterly stupid things, I think ever. It demonstrates how inept factions of our industry are, and how platform operators really, really need to expect more from their “carriers” in the interests of their end-users. I don’t mean to offend anyone personally; nice people have stupid ideas too, nice people can work for people with stupid ideas, and success can delude someone into believing they know what they’re doing, blinding them to the stupidity. As an old boy in finance once said to me during the dot-com boom: “I know enough to know I know bugger all”. That stuck with me as a skill I really wanted to master!

Infrastructure resilience

Firstly, let me say, I have the scars from a lack of infrastructure resilience: from explosions in data centres taking out entire racks, to power surges melting primary and backup storage arrays (and working three days straight to recover major retailers), through to once lying alone on the floor in Telehouse as a very major airport was closed because of an outage I’d caused and would need to fix. The most recent of those was over 15 years ago BTW, but they taught me that stuff breaks, that you need to plan for it as a routine event, and that you can never have enough redundancy.

Other events taught me that break-glass solutions are really a last resort, because someone needs to be able to enact them. I remember spending a day in Petra, Jordan, in the early 2000s trying to press the big red button I’d engineered especially so I could take a holiday, needing it that very day, but having no workable mobile service to do so! Petra was fabulous, I’m told, and maybe one day I can go back to appreciate it.

My failures are no secret skeleton in my closet, nor ammo to snipe with behind my back at Christmas parties. I’m not ashamed and I’ll willingly tell anyone, because we all need to learn these lessons and I’d sooner others didn’t have to do it the hard way. I think they call it experience, and as I said to a CEO once: “Experience is spotting the first few crystals of ice forming on your windscreen. Any idiot can tell when they’re upside down in a ditch.” Simwood today reflects not just my learnings, but those of other veterans like our CTO Charles, all of whom bear their own scars.

That marks the other aspect of this: team. Being the single point of responsibility is not fun, and thankfully a cross I no longer bear. But having carried it for so long, it isn’t a burden I’m prepared to impose on others, especially our customers. Things need to be automated, so that even if you’re a one-man band you can enjoy your trip to one of the seven wonders, or time with the family, without checking your phone every 10 seconds for alerts!

So, that is a long way of saying: resilience should come as standard and, wherever possible, without relying on break-glass solutions or human intervention, so you can sleep at night, or just focus on the actual issue in hand when there is one.

Our approach

When you configure what we call ‘routing’ on Simwood Carrier Services – how we get incoming calls to you – we’ve talked before about how you can do this once at the account level, per trunk, or per number, with the most specific configuration winning for a given call. Within that, you create a mini-dialplan so you can specify multiple destinations. If SIP, these can ring in parallel or fail over in sequence, or even follow time-of-day rules, which aren’t something you’d normally expect at carrier level. One issue with SIP failover configured like this is that you need to get a failure code or wait for a timeout, i.e. your switch won’t give an error code when it is on fire.
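
To make that concrete, here’s a minimal sketch of what such a mini-dialplan might look like, expressed as Python data. The field names, number and URIs are illustrative assumptions for this sketch, not the actual Simwood schema: destinations sharing a group ring in parallel, groups are tried in order as failover, and the optional time-of-day fields gate when an entry applies.

```python
# Purely illustrative: field names, number and URIs are assumptions
# for this sketch, not the actual Simwood routing schema.
from datetime import time

routing = {
    "number": "+443300000000",  # hypothetical number
    "destinations": [
        # Group 1: both PBXs ring in parallel during office hours.
        {"uri": "sip:pbx-a.example.com", "group": 1,
         "days": "Mon-Fri", "start": time(9, 0), "end": time(17, 30)},
        {"uri": "sip:pbx-b.example.com", "group": 1,
         "days": "Mon-Fri", "start": time(9, 0), "end": time(17, 30)},
        # Group 2: tried only if group 1 fails or is out of hours.
        {"uri": "sip:oncall.example.com", "group": 2},
    ],
}
```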

That is why we lean so heavily on DNS and encourage all customers to use fully qualified domain names (FQDNs) rather than direct IP addresses. If you use FQDNs you have numerous options, from multiple A records, through updating short-lived records (perhaps automatically linked to monitoring), to SRV and NAPTR.

SRV is the daddy here because it enables all of the aforementioned yet adds a pre-configured failover and load-balancing schedule: the record’s priority field sets the failover order, while its weight field sets the load-balancing split. SRV-aware equipment will know, for example, to use server B if server A fails, or to spread traffic across A and B in a 75:25 ratio.
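
As a hedged illustration of how a client consumes those records, here’s a short Python sketch using the dnspython library; the domain is a placeholder. It follows the usual SRV rules: lowest priority first, with a weighted random choice within that priority giving the 75:25-style split.

```python
# A minimal sketch of SRV-based server selection (RFC 2782 semantics)
# using dnspython (pip install dnspython). Domain is a placeholder.
import random
import dns.resolver

def pick_sip_server(domain: str):
    """Resolve _sip._udp SRV records and pick a target: lowest
    priority wins; within that priority, weight sets the split."""
    answers = dns.resolver.resolve(f"_sip._udp.{domain}", "SRV")
    records = sorted(answers, key=lambda r: r.priority)
    best = [r for r in records if r.priority == records[0].priority]
    # Weighted random choice gives the 75:25-style balancing.
    # (Treating weight 0 as 1 is a simplification of the RFC.)
    chosen = random.choices(best, weights=[r.weight or 1 for r in best])[0]
    return str(chosen.target).rstrip("."), chosen.port

# host, port = pick_sip_server("example.com")
```

A real client would of course retry against the next priority group when the chosen target fails, which is exactly the failover schedule the record pre-configures.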

We use all of these options for your outbound traffic to represent our three UK availability zones, coupled with monitoring-linked modification, and geo-DNS so that customers in the US see a different pair of availability zones based on their location – all behind the same single FQDN – but crucially we also respect all of them when we’re sending incoming traffic to you. DNS has proved itself robust and mission-critical, presuming you don’t try and run a 15-year-old version of BIND yourself on-net. We’ve always outsourced it to experts precisely because it is so mission-critical: whilst it’s something we could do, we can’t add value to what they do.

Oh, and lastly: if all the above options don’t cater for what you need, you can change any config, be it account, trunk or number level, through our API (or portal). Change an account- or trunk-level config and it affects all numbers under it in a single API call. Some customers have exotic scripts to do this, and have had for years. Within this there is a solution for everyone, and whilst there’s a spectrum of implementations ranging from hacky to elegant, the options are there and they work. With these configured, and equipment in more than one location, our customers can have an outage, or even a partial outage (e.g. a routing/availability issue), and individual calls will go the right way to maintain service, as standard, without human intervention. Like they should!
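
By way of example, a failover script might look something like the minimal sketch below. The endpoint, credentials and payload are placeholder assumptions for illustration, not the documented Simwood API; the point is that one trunk-level call re-routes every number beneath it, so monitoring can trigger it with no human in the loop.

```python
# Hypothetical failover script: the endpoint, credentials and payload
# are placeholders for illustration, not the documented Simwood API.
import requests

API = "https://api.example.com/v3/accounts/ACC123/trunks/trunk-1/routing"

def fail_over_to(uri: str) -> None:
    """Repoint a whole trunk (and every number under it) in one call."""
    resp = requests.put(
        API,
        json={"destinations": [{"uri": uri}]},
        auth=("api_user", "api_key"),  # placeholder credentials
        timeout=10,
    )
    resp.raise_for_status()

# e.g. triggered by monitoring, not by a human pressing a button:
# fail_over_to("sip:dr-site.example.com")
```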

An alternative!

Now, picture this for contrast. 

One of our slower-moving competitors has, with much fanfare, introduced a portal, 20 years or so after BT, and 28 years after us. Apparently they only have one instance of it, as it requires annual maintenance (the only maintenance they announce) to update its SSL certificate – something we do routinely and automatically for many customers across hundreds of instances. They allow numbers to be mapped to a single URI each. They apparently haven’t heard of SRV or NAPTR, despite both being defined in RFCs in 2000, so instead have been working hard on a new feature.

Wait for it… 

Within the single-instance portal, they can, for an extra fee to reflect the work of adding it (presumably manual, therefore), give you a button. A magic button no less! When you have an outage, you ignore it for a bit in order to log into the portal and press the magic button. One presumes that behind the scenes, using some kind of wizardry we can only imagine, it’ll update all your numbers. So yes, you can fail over from one site to another, but what the actual…!?

Conclusion

If you’re building a real business on this, supporting your customers as they build their real businesses, you frankly need to reconsider who you work with. It shouldn’t make me so cross that our competitors are inept, but it does, because real lives and real livelihoods are affected, and those people deserve better from our industry.

Our BYoC capability is a way of getting all the above resilience even when they won’t let you port numbers over. I’ll leave you to discuss with them what you’re actually still paying for! Ported or BYoC, unless you renumber, you’re still exposed to their failure given the way the UK’s ‘onward forwarding’ porting works, but you get all the other benefits of moving numbers over either way.

Of course, if you have your own number ranges hosted then they can be moved over to Simwood very easily and probably with improved economics and performance given our depth of interconnect and infrastructure.

Please give us a call!
