Turning our network around

By Simon Woodhead

It is almost 4 months since SimCon and, despite the pandemic, our team have been crazy busy. We’re going to give you an update at a special inaugural SimCron in a few weeks’ time, featuring none other than our CTO, Charles Chance, and our new SVP of Product, Mr Jared Smith. I’ll be there too, imposter syndrome active.

However, there’s an important change I wanted to preview here, not least because there are those who will say it is a change of direction. It concerns our cloud strategy and our network expansion: yes, two apparently mutually exclusive things.

The Cloud

In presentations past, I’ve talked about our approach to the Auto-Pilot Pattern, Anycast and other such ‘cloudy’ capabilities we’ve embraced on the network. I’ve also been somewhat dismissive of those who discuss their ‘network’ whilst simply having a few cloud VMs. Lastly, I’ve talked about Kubernetes and how it wasn’t the time for us to deploy it on-net.

We’ve proven the value of anycast and we’ve proven the value of Infrastructure as Code, albeit in a form that differs from the original Auto-Pilot vision in detail whilst remaining fundamentally the same in nature. But it can all be improved, and here is where I may be accused of changing my view, because the differentiator is the dynamic scale that the cloud can provide.

Serverless applications such as Google’s BigQuery have dramatically changed the way we do certain back office functions, and indeed have transformed what is even possible, such as our competitor price matching. Access to hundreds of thousands of CPUs for 20 seconds, dynamically orchestrated, just makes so much sense for that once-a-month job that would otherwise be impossible, take days, or require grossly over-specified on-net resources.

And so to Kubernetes. The team has deployed Kubernetes on Amazon with great results. It is auto-scaling, and they’ve deployed host-groups so that Kubernetes Pods (collections of containers representing a service) are appropriately located across multiple AWS Availability Zones and connected diversely back to our core network. All of that has been done in a manner that is cloud agnostic.

This is already powering Simwood Meet. Whilst most focus has been on the underlying architecture, certain critical services have fully migrated, such as call-routing, our micro-service which gets called hundreds of times a second and itself calls Redis hundreds of thousands of times a second. This used to be anycasted on-net with one instance per host. Whilst that could flex, it meant hosts had to be over-specified, and call-routing could be adversely affected by host busyness, causing occasional spikes in our otherwise legendary low PDD (post-dial delay). It now runs on Kubernetes, as do the appropriate Redis instances, and both are dynamically scaled in response to real-time metrics such as request latency. It is awesome!
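For the curious, here is a rough sketch of what scaling on request latency can look like. It uses the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler (desired replicas = ceil(current replicas × current metric / target metric), ignored when the ratio is within a tolerance of 1). Whether our deployment uses that exact autoscaler is not stated here, so treat this as an illustration only: the function name, the latency figures and the replica limits are all hypothetical.

```python
import math


def desired_replicas(current_replicas, current_latency_ms, target_latency_ms,
                     min_replicas=1, max_replicas=50, tolerance=0.1):
    """Proportional scaling rule in the style of the Kubernetes HPA.

    All parameter names and default values are hypothetical; they are not
    taken from any real deployment.
    """
    if current_replicas <= 0 or target_latency_ms <= 0:
        raise ValueError("replica count and target latency must be positive")

    # How far the observed metric is from the target (1.0 means on target).
    ratio = current_latency_ms / target_latency_ms

    # Inside the tolerance band, leave the replica count alone to avoid flapping.
    if abs(ratio - 1.0) <= tolerance:
        wanted = current_replicas
    else:
        wanted = math.ceil(current_replicas * ratio)

    # Never scale outside the configured bounds.
    return max(min_replicas, min(max_replicas, wanted))


if __name__ == "__main__":
    # Latency is 50% over target with 4 replicas running.
    print(desired_replicas(4, current_latency_ms=30, target_latency_ms=20))
```

The tolerance band is the design choice worth noting: without it, small wobbles in latency would cause replicas to be added and removed continuously.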

Other services and functions will be migrating this way, yet at the same time we’re also deploying Kubernetes on-net. Repurposing hosts will enable us to effectively double the on-net compute available to containers and migrate to Kubernetes alongside the existing stack. This means we will eventually have double the compute on-net to cope with post-COVID growth, whilst we’ll also be able to burst to the cloud and respond in real-time to demands on the network with no effective limit, scaling hosts and services as demand requires.

I hope you’re already noticing the benefits, but there is much more to come.

IP Network

Our network is substantially larger than our peer group to maximise availability and genuinely add value. Most (90%+) of our traffic comes over peering and direct interconnects, so the geographical spread is somewhat necessary to enable that and absolutely necessary for availability.

I mentioned above how we are starting to use AWS compute behind our own network. This logically means that proxies and media servers remain on-net but support services such as call-routing, billing etc. migrate to the cloud as necessary. However, there are some subtleties that warrant exploring.

One of our largest peers is actually AWS, with other cloud providers in the running too. A large percentage of our traffic originates on cloud providers because even some of our largest customers make extensive use of them. Some customers have AWS Direct Connect into our network, or are using our IPs on AWS. It’d be so much more sensible all round for us to have the capability to service this traffic on AWS directly. That is the direction we’re heading, with certain Simwood prefixes hosted by AWS and presenting services as additional Simwood Availability Zones.

Next, anycast works best with massive scale; with global geographic scale. We peer with 400+ other networks in the UK, many of them international operators, but the physical and owned network doesn’t stretch offshore. We therefore cannot leverage the same benefits ourselves as we could using, for example, AWS’s Global Accelerator, which gives us anycasted end-points in 83 Points of Presence in 73 cities across 38 countries. We used this already when we took Simwood Meet global, to ensure global participants hit their most local instance, and it works phenomenally, not least because any out-of-region backhaul is across AWS’s private network.

So if we’re using AWS (or another cloud provider) compute, and moving towards a cloud public edge, what does that mean for our IP network? That is a really good question and one I ask myself regularly. Networks are a cost and they are points of failure, so to exist they must add more value than if they didn’t. This is the trap many vanity-driven “networks” fall into: a VPN and a software router and ‘ta-da’. They see themselves as big boys and their website makes the most of that, yet anybody who knows what they’re talking about sees laughable liability and technical weakness.

The answer to the question is that our network is going nowhere. You simply cannot be a carrier without being able to, erm, carry traffic. An airline without planes is not an airline, it is a travel agent!

But things are going to change. I talked recently about our public edge and our DDoS strategy which could see it shut down in the worst situation. Those connecting over the public internet, and thus our public edge, could just as easily hit a cloud edge. Those who value security and availability are already directly connected to our private edge anyway, so the case for preserving an on-net public edge long term is not terribly compelling.

However, at the same time we’re building out to potentially 14 new sites in the UK and hundreds in the US for our evolving carrier interconnects. We need to interconnect with peer carriers in more locations, to improve availability and performance way above what it was in a TDM world. That is a different kind of build to the public edge, but it is a build nevertheless.

Conclusion

So what are you going to see? In the short term, very little! SIP proxies, media proxies etc. will all remain on-net unchanged. You will, however, see us make more use of cloud-based services to serve static content or for specific services. You may see a cloud-based Availability Zone or several arrive for SIP services, and indeed have the opportunity to directly connect with us in-cloud (talk to us), but short term these will complement what is on-net. Meanwhile, the private unseen network will be growing massively to add many, many points of IP interconnect with other operators. You could say the network is turning around.

As I said, we’ll be discussing this at the forthcoming SimCron so if you have any questions, please let us know in our community Slack channels.