Reference Architecture

We’ve long been vocal about how best to architect an ITSP, and have published ‘Interop Information’ as a guide to interfacing with us. What we’ve never done is pull all of that together.

Thankfully, we’re fortunate enough to have won some major accounts lately, and in almost every case their set-up involves a ‘workshop’ with us around their own deployment.

This post, therefore, serves two purposes:

An open invite to all customers to speak to Frazer and put a date in the diary to spend time with us in Bristol. We’ll whiteboard your business and give you our thoughts on how to improve your interface with us and, ultimately, service to your customers. This is an engineering-driven offer, not an opportunity to sell to a captive audience!
We’ve noticed in the workshops we’ve done recently that some trends are emerging (many of which we’ve articulated individually on this blog before), along with some common designs that come out of listening to customer needs. This post, therefore, aims to capture some of those in one place for those who can’t invest the time in letting us propose a tailor-made design. This is necessarily extremely brief!

Availability zones

As those at SimCon1 will recall, our network is divided into three Availability Zones in the UK, two in the US, and one in Asia. Within each country, we operate a largely independent platform capable of surviving the loss of others.

The extent of this is a subject for another forum, notably presentations at SimCon, but the point here is a simple one: customers should connect to us in more than one Availability Zone. Two is the recommended minimum, but the more the better.

How to connect

Whilst every Availability Zone encompasses independent connectivity to the public Internet, we recommend customers connect directly to us to improve survivability of service and optimise quality. This means being in the same data-centres with a cross-connect to us and either peering or using an assignment of our IP addresses, or at the very least ensuring your host has very direct connectivity to us.

You should aim for at least two sites that have diverse connectivity to two of ours. Having all of your equipment in the same data-centre as us, with a direct cross-connect and a peering session to us, only gets you half-way there. You also need equipment outside of that data-centre that can reach another data-centre of ours. Sometimes the solution is to co-locate with us but, given we operate a very open peering policy, there’s no imperative to move sensibly located equipment run by competent operators.
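The ‘two sites to two of ours’ rule can be checked mechanically. Below is a minimal sketch, assuming you can list your connectivity as pairs of (your site, our Availability Zone); the function name and the site and zone labels are hypothetical, purely for illustration.

```python
def single_points_of_failure(links):
    """Return the sites or zones whose loss would leave no working link.

    `links` is an iterable of (customer_site, availability_zone) pairs.
    An empty result means no single site or zone failure cuts you off.
    """
    links = set(links)
    if not links:
        raise ValueError("at least one link is required")

    failures = []
    for site in sorted({s for s, _ in links}):
        # Would any link survive if this customer site were lost?
        if not any(s != site for s, _ in links):
            failures.append(("site", site))
    for zone in sorted({z for _, z in links}):
        # Would any link survive if this Availability Zone were lost?
        if not any(z != zone for _, z in links):
            failures.append(("zone", zone))
    return failures


# Hypothetical labels: one site cross-connected to two zones is only half-way there.
half_way = [("site-1", "zone-a"), ("site-1", "zone-b")]
diverse = [("site-1", "zone-a"), ("site-2", "zone-b")]

print(single_points_of_failure(half_way))
print(single_points_of_failure(diverse))
```

The first case still reports the single site as a point of failure, which is exactly the ‘half-way there’ situation described above.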

What about the cloud?

Often we find customers in various ‘cloud’ operators. We’d reiterate the point we’ve made before that “the cloud is just somebody else’s computer”, given it is often used to delegate responsibility. For example, “we don’t need two sites because we have things in the cloud now”. This isn’t true: you need to be in diverse Availability Zones with your cloud provider, and potentially there’s an opportunity to have diversity of cloud providers too.

We are increasingly asked about direct connections from Simwood to said cloud providers. We peer with most of them in multiple places so reaching Simwood over ‘the public internet’ is still very performant, but we will support AWS Direct Connect and similar. That is to say, if you have services in AWS and co-location or connectivity with Simwood, we’re very happy to support a Direct Connect into our network for your use.

Some of our largest customers work this way although we repeatedly question the validity of handling media in ‘the cloud’ and tend to favour media being handled on locally co-located hardware, with other non-media services in ‘the cloud’.

DNS

No matter how many times we say it, certain customers continue to insist on using direct IP addresses in their interop with us. This is bad and dangerous!

We understand why this happens, as many other operators who deploy magic boxes tend to favour directly addressing the assigned magic box. If you’re putting 100 customers on one magic box, and 100 on the next magic box, this approach is explicable. We don’t use magic boxes and our entire network performs the same function but in a more highly available way. We ask customers not to undermine that by using IP addresses and instead use the FQDNs we publish and maintain.

By using the FQDNs you give us control to move your traffic around maintenance and even the outage of a site. You also give your own equipment the opportunity to use the SRV records we publish to fail over to another site should the first-choice site not be reachable. If you use IP addresses you get none of this and (as we say in the interop) need to handle it yourself.

Horrifyingly, one solution to ‘handling it yourself’ we saw very recently was someone recreating our SRV records on a local spoof of our DNS zone that they control. This is very dangerous, not to mention utterly pointless.
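For anyone wondering what SRV-aware equipment does with the records, here is a minimal sketch of the selection logic in the spirit of RFC 2782: lower priority values are tried first, and records sharing a priority are picked by weight. The hostnames are hypothetical and not our published names, the reachability check is left to the caller, and the weighting is a simplification of the RFC’s algorithm (each weight is bumped by one so zero-weight records remain selectable).

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class SrvRecord:
    priority: int  # lower value is tried first
    weight: int    # relative share among records of equal priority
    port: int
    target: str


def order_srv_targets(records, rng=None):
    """Return records in the order a client would attempt them."""
    if rng is None:
        rng = random.Random()
    ordered = []
    for priority in sorted({r.priority for r in records}):
        group = [r for r in records if r.priority == priority]
        while group:
            # Simplification: weight + 1 keeps zero-weight records selectable.
            weights = [r.weight + 1 for r in group]
            pick = rng.choices(group, weights=weights, k=1)[0]
            ordered.append(pick)
            group.remove(pick)
    return ordered


def first_reachable(records, is_reachable, rng=None):
    """Return the first record whose target passes the caller's check, or None."""
    for record in order_srv_targets(records, rng):
        if is_reachable(record.target):
            return record
    return None


# Hypothetical records: two equal-priority sites plus a lower-preference backup.
records = [
    SrvRecord(10, 50, 5060, "site-a.sip.example.com"),
    SrvRecord(10, 50, 5060, "site-b.sip.example.com"),
    SrvRecord(20, 0, 5060, "site-c.sip.example.com"),
]

# Pretend both first-choice sites are unreachable.
down = {"site-a.sip.example.com", "site-b.sip.example.com"}
print(first_reachable(records, lambda target: target not in down))
```

With both first-choice sites marked as down, the backup record is returned. None of this is available to equipment configured with a bare IP address.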

Inbound – site-specific configurations

Our external interconnects are distributed around Availability Zones. Thus inbound traffic should continue to flow into our network even if one of them is lost. We encourage our customers to map numbers to FQDNs and not IP addresses, and to make use of SRV. We will respect SRV to handle your load-balancing and failover.

As a tangent, we also recommend using default number config wherever possible such that in the event of the config needing to change at 3am on a wet Sunday (less likely if you use DNS!) you can do so once rather than having to change every configured number. You can override the default with any specific number config to get the best of both worlds.
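The default-plus-override idea is easy to picture as a lookup. This is only a sketch: the dictionary shape, the field name and the numbers are hypothetical and do not reflect the actual portal or API schema.

```python
def effective_config(number, default_config, overrides):
    """Return the number-specific config if one exists, else the default."""
    return overrides.get(number, default_config)


def destinations(numbers, default_config, overrides):
    """Map each number to the destination it would currently use."""
    return {
        n: effective_config(n, default_config, overrides)["destination"]
        for n in numbers
    }


# Hypothetical numbers and hostnames for illustration only.
numbers = ["441632960001", "441632960002", "441632960003"]
default_config = {"destination": "sip.customer.example.com"}
overrides = {"441632960003": {"destination": "fax.customer.example.com"}}

print(destinations(numbers, default_config, overrides))

# The 3am change: one edit to the default moves every number without an override.
default_config = {"destination": "dr.customer.example.com"}
print(destinations(numbers, default_config, overrides))
```

Only the number with its own config is unaffected by the change to the default.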

Of course, DNS then becomes a single point of failure so the architecture there matters. We recommend outsourcing this to one of the global specialists in the space, just as we do ourselves.

However, we noticed a quirk recently with a customer doing all of the above. They were in two Simwood Availability Zones, using DNS and SRV to load-balance and fail over between two sites, with calls entering Simwood across two Availability Zones. An unintended consequence of this was that 50% of calls entering, say, Slough, were being sent cross-network to their equipment in London, and vice-versa. Failover worked as it should but this isn’t efficient and feels a little dirty (and not in a good way!).

Accordingly, you will now see in the portal and the API documentation the ability to have Simwood Availability Zone specific configurations on your numbers. It is important to note these do not force traffic to a specific site, but rather configure how it is handled once it lands there. You have no control over whether it lands there.

The intention is to have an ‘anywhere else’ rule which functions as configuration does by default, but then override that for specific sites. Thus, a customer on-net in Slough and London would configure overrides for Slough and London to an FQDN whose underlying SRV fails over between them but still prefers the local site. The ‘anywhere else’ rule, of course, remains essential as calls could land somewhere else on the network and you need to have a configuration when they do!
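As a rough sketch of that logic, assuming a per-number mapping from landing zone to FQDN: the key names, hostnames and priorities below are all hypothetical and are not the real portal or API fields.

```python
ANYWHERE_ELSE = "anywhere_else"  # hypothetical key for the catch-all rule


def destination_for(landing_zone, zone_config):
    """Pick the FQDN for a call that has landed in `landing_zone`."""
    if ANYWHERE_ELSE not in zone_config:
        raise ValueError("an 'anywhere else' rule is required")
    return zone_config.get(landing_zone, zone_config[ANYWHERE_ELSE])


# Hypothetical config for a customer on-net in Slough and London.
zone_config = {
    "slough": "slough-first.sip.customer.example.com",
    "london": "london-first.sip.customer.example.com",
    ANYWHERE_ELSE: "sip.customer.example.com",
}

# Hypothetical SRV data behind each FQDN as (priority, target) pairs.
# Lower priority is tried first, so each override prefers the local site
# and still fails over to the other one.
srv_zone = {
    "slough-first.sip.customer.example.com": [
        (10, "sbc.slough.customer.example.com"),
        (20, "sbc.london.customer.example.com"),
    ],
    "london-first.sip.customer.example.com": [
        (10, "sbc.london.customer.example.com"),
        (20, "sbc.slough.customer.example.com"),
    ],
    # Equal priority: in practice load would be shared by weight, not sorted.
    "sip.customer.example.com": [
        (10, "sbc.london.customer.example.com"),
        (10, "sbc.slough.customer.example.com"),
    ],
}


def try_order(fqdn):
    """Targets for `fqdn`, most preferred first (ties not randomised here)."""
    return [target for _, target in sorted(srv_zone[fqdn])]


for zone in ("slough", "london", "manchester"):
    fqdn = destination_for(zone, zone_config)
    print(zone, "->", fqdn, try_order(fqdn))
```

A call landing in a zone with no override (the made-up ‘manchester’ here) falls through to the ‘anywhere else’ rule, which is why that rule is treated as mandatory in the sketch.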

***

We hope this has helped and will likely explore the issues raised in more detail again. Do please let us have your feedback and take us up on our offer to discuss this in relation to your own business.