By Simon Woodhead
It has been too long since I got super excited about a new development and couldn’t wait to tell you about it. That doesn’t mean we’ve been sitting around; it just means the development has been far more ‘business as usual’, addressing technical debt. That’s great, good for customers, good for the business, but a bit boring from my personal perspective! Today though, I have something to share which scratches that itch and is part of a solution I’ve wanted for years!
Simwood is unique in that we don’t have designated SBCs that we hook customers up to, and statically route from them in both IP and call-flow terms. We’re more dynamic than that! We have global customer edges which are highly available in numerous ‘availability zones’, behind DNS SRV for customer selection and load balancing, and even with dynamic DNS that varies geographically. Your channel limits and rate limits are available globally in real-time, and you can send traffic anywhere.
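As an aside, for anyone unfamiliar with how DNS SRV drives that kind of selection and load balancing: clients take the lowest-priority group of records and then pick within it in proportion to weight. Here’s a minimal illustrative sketch in Python – the hostnames and values are hypothetical, not our actual records:

```python
import random

# Hypothetical SRV records: (priority, weight, target).
# Illustrative values only, not real DNS data.
srv_records = [
    (10, 60, "edge-a.example.net"),
    (10, 40, "edge-b.example.net"),
    (20, 100, "edge-c.example.net"),  # backup: used only if the priority-10 group fails
]

def pick_edge(records):
    """Pick a target roughly per RFC 2782: lowest priority group, weighted-random within it."""
    lowest = min(priority for priority, _, _ in records)
    group = [(weight, target) for priority, weight, target in records if priority == lowest]
    weights = [w for w, _ in group]
    targets = [t for _, t in group]
    return random.choices(targets, weights=weights, k=1)[0]

print(pick_edge(srv_records))
```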
However, what happens behind that? You’ll recall me talking, years and years ago, and maybe in presentations, about our pioneering use of anycast internally, but that was far more about microservices than SIP. For SIP we had a proof-of-concept anycast SIP proxy in beta which we’ve not done a huge amount with, and arguably its relevance has waned. What we’ve never really talked about much is how we handle calls going back out of the network, i.e. to peer carriers or indeed to customers. Today I want to talk about peers.
If we did have static SBCs, mapping them to static SBCs peer side would be possible, but when you have multiple availability zones as we do, and peers have multiple SBCs, both geographically diverse, you quickly get a lot of permutations. 5 AZs of ours into 5 SBCs at a given peer is actually 120 permutations or routes to manage (given order matters and repetition is disallowed). Of course though, we don’t just have one peer; we have not quite hundreds, but lots. Some have more than 5 SBCs, some only have 1 or 2 – indeed resilience standards vary! Assuming there are 200 peer SBCs faced from 5 of our AZs, that’s roughly 3 × 10¹¹ routes to manage. That’s a lot, although thankfully not every peer SBC is relevant for every call given many are geo-specific (e.g. the SBCs of a peer who is an exclusively Irish carrier are not going to be involved in calls to Tuvalu). They all need managing though.
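For the curious, one way to reproduce those figures is to treat each ‘route’ as an ordered failover sequence of 5 SBCs with no repeats – a quick sketch:

```python
from math import perm

# Ordered failover sequences of 5 SBCs, no repetition.
print(perm(5, 5))    # 120 – one peer offering 5 SBCs
print(perm(200, 5))  # 304,278,004,800, i.e. roughly 3 x 10^11 – a pool of 200 peer SBCs
```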
To be perfectly honest, I’ve never been particularly proud of how we did that. Sure, it was better than how other people do it, but whilst every other element of call routing is completely dynamic, here we relied on fixed rules. We did that in a variety of ways, but generally somewhere between duplicate routing tables and geographic overrides. Either made for large routing tables and imperfect failure handling.
The ideal is to have a single global routing table, with no duplication at peer level, and simply a list of contender SBCs. Those contender SBCs should be monitored for availability, performance and proximity. The best proxy for proximity is time, because in internet terms, time is proximity. In a failure scenario, the SBC in the rack next door might not be locally reachable without traversing the country or continent, even though it is physically still in the rack next door. In time terms, it becomes further away the moment that failure occurs, and another SBC may now be closer and preferable, whilst other contenders may be unreachable.
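To make that concrete, here’s a rough sketch of what ‘closest available’ selection could look like. It is illustrative only: it uses a simple TCP connect time as the proximity measure, whereas real SIP availability probing would more likely use something like OPTIONS pings, and all the names are invented.

```python
import socket
import time
from dataclasses import dataclass

@dataclass
class Contender:
    """A candidate peer SBC we could send this call to."""
    host: str
    port: int = 5060

def probe_rtt(sbc: Contender, timeout: float = 1.0) -> float | None:
    """Return a rough connect time to the SBC in seconds, or None if unreachable."""
    start = time.monotonic()
    try:
        with socket.create_connection((sbc.host, sbc.port), timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

def rank_contenders(contenders: list[Contender]) -> list[tuple[Contender, float]]:
    """Order reachable contenders by measured time: in internet terms, closest first."""
    probed = [(c, probe_rtt(c)) for c in contenders]
    return sorted(((c, rtt) for c, rtt in probed if rtt is not None),
                  key=lambda pair: pair[1])
```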
What we’ve implemented now does exactly that, with your call flowing to the closest available peer SBC at that time, for that call. Of course, we continue to fail over to the next best SBC and, where appropriate, load balance as our peers expect. This means your calls are always following the best route even off our network and, with latency and availability constantly probed from and to everywhere, routing adapts to real-time network conditions. It also evolves as our network gets even bigger and we add new availability zones or even just PoPs!
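In call-flow terms that boils down to something like the loop below, continuing the sketch above. Again, this is illustrative rather than our actual code: send_invite stands in for whatever places the call leg, and where a peer expects load balancing you would pick among near-equal contenders by weight rather than strictly taking the single closest.

```python
def route_call(ranked_sbcs, send_invite):
    """Try each contender in order of measured proximity, failing over on error.

    `ranked_sbcs` is the output of rank_contenders() above; `send_invite` is a
    hypothetical callable returning True if the call leg was established.
    """
    for sbc, _rtt in ranked_sbcs:
        if send_invite(sbc):
            return sbc  # call placed via the closest SBC that actually worked
    raise RuntimeError("no contender SBC reachable for this call")
```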
Right now it is running on a few test peers so we can get comfortable with some of the support and monitoring implications, but it’ll be rolled out really soon.