“Gazillions” of events per second

By Simon Woodhead

Some years ago ‘me too’ liked to claim that the 200-odd checks we make per call for fraud prevention wouldn’t scale. I made the point in various talks at the time that they were partly right: the checks won’t scale the way they would do them, i.e. by making lookups against a relational database. This is why the extent of their unique super special award-winning fraud prevention mega-algorithms is telling you after the event how much you’ve already spent!

At the time we highlighted how we were already doing 300k operations per second against Redis, performing 200+ checks per call, and yet still routing with lower post dial delay (PDD) than any of our competitors in testing. We argued that they couldn’t do 300k per second then, before any growth at all.

Redis is insanely fast and has been our tool of choice for anything customer facing for years – there should be no conventional database queries involved in the majority of line of business transactions with us. Thus nothing to grind to a halt when traffic grows, and boy has it grown!

In front of Redis we have always made massive use of queues, for writing at least. Every write goes through a queue which gives total separation between services – nothing can slow anything else down. In the worst case, the queues get bigger! The only trouble with queues getting bigger is that the wait time for queued items increases, but we could handle that by increasing the number of workers processing the queue.
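As a rough illustration of the pattern above, here is a toy, in-process sketch of writes going through a queue that a pool of workers drains. It is not our production code: the function name `drain_with_workers`, the `(key, amount)` write shape and the plain dictionary standing in for the data store are all assumptions made for the example.

```python
import queue
import threading


def drain_with_workers(writes, worker_count):
    """Apply queued (key, amount) writes to a store using worker_count threads.

    Hypothetical sketch: the dict below stands in for the real data store.
    """
    pending = queue.Queue()
    for write in writes:
        pending.put(write)  # producers only ever touch the queue

    store = {}
    store_lock = threading.Lock()

    def worker():
        while True:
            try:
                key, amount = pending.get_nowait()
            except queue.Empty:
                return  # queue drained, this worker exits
            with store_lock:
                store[key] = store.get(key, 0) + amount

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return store


if __name__ == "__main__":
    writes = [("calls", 1)] * 100 + [("minutes", 2)] * 50
    print(drain_with_workers(writes, worker_count=8))
```

The producer never waits on the store, only on the queue, which is the separation described above. Note that every worker here still contends for the same queue and the same lock, so more workers do not automatically mean less waiting.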

As we come to the end of 2017 we have over 30 queues and some of them have upwards of 500 workers at certain times of day! We were reaching the point where adding more workers was becoming counter-productive; adding a new one just caused more load on the queue and resulted in no decrease in wait time. We noticed this in two areas:

Processing call events in order to drive channel controls. ‘Me too’ will statically assign you to a shiny box, and configure a fixed number of channels there for you. If you’re lucky they’ll do the same again on a second shiny box. By contrast, Simwood customers can hit whichever of our PoPs makes sense to them for that call, and we apply channel limits and the many fraud controls across the network as a whole, and can vary them in real-time. So if a call lands in London and another call lands in San Jose or Tokyo, they all need an up-to-the-second view of an account’s usage and the state of the network as a whole. It goes beyond that though: if you have trunks on your account with individual channel limits, we do this not just for each account but for each and every trunk. As you can imagine, this is a lot of events!

Once a channel is ‘up’ we begin billing it in real-time. This isn’t ‘nibbling’ time or reserving from a balance; we are properly billing the call in order to take account of all the complications such as connection charges, minimums, rounding, increments etc. This drives not only the real-time calls in progress view available in the portal and API, but also our own credit control (which is vital for fraud mitigation with ‘locked balances’) and customer trunk/sub-account billing – yes, you get to see live calls in progress for each of your trunks too.
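To make ‘properly billing’ a call in progress a little more concrete, here is a minimal sketch of pricing a call from its elapsed time so far, taking a connection charge, a minimum duration, a billing increment and rounding into account. The function name `bill_call`, its parameters and every rate shown are hypothetical values chosen for illustration, not our actual tariff logic.

```python
from decimal import Decimal, ROUND_HALF_UP


def bill_call(elapsed_seconds, rate_per_minute,
              connection_charge=Decimal("0"),
              minimum_seconds=0, increment_seconds=1):
    """Price a call from its elapsed time so far (hypothetical sketch)."""
    # Apply the minimum billable duration.
    billable = max(elapsed_seconds, minimum_seconds)
    # Round up to the next whole billing increment (integer arithmetic).
    billable = -(-billable // increment_seconds) * increment_seconds
    usage = rate_per_minute * Decimal(billable) / Decimal(60)
    total = connection_charge + usage
    # Round the charge to four decimal places (an assumed precision).
    return total.quantize(Decimal("0.0001"), rounding=ROUND_HALF_UP)


if __name__ == "__main__":
    # 61 seconds on a per-minute increment bills as 120 seconds.
    print(bill_call(61, Decimal("0.0120"),
                    connection_charge=Decimal("0.0100"),
                    minimum_seconds=30, increment_seconds=60))
```

Calling a function like this repeatedly with the latest elapsed time, rather than deducting a fixed slice from a balance, is the difference from ‘nibbling’: the figure shown for a live call is what the call would cost if it ended at that moment.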

Of all of our queues, these two are the most latency sensitive. If we’re a few seconds late recognising a channel as in progress then all our controls become less useful. Similarly, we used to bill every account’s calls and trunks every second, but a delay here on top of a delay in a channel being recognised could lead to billing missing calls entirely or updating much less frequently.
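The channel controls described above boil down to counting channels in use per account and per trunk, wherever on the network the call landed. The sketch below shows that idea with in-memory counters; the class name `ChannelLimiter`, its methods and the account and trunk names are hypothetical, and the dictionaries stand in for whatever shared store each site would really consult.

```python
from collections import defaultdict


class ChannelLimiter:
    """Network-wide channel counts per account and per trunk (toy sketch)."""

    def __init__(self, account_limits, trunk_limits):
        self.account_limits = account_limits  # {account: max channels}
        self.trunk_limits = trunk_limits      # {(account, trunk): max channels}
        self.account_in_use = defaultdict(int)
        self.trunk_in_use = defaultdict(int)

    def try_start(self, account, trunk=None):
        """Return True and count the channel if both limits allow it."""
        if self.account_in_use[account] >= self.account_limits.get(account, 0):
            return False
        if trunk is not None:
            key = (account, trunk)
            if self.trunk_in_use[key] >= self.trunk_limits.get(key, 0):
                return False
            self.trunk_in_use[key] += 1
        self.account_in_use[account] += 1
        return True

    def end(self, account, trunk=None):
        """Release a channel when the call ends, never going below zero."""
        self.account_in_use[account] = max(0, self.account_in_use[account] - 1)
        if trunk is not None:
            key = (account, trunk)
            self.trunk_in_use[key] = max(0, self.trunk_in_use[key] - 1)


if __name__ == "__main__":
    limiter = ChannelLimiter({"acme": 2}, {("acme", "sales"): 1})
    print(limiter.try_start("acme", "sales"))  # first call on the trunk
    print(limiter.try_start("acme", "sales"))  # trunk limit already reached
```

If the ‘start’ event for a call is processed late, the counters are stale and a call that should have been refused can be let through, which is why delay in this queue matters so much.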

Concurrently, I’ve spoken much about our containerisation project, anycast and the way we work today. That is a huge piece of work that has kept our team busy all year (and last) but is really bearing fruit now. The issue here is one example: many of the daemons involved were legacy code, running on our legacy virtualisation stack, whilst our new voice stack, API and portal made it across to the new world earlier in the year.

So, to finally get to the point, our awesome Mauritius team have now embarked on rewriting some of that legacy code in a 2017 way. That means lots of Node.js, giving us containers that we can spin up in any number, anywhere on the network, that speak BGP and can be anycast or otherwise highly available, and that, rather than being constrained in a VM, make full use of the massive hardware they’re on! They’re on our ultra-low-latency Arista network too, but because of routing on the host and anycast, packets only leave the host if they need to and then only travel the minimum distance across the local network – much more efficient than journeying up and down the country (or across the Atlantic) depending where ‘master’ nodes happen to be.

The results were astounding! To be honest they were so astounding I’m ashamed to say I was convinced they’d done that developer ‘thing’ of deleting essential business logic because it looked pointless. Of course, they hadn’t; they’d just combined a fantastic new architecture with expert coding, and come up with something insanely great. There are some differences in the way things work that make a direct comparison hard: for example, queuing is now local, and what we were doing once before we now do in duplicate for HA. Broadly speaking though, a single container now does the work that 500 daemons used to do, in approximately 1/300th of the time. So that single container represents a 150,000x performance gain (500 × 300)!! But of course we don’t run just one container: we now run several in every site, which multiplies the capability further, whilst each one has a clone that duplicates its work for HA. Jean-Michel and Thomas, you’ve knocked it out of the park!

By now you may be concluding that the legacy code was a weak comparison? Before concluding that, remember that 5 years ago we were doing what ‘me too’ still cannot today. Our 200+ fraud checks per call remain unique and last time we checked we were still connecting calls faster than our competitors (and at weekends!). In another 5 years maybe there’ll be a shiny box that prevents your loss from fraud rather than telling you after they’ve profited, and maybe they’ll offer encryption and the Opus codec on a low-latency network. Of course, they’ll be “innovating” and winning acclaim but as a wise man once said “To be innovative, we can’t look to what others have done. The whole idea of blazing a path is that there was no path there before.” I wonder what we and our amazing customers will have done by then?

If you enjoy hearing how we do things or would like to learn more about the open-source projects you depend on, why not come to SimCon1?