By Simon Woodhead
We put our new stack live this weekend and now nearly all traffic is flowing through it.
Operationally, things went well considering how large a change this was: call volumes remained at normal levels and the absence of support tickets left us feeling quite good.
This is the end of the story for 99.67%+ of traffic, and we’d like to once again thank the majority for engaging and working with us on this large change. Some of the balance is outbound traffic still forced onto the old stack through the use of fixed IP addresses rather than hostnames, and our Operations Desk will be following up with those customers. What follows is an analysis of the minority of remaining traffic that was in some way affected.
New IP Addresses
Coming into Monday, we saw a small but steady trickle of tickets from smaller customers who were experiencing a loss of service. Others noticed on Tuesday, or even Wednesday! Almost universally these resulted from customers having failed to permit the new IP addresses we announced in February 2017 – nine months ago – asking for them to be implemented by March 27th. Some of those addresses came into service in the months since, but this change brought more of them into use.
There was, of course, a suggestion from some that it was our fault and we hadn’t given enough notice, yet the majority of customers, including many very large (and, some might say, process-constrained) ones, had managed the change over the intervening nine months. We announced those changes by specific email, highlighting them as important, and put the notice on the blog as well. That also means it went out via numerous social media channels and was repeated in every subsequent (at least weekly) email, as well as being shown on the portal login page. We even put a big red arrow in those subsequent emails highlighting it as important. Of course, there were many subsequent new-stack updates as well. It had hundreds of exposures, and the majority took action.
We would, however, welcome any constructive feedback on how we could have done better, as clearly a few didn’t notice and their end-users were affected. In one case, which we heard about from an end-user phoning in, end-users were told we had suffered a “massive outage” – which remains, of course, to put it politely, an utter pack of lies.
We’d left in place the capability to move numbers and accounts back to the ‘old stack’, and this was used judiciously as a way of ensuring continuity in the short term while the process of making February’s changes was started.
New behaviour
Turning to the new stack specifically, we’d encouraged testing over the previous six months, with progressively increasing insistence. Whilst in the early days this was about getting you familiar with new features, to give time to build these into production services, latterly it was more about ensuring that your platform remained compatible. We’ve replaced the entire stack and some libraries in there are more than seven years newer than those they were replacing, meaning some things would just be handled differently, if only through bugs being fixed.
Thankfully, here a significant majority took action, even if in some cases it was at the eleventh hour! This worked well; there were few issues, and those who tested ironed them out before deployment.
We can count on one hand the reports of actual new issues that didn’t relate to adding the new IP addresses, but some of these were interesting and a little challenging to iron out. They almost entirely relate to customer premises equipment, even for those who had tested against their main CP-side equipment.
One surprise here was how some customers seem to map service directly to end-users, which amounts to quite a few end-user sites being directly connected. This isn’t done using the Registration Proxy, which is designed for exactly this, whether or not there is a PBX at the customer site. Instead, there seems to be a large amount of mapping to dynamic DNS hostnames fronting commodity ADSL routers, with port-mapping! Those commodity routers of course vary in age and quality, while introducing their own idiosyncrasies.
One issue that became apparent here was the increased size of SDP payloads. This has been a trend for a few years and is one reason why we’ve repeatedly counselled using TCP or, better still, TLS, rather than ‘out of the box’ UDP. You’ll note that the Registration Proxy doesn’t support UDP for this reason!
This issue was more acute in the context of us sending full wholesale SDPs directly to end-users behind already MTU-constrained ADSL routers. In some cases, simply using TCP solved the problem, but in a couple of cases it didn’t. Whether this was down to the ADSL router or the CPE it was port-mapped to we don’t know, but what our Operations Team established was that INVITEs exceeding the MTU were simply ‘never received’, regardless of the transport used. This is very odd!
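To make the arithmetic concrete, here’s a rough back-of-the-envelope sketch (not our code; the byte counts are illustrative assumptions rather than measurements) of how quickly an INVITE carrying a codec-rich SDP approaches a typical 1500-byte MTU when sent over UDP:

```python
# Illustrative sketch only: header and per-codec sizes below are assumed, not measured.
MTU = 1500                 # typical Ethernet MTU; ADSL links are often smaller still
IP_UDP_OVERHEAD = 28       # 20-byte IPv4 header + 8-byte UDP header
SIP_HEADERS = 700          # rough size of INVITE headers (Via, Route, Contact, etc.)
SDP_BASE = 200             # v=/o=/s=/c=/t=/m= lines before any codec attributes
PER_CODEC = 60             # rough size of rtpmap/fmtp lines per offered codec

def invite_size(num_codecs: int) -> int:
    """Approximate on-the-wire size of a UDP datagram carrying the INVITE."""
    return IP_UDP_OVERHEAD + SIP_HEADERS + SDP_BASE + num_codecs * PER_CODEC

for codecs in (4, 8, 12, 16):
    size = invite_size(codecs)
    note = "fragments / may be dropped" if size > MTU else "fits in one datagram"
    print(f"{codecs:2d} codecs -> ~{size} bytes ({note})")
```

Over TCP a large message is simply split across segments, whereas a UDP datagram of that size relies on IP fragmentation, which plenty of consumer routers handle badly, if at all.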
And a new feature
In those edge cases mentioned above, we were able to use an old feature that enabled the codecs offered to be overridden. We actually removed this from the API some time ago as it was being mis-used. It was intended to enhance service, but some customers were forcing every call to be transcoded in order to give themselves a worse experience – there’s a field, let’s fill it.
We’ve therefore adapted this feature to help both with edge cases such as the above and to let those who want to fully embrace Opus do so.
By default, for existing numbers, nothing changes. However, on newly allocated or reconfigured numbers, there are some new options for SIP and Registration Proxy (“reg”) endpoints around whether and how we offer Opus:
always - the current behaviour, i.e. include Opus along with our other offered codecs
never - remove all Opus options from the SDP offer, shrinking the SDP and improving compatibility with UDP signalling and the edge cases of unpredictable consumer routers discussed above
only - only offer Opus, at popular bandpass rates (i.e. qualities). We’ll always enable FEC and dynamically adjust bitrate

These settings can be applied at each sip or reg leg in number configuration, or applied to account default configurations.
This feature is live in the API and portal now.
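For those automating this, a minimal sketch of how it might be driven from Python follows. Note that the endpoint path, payload shape and credentials here are illustrative assumptions, not the documented API schema, so please check the API documentation for the real details:

```python
# Hypothetical sketch only: the URL path, JSON field names and auth shown here are
# assumptions for illustration, not the documented API schema.
import requests

API_BASE = "https://api.example.com/v3"   # placeholder base URL
AUTH = ("api_user", "api_password")       # placeholder credentials

def set_opus_mode(account: str, number: str, mode: str) -> None:
    """Set the Opus offer behaviour ('always', 'never' or 'only') on a number's sip endpoint."""
    if mode not in ("always", "never", "only"):
        raise ValueError("mode must be 'always', 'never' or 'only'")
    # Hypothetical endpoint and payload shape; consult the real API docs for the actual schema.
    url = f"{API_BASE}/numbering/allocated/{account}/{number}/config"
    payload = {"routing": {"sip": [{"opus": mode}]}}
    response = requests.put(url, json=payload, auth=AUTH, timeout=10)
    response.raise_for_status()

# e.g. shrink the SDP for a number behind a fussy consumer router:
# set_opus_mode("ACCOUNT123", "443300000000", "never")
```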
Thanks again
As mentioned, whilst long and transparent, the above deals with the experience of a minority of customers, representing broadly 0.3% of traffic. This was a huge change involving new hardware, on a new network, running an all-new and world-first architecture, with the latest versions and configurations of every element from Kamailio through FreeSWITCH to our own call-routing and event handlers. The only bits that didn’t change were our SS7 interconnects and back-office elements such as billing (which is very high on the list now). The team have put thousands of hours into it, so we’re really grateful to those who have embraced the change and given us such positive feedback.