IP Network Overhaul

By Simon Woodhead

It has been more than 4 years since we overhauled our core network and boy how the world has changed since then! Back then we chose Brocade on price/performance grounds, i.e. we could deliver substantially more throughput for the same money. 80Gbps of core forwarding per site seemed ample then and, to be honest, is still way more than we need for routine traffic.

What has changed in the period since is the sheer volume of traffic intra- and inter- data-centre, and the size of burst/flood events. Voice traffic remains relatively low-bandwidth for a given value of turnover compared to hosting or commodity Internet access, but our position in that marketplace has continued to evolve too. We’re a very grown-up provider (20 years this month!) to some very high profile customers and that is a position we take seriously, as demonstrated by our uptime record. Frankly, we cringe looking in larger competitors’ racks in common data-centres and seeing equipment that was past its sell-by date a decade ago. We couldn’t sleep at night operating that way.

Looking at overhauling our core again now presents some really exciting options that simply weren’t there before. I spoke at Kamailio World earlier in the year on how the future was panning out and will be elaborating on this at Cluecon (Chicago USA, August), and Astricon (Arizona USA, September). I’d suggest watching the existing video or those from coming events for a fuller explanation but in short the game today is in Merchant Silicon (versus vendor silicon) and software control. The former gives absolutely enormous shifts in price/performance, the latter enables us to do amazing things in dynamically protecting the network.

So what are we doing?

Vendor shift

We will be leaving Brocade for Arista. Brocade has served us well but frankly buying equipment from any of the legacy “vendor silicon” people at this stage of the game would be rather like buying a tractor when you could have a Tesla. If you need to plough a field then you might now opt for a tractor with a Tesla to get there; we don’t need to plough a field so a Tesla will do just fine!

Now Merchant Silicon opens up a whole world of “white-boxes”, i.e. cheap switches you can run Linux on. The important thing to note here is this isn’t like software-based network equipment of old; forwarding takes place in hardware, on silicon, so it is mega-fast. The only difference is that the silicon is not proprietary to a Brocade or a Juniper of this world but made by a chipset specialist such as Intel or, most notably, Broadcom. The Linux layer on top provides software control; it is not actually shifting packets.

If we wanted a simple layer-2 network, white-boxes would probably be the way to go. That isn’t to say that the various flavours of Linux that specialise in powering them (e.g. Cumulus) aren’t excellent at layer-3 because they are. They are used at hyper-scale and do everything we need.

Instead it comes down to the classic build or buy argument. We want the benefits of merchant silicon in ultra-density and software control, but we want the stability and support of a major vendor. Arista has an amazing software team and gives us that confidence.

Ultra-low latency

In the old days you had switches that operated at layer-2 and routers that operated at layer-3. Then came layer-3 switches and huge, huge routers used in layer-2 environments powered by MPLS, so arguably the definition has blurred. Really the distinction now is about port-density vs routing-table size: at a given price-point a switch will deliver many more large ports and more forwarding capacity, whilst a router will have fewer ports but hold far more routes. One constant though is that switches have generally been significantly lower latency and pioneers such as Arista have really pushed those boundaries, measuring latency in nano-seconds rather than milli-seconds.

Ultra-high density collapsed stack

At the same time, whereas we’ve been constrained for 10GE ports in recent times (given all new hosts have at least 4 of them), we’re moving to a platform where 10GE is the standard and, depending on the site, we have up to 96 of them per U of rack space. More importantly, we no longer need stacks of smaller switches underneath to aggregate as was the way networks were designed historically. They introduced extra latency and generally over-subscription, on paper at least. Each site will collapse into n (where n is at least 2) Arista switches taking our 80Gbps core to well over 2Tbps of capability. Every host will have full line-rate bandwidth to every other one in a given site with nano-second latency.

Table size

I mentioned table-size was the trade-off for port density and that historically meant you couldn’t hold the full Internet routing table (619k IPv4 routes at the time of writing), as many ISPs obsess over doing. We learned that was futile many years ago but still have several hundred thousand peer routes to hold, as well as our default transit route. So historically we couldn’t have used a layer-3 switch on our edge but two things have changed to make this possible.

Firstly, Merchant Silicon has evolved and higher-end chipsets can now hold the full table.

More significantly, Spotify did some great work here a few years ago on controlling – in software – which subset of routes you actually need. This enabled them to route 99% of their traffic with just 30k routes and made use of a commodity layer-3 switch practical for their edge. There’s a video on the early work here.

We’ll be using the same principles but taking them a stage further to handle route optimisation as well.
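
To make the idea of holding only the routes that matter more concrete, here is a minimal sketch in Python. It is not Spotify's code or ours; it simply assumes we already have a byte count per destination prefix (from flow sampling, say) and picks the busiest prefixes until either a coverage target or a table-size limit is reached. The function name, the sample prefixes and the byte counts are all made up for illustration, and anything not selected would simply follow the default transit route.

```python
from typing import Dict, List, Tuple


def select_routes(bytes_by_prefix: Dict[str, int],
                  max_routes: int,
                  target_coverage: float = 0.99) -> Tuple[List[str], float]:
    """Pick the busiest prefixes until the coverage target or table limit is hit.

    Returns the chosen prefixes and the fraction of traffic they cover.
    """
    total = sum(bytes_by_prefix.values())
    if total == 0:
        return [], 0.0
    # Busiest prefixes first; ties broken by prefix string so output is stable.
    ranked = sorted(bytes_by_prefix.items(), key=lambda kv: (-kv[1], kv[0]))
    chosen: List[str] = []
    covered = 0
    for prefix, count in ranked:
        if len(chosen) >= max_routes or covered / total >= target_coverage:
            break
        chosen.append(prefix)
        covered += count
    return chosen, covered / total


# Hypothetical sample: documentation prefixes with invented byte counts.
sample = {
    "192.0.2.0/24": 700,
    "198.51.100.0/24": 250,
    "203.0.113.0/24": 45,
    "100.64.0.0/10": 5,
}
routes, coverage = select_routes(sample, max_routes=3, target_coverage=0.99)
print(routes, round(coverage, 3))
```

With this sample the table limit of three routes is hit first, and those three prefixes still carry 99.5% of the traffic. A real deployment would also need to refresh the selection as traffic patterns shift, which this sketch does not attempt.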

Software control

In addition to the above we finally, in 2016, will have programmatic control over our network. We’re not talking OpenFlow or other traffic “control” mechanisms; we’re talking dynamic config. Historically, routers didn’t have APIs, they had consoles, and the closest we came to automation as an industry was scripting console commands, or BGP hacks.

We’re entering a world where we have APIs to dynamically configure the network equipment in real-time. We have some really exciting ideas here but top of the list is security. We can identify attack flows in our SIP IDS (see the Kamailio World talk linked above), or attack profiles from our Honeypot data and fend them off in real-time. Similarly, we’ll be able to identify DDoS traffic in software, and mitigate it directly in hardware.
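
As a rough illustration of the "identify in software, mitigate in hardware" loop, the Python sketch below takes a list of source addresses seen in suspicious events, flags the ones that cross a threshold, and collapses them into the smallest set of prefixes to block. Everything here is an assumption for the sake of the example: the function names, the threshold, the sample addresses, and especially the `drop from ...` rule text, which is a placeholder and not the syntax of any real switch. Pushing the resulting rules to the equipment would go through whatever API it exposes and is not shown.

```python
import ipaddress
from collections import Counter
from typing import Iterable, List


def find_attackers(event_sources: Iterable[str], threshold: int) -> List[str]:
    """Return IPv4 sources seen at least `threshold` times, in address order."""
    counts = Counter(event_sources)
    flagged = [ip for ip, seen in counts.items() if seen >= threshold]
    return sorted(flagged, key=ipaddress.IPv4Address)


def build_drop_rules(sources: Iterable[str]) -> List[str]:
    """Collapse flagged hosts into exact covering prefixes and render rules.

    collapse_addresses only merges networks that together form a complete
    larger prefix, so no unflagged address is ever included.
    """
    networks = [ipaddress.ip_network(src) for src in sources]
    collapsed = ipaddress.collapse_addresses(networks)
    # Placeholder rule format, not real device syntax.
    return [f"drop from {net}" for net in collapsed]


# Hypothetical events using documentation address ranges.
events = (["198.51.100.4"] * 5 + ["198.51.100.5"] * 6
          + ["203.0.113.9"] * 2 + ["192.0.2.77"] * 7)
attackers = find_attackers(events, threshold=5)
rules = build_drop_rules(attackers)
for rule in rules:
    print(rule)
```

Here the two adjacent hosts collapse into a single /31, so three flagged sources need only two hardware entries, while the source seen just twice is left alone.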

Exciting eh?!

Deployment

We’ve worked hard to earn a reputation for stability. Controlling our own network (in both IP and SS7) is a key part of that. We’re taking a leap here that will be new to many, but which has been well trodden at hyper-scale in major cloud networks at one extreme, and in critical applications in the finance industry at the other. We’re confident the path we’ve chosen is ready for now.

However, we’ll be deploying it in a careful, incremental way over the remainder of this year. We’ll be announcing any work ahead of time on our status page.

Naturally, please do not hesitate to contact us with any questions or come and see us at Cluecon or Astricon.