According to a 2016 study by the Ponemon Institute, the four most common causes of large datacenter failures were UPS system failure, cybercrime (DDoS), accidental/human error, and water, heat, or air conditioning failures. Frankly, I’m surprised accidental/human error isn’t #1, and when you look at #1 and #2, there’s a good argument to be made that those are human errors at their core, but I digress.
My point is this: it’s no surprise that many of us have considered, or are considering, picking up a second Internet provider to give our enterprise more resiliency (redundancy) in the event that one of our providers should fail.
BGP multi-homing is a common method of stitching two providers together and giving your enterprise two paths out to the Internet. If you’ve never done it before, let’s take a look at what it takes and what it’s going to add to your bottom line. Most of us, when designing our enterprise networks, use an IGP (Interior Gateway Protocol) for routing.
Most likely you’re using OSPF, EIGRP, IS-IS, or RIP. These are protocols used for routing within what is known as a single Autonomous System (AS). An Autonomous System is, simply put, a single network domain. This can be a single enterprise or a set of routing devices under one administrative control. An AS might also be a large cloud provider like Amazon, or a large Internet service provider like BT Business or Global Services.
Most enterprises connect to their Internet provider using that IGP and take a single default route from the provider. In most cases, the enterprise still only cares about intra-AS routing (routing within its own AS). The only interaction with the outside world is at that one interface with the provider. If you’re considering BGP multi-homing, you’re considering stepping into the world of routing between multiple Autonomous Systems.
You’re now going to have more than one route to the outside world, and you’re going to have (potentially) a LOT more decisions to make about where to send traffic. This is a great thing in terms of survivability when a failure occurs. However, there are a lot of up-front considerations and some planning you’ll need to do to make this transition successful. Routing between Autonomous Systems is done with an EGP (Exterior Gateway Protocol).
While there are a myriad of IGPs, there is really only one EGP in use today: the Border Gateway Protocol (BGP). IGP routing knowledge doesn’t necessarily translate right over to BGP. If you don’t have anyone on staff today with BGP routing experience, you’ll want to start getting someone up to speed. They may need lab experience, lab equipment, and weeks (or more) to plan this deployment ahead of time. It might be a good idea to work with an industry partner who has this experience and can help smooth the transition. There are, of course, appliances purpose-built to help simplify this “load sharing” across multiple providers.
Assuming, however, that you’ll be configuring things yourself, there are two basic approaches you can take when multi-homing. Approach number one involves just taking a “default route” from each provider. This approach can be somewhat simpler, but it also has some drawbacks. All of your outbound traffic will go out a single provider. The only time provider #2 will ever get used is when provider #1 fails. The net effect of this configuration is that you have a “primary” path and a “backup” path. This means you’ll be paying for that second connection and not using it well over 99 percent of the time.
This is not a very efficient use of your bandwidth (or money). Approach number two involves taking a full Internet route table from both providers and allowing your “border” router to decide which path is “best” to any given destination on the Internet. This allows traffic to flow over both connections all the time (although don’t expect it to be evenly distributed). In the event of a failure of either provider, the surviving link will take up all of the load. An additional benefit of this option is that, by using route filtering, route policies, and communities, you may be able to “traffic engineer” around some temporary problems with a specific provider.
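To make approach number two a little more concrete, here is a minimal sketch of how a border router might choose between two providers on a per-prefix basis. It is a deliberate simplification: real BGP best-path selection has many more tie-breakers, and this version looks only at local preference and AS-path length. The `Path` class, the provider names, the prefixes, and the AS numbers (taken from the ranges reserved for documentation) are all made up for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Path:
    """One candidate path to a prefix, learned from one provider (hypothetical model)."""
    prefix: str
    provider: str
    local_pref: int   # set by your own inbound policy; higher is preferred
    as_path: tuple    # AS numbers the route traversed


def best_path(candidates):
    # Simplified selection: highest local preference first,
    # then the shortest AS path. Real BGP applies further tie-breakers.
    return min(candidates, key=lambda p: (-p.local_pref, len(p.as_path)))


def build_forwarding_table(paths):
    """Group paths by prefix and keep only the winning provider for each."""
    by_prefix = {}
    for path in paths:
        by_prefix.setdefault(path.prefix, []).append(path)
    return {prefix: best_path(c).provider for prefix, c in by_prefix.items()}


# Two providers each advertise the same two (documentation) prefixes.
paths = [
    Path("198.51.100.0/24", "provider1", 100, (64500, 64502)),
    Path("198.51.100.0/24", "provider2", 100, (64501, 64503, 64502)),
    Path("203.0.113.0/24", "provider1", 100, (64500, 64503, 64504)),
    Path("203.0.113.0/24", "provider2", 100, (64501, 64504)),
]

# With equal local preference, the shorter AS path wins for each prefix,
# so traffic is split across both providers.
fib = build_forwarding_table(paths)

# A crude form of "traffic engineering": raise local preference on everything
# learned from provider2, which steers all outbound traffic to that provider.
steered = [
    replace(p, local_pref=200) if p.provider == "provider2" else p
    for p in paths
]
fib_steered = build_forwarding_table(steered)
```

In this toy table, `fib` sends one prefix out each provider, while `fib_steered` sends both out provider #2. On real equipment you would express the same idea with route policies rather than code, but the decision logic is similar in spirit.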
Another technical consideration is that you will now need your own block of IP addresses. If you’ve been with provider #1 for years, and you’re using their (public) IP addresses, they are most likely not going to let you advertise those addresses into provider #2 as if they’re your own. You’ll also need your own Autonomous System Number (ASN). These can be obtained from the Regional Internet Registry (RIR) for your region, or from a national registry where one exists. I can’t stress this enough: start the process of obtaining these early. IPv4 address space is in very short supply and there may be substantial waiting lists even if you’re buying from a broker.
Even if you’re operating on very few public IP addresses today, you will not be able to function as an Autonomous System on the Internet with anything less than a /24 (256 addresses), so plan accordingly and acquire a contiguous block of IPs that size or larger. Obtaining your own ASN and IP block is also crucial to making yourself mobile and independent from either provider. If you use your own ASN and IP addresses, you can cancel contracts with your current providers and pick up contracts with new ones when more attractive pricing comes along. Finally, you’ll need to take into account your hardware requirements.
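Before turning to hardware, the /24 floor mentioned above is easy to sanity-check when you’re evaluating a block offered by a registry or broker. The sketch below uses Python’s standard `ipaddress` module; the function name `is_announceable_ipv4` and the sample blocks (drawn from documentation ranges) are my own placeholders, and the check covers only the size rule described here, not any provider-specific filtering policy.

```python
import ipaddress


def is_announceable_ipv4(block: str) -> bool:
    """Return True if the block is an IPv4 /24 or larger (prefix length <= 24)."""
    # strict=True rejects blocks with host bits set, e.g. "192.0.2.1/24".
    network = ipaddress.ip_network(block, strict=True)
    return network.version == 4 and network.prefixlen <= 24


# Placeholder blocks for illustration only.
for block in ("192.0.2.0/24", "192.0.2.128/25", "198.51.100.0/23"):
    network = ipaddress.ip_network(block)
    verdict = "ok" if is_announceable_ipv4(block) else "too small"
    print(f"{block}: {network.num_addresses} addresses, {verdict}")
```

A /25 holds only 128 addresses and fails the check, while a /23 (512 addresses) passes comfortably.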
You’re going to need to find out if BGP is a current feature of your border router. If not, perhaps it is a licensed add-on? If you decide to take a full Internet route table from both providers, for the greatest flexibility and efficiency, this can add as much as a decimal point to the cost of the hardware. Today the full Internet route table (IPv4 and IPv6) is approximately 800,000 routes. You’ll need a device that can ingest that table twice, then do lookups on each prefix to determine which of the two paths is best, resolving the roughly 1.6 million candidate paths down to the final 800,000 or so “best paths” to put in the forwarding table. And it needs to do this VERY quickly.
Speed matters because, in the case of a provider failure, the router has to throw all of that out and recalculate it all again. You want it to recalculate those new “best paths” in seconds, not minutes, and get them installed in the forwarding table, so that traffic begins flowing over the surviving provider circuit. Then, when the failed provider comes back up, it has to do it all over again. This (more than likely) doesn’t need to be done at multiple points throughout your network, but at the very least, you’ll need it at your border router (or routers) facing the multi-homing providers. As you can see, there are a lot of moving parts to get all of this working. The knowledge and experience of your local network staff is critical to making this project happen. If you’re afraid you don’t have that experience on staff today, it might even be a good time to consider an enterprise provider or partner who can help make this transition a successful one.
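To give a feel for the bookkeeping a border router does when a provider fails, here is a scaled-down sketch. It assumes a synthetic table of 1,000 prefixes instead of 800,000, made-up AS-path lengths, and a selection rule based only on AS-path length; the names `synthetic_feed` and `select_best` are hypothetical. It says nothing about how fast real hardware reconverges, only how much of the forwarding table has to be rewritten.

```python
import ipaddress

PREFIX_COUNT = 1000  # stand-in for the ~800,000 routes of a full table


def synthetic_feed(count, offset):
    """Build a fake full-table feed: prefix -> AS-path length (made-up values)."""
    base = int(ipaddress.IPv4Address("10.0.0.0"))
    return {
        f"{ipaddress.IPv4Address(base + i * 256)}/24": 2 + (i + offset) % 3
        for i in range(count)
    }


def select_best(rib):
    """Per prefix, pick the provider with the shortest AS path (ties: by name)."""
    best = {}
    for provider, feed in rib.items():
        for prefix, path_len in feed.items():
            candidate = (path_len, provider)
            if prefix not in best or candidate < best[prefix]:
                best[prefix] = candidate
    return {prefix: provider for prefix, (_, provider) in best.items()}


# Both providers send the same prefixes, so the router holds two paths for each.
rib = {
    "provider1": synthetic_feed(PREFIX_COUNT, offset=0),
    "provider2": synthetic_feed(PREFIX_COUNT, offset=1),
}
candidate_paths = sum(len(feed) for feed in rib.values())
fib_before = select_best(rib)

# Provider #1 fails: every path learned from it is withdrawn, and the
# best path for every prefix has to be recomputed from what is left.
surviving_rib = {name: feed for name, feed in rib.items() if name != "provider1"}
fib_after = select_best(surviving_rib)

# Entries whose next hop changed are the ones that must be reinstalled.
changed = sum(1 for prefix in fib_before if fib_before[prefix] != fib_after[prefix])
print(f"{candidate_paths} candidate paths, {len(fib_before)} best paths, "
      f"{changed} entries rewritten after the failure")
```

Raising `PREFIX_COUNT` toward full-table size gives a rough sense of why the control plane of your border router matters, though a Python loop is in no way a benchmark for router hardware.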
With that said, BGP multi-homing is a powerful step toward making your enterprise disaster-proof (or at least disaster-resistant) today and into the future.