Network outage followup

This is a followup to last night’s post about a network outage.

The root cause of the problem was the failure of an Ethernet switch at our data center. The switch was the one that our network cables actually plug into to connect to the Internet. Unfortunately, it’s one of the few pieces of the network infrastructure that’s not automatically redundant: although the “other side” of the switch is connected to multiple fully redundant upstream paths to the Internet, the side of it that goes to our server cabinets effectively has a single connection for each group of servers.

When the switch failed, the data center staff replaced it with a spare. Because the faulty hardware has been completely replaced, the problem is properly resolved and won’t recur as an ongoing issue.

Just so it’s clear, we own and operate all our own servers, but we house these servers in a professional data center that provides uninterruptible electrical power, cooling, and extremely fast network connectivity. The data center has engineers on site 24 hours a day, 365 days a year to handle this kind of issue, and they started working on it the minute the outage started. That said, we’re disappointed that the problem wasn’t resolved sooner.

We’ve used the same data center for many years, and the small number of problems we’ve experienced have usually been taken care of very quickly. We expect that generally good performance to continue, but we will take appropriate remedial action if it does not.

Again, we apologize to our customers for the inconvenience caused by this outage.

1 Comment

  1. Thanks for the follow-up on this issue. It’s important to understand what happened. Hopefully the data center has a plan in place to reduce the length of the outage if this reoccurs. Ideally there’d be a redundant switch to a separate pipe, but at the very least, having processes to more quickly swap the hardware would be a good thing.