Network outage followup
This is a followup to last night’s post about a network outage.
The root cause of the problem was the failure of an Ethernet switch at our data center. The switch was the one that our network cables actually plug into to connect to the Internet. Unfortunately, it’s one of the few pieces of the network infrastructure that’s not automatically redundant: although the “other side” of the switch is connected to multiple fully redundant upstream paths to the Internet, the side of it that goes to our server cabinets effectively has a single connection for each group of servers.
When the switch failed, the data center staff replaced it with a new spare one. Because the faulty hardware was completely replaced, the problem is properly solved, and this won’t be an ongoing problem.
Just so it’s clear, we own and operate all our own servers, but we house these servers in a professional data center that provides uninterruptible electrical power, cooling, and extremely fast network connectivity. The data center has engineers on site 24 hours a day, 365 days a year to handle this kind of issue, and they started working on it the minute the outage started. That said, we’re disappointed that the problem wasn’t resolved sooner.
We’ve used the same data center for many years, and the small number of problems we’ve experienced have usually been taken care of very quickly. We expect that generally good performance to continue, but we will take appropriate remedial action if it does not.
Again, we apologize to our customers for the inconvenience caused by this outage.