Slow connections for some customers (resolved)

Beginning midday yesterday (Sunday, February 6th), we received reports from a small number of customers that connections to our data center were slow. We traced this to a combination router/Ethernet switch in one of our data center cabinets that was corrupting some packets of data. The router has been replaced and the problem is resolved.

We initially believed that this was a problem within the Charter cable network, because the reports were only from Charter customers. However, it turns out that this was a red herring: the router was simply corrupting packets from a small number of IP addresses, and the addresses affected happened to be Charter customers for the most part.

Even though most customers were unaffected, we sincerely apologize to those who were. This took us a full day to resolve, which is far too long. Unfortunately, the problem did not show up on the many test IP addresses we used, so tracking it down was difficult.

(For network geeks: the symptom was that for a small number of IP addresses, the router was sometimes clearing the 247th, 279th, or 311th bit of a packet. The damaged packet then fails TCP checksum verification, so our servers discard it instead of acknowledging it, and the resulting retransmissions lead to extremely poor throughput. Our guess is that the router sorts packets by source IP address into various memory locations, and that a memory chip had actually failed, with one bit failing to remain set. The router is now in the “electronics to be recycled” bucket: we don’t even try to fix things like this.)
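For the same audience, here is a minimal Python sketch of why a single cleared bit is enough to make a receiver reject a packet. It computes the 16-bit one's-complement checksum described in RFC 1071 over a made-up block of bytes, clears one bit, and computes it again. Several things here are assumptions for illustration only: the 64-byte all-ones payload is hypothetical, bit numbering is taken to start at 1 with the most significant bit of the first byte, and a real TCP checksum also covers a pseudo-header built from the IP addresses, which this sketch leaves out.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum in the style of RFC 1071."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]  # add each big-endian 16-bit word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back in
    return ~total & 0xFFFF


def clear_bit(data: bytes, bit_number: int) -> bytes:
    """Return a copy of data with one bit cleared.

    Assumed numbering: bit 1 is the most significant bit of byte 0.
    """
    index, offset = divmod(bit_number - 1, 8)
    damaged = bytearray(data)
    damaged[index] &= ~(0x80 >> offset) & 0xFF
    return bytes(damaged)


# Hypothetical payload: every bit set, so clearing any bit is a real change.
segment = bytes([0xFF]) * 64
original = internet_checksum(segment)

for bit in (247, 279, 311):
    damaged = clear_bit(segment, bit)
    print(bit, hex(original), hex(internet_checksum(damaged)))
```

Because the checksum carried in the packet was computed by the sender before the bit was cleared, the value the receiver computes no longer matches it, and the packet is dropped rather than acknowledged.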

To top things off, when the router was replaced today, some of the servers on our network were unavailable for up to three minutes beginning at 5:29 PM Pacific time. That happened because an Ethernet switch in another cabinet refused to “talk to” the new router until the switch was also restarted.

These issues are all resolved, and because the hardware causing the problem has been replaced, we do not expect them to recur.