External monitoring system problem (resolved)

At 7:59 AM Pacific time on September 13, we tweeted this:

However, this was a false alarm. The problem was in our independent external monitoring system, not a real problem with any of our servers or network.

This happened because one of our two external monitoring systems runs on Amazon EC2, completely independent of our own systems and network. This is intentional: it ensures that an outage on our end can’t also take the monitoring system offline.

This obviously raises a question, though. Since Amazon EC2 will occasionally have its own networking failures, how can the monitoring system tell the difference between an Amazon failure that causes our servers to be unreachable and a failure on our end that causes the same thing?

The answer is that if the monitoring system thinks our servers are unreachable, it checks to see if it can reach other popular sites, including Google and Yahoo. If those succeed but our servers don’t, it assumes the problem is on our end.
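A minimal sketch of that kind of check, in Python, might look like the following. It is an illustration only, not the monitoring system's real code: the function names, the host names, the use of a plain TCP connection on port 80, and the rule that a single unreachable server counts as a failure are all assumptions.

```python
import socket
from typing import Callable, Sequence

# Hypothetical reference hosts; the post only names "Google and Yahoo".
REFERENCE_SITES = ("www.google.com", "www.yahoo.com")


def can_connect(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def classify_outage(
    our_servers: Sequence[str],
    reference_sites: Sequence[str] = REFERENCE_SITES,
    reachable: Callable[[str], bool] = can_connect,
) -> str:
    """Classify what the monitor sees as 'ok', 'alarm', or 'monitor_network_problem'."""
    # Assumption: any single unreachable server counts as a failure.
    if all(reachable(host) for host in our_servers):
        return "ok"
    # Our servers look down. If the reference sites answer, blame our end.
    if all(reachable(host) for host in reference_sites):
        return "alarm"
    # The reference sites are unreachable too, so the monitor's own network is suspect.
    return "monitor_network_problem"
```

Passing `reachable` as a parameter keeps the decision logic separate from the network calls, so the three outcomes can be exercised without touching the real network.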

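Unfortunately, this morning Amazon EC2 is experiencing a new kind of network failure that makes about half the Internet unreachable, while still allowing connections to Google and Yahoo. Because those checks succeeded while our servers appeared unreachable, the monitoring system wrongly concluded the problem was on our end and generated a false alarm.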

The monitoring system also has a final “kill switch” that allows our staff to override false alarms before they end up as actual tweets… but the partial Amazon networking failure also prevented that from working. “It’s always something!”
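One plausible way a kill switch can fail like this is sketched below in Python. It is a hypothetical illustration, not a description of the real system: the function name `should_publish`, the `fetch_override` callback, and the choice to publish when the override cannot be checked are all assumptions.

```python
from typing import Callable


def should_publish(fetch_override: Callable[[], bool]) -> bool:
    """Decide whether a pending alarm should go out as a tweet.

    fetch_override is a hypothetical callback that returns True if staff
    have cancelled the alarm and raises OSError if the override cannot be
    checked at all (for example, because of a network failure).
    """
    try:
        overridden = fetch_override()
    except OSError:
        # Assumed behavior: an unreachable kill switch does not block the
        # alarm, so a network failure on the monitor's side lets it through.
        return True
    return not overridden
```

Under these assumptions, a staff override stops the tweet only if the monitoring system can actually read it; when the check itself fails, the alarm is published anyway.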

Our apologies for any worry the tweet caused. We’re obviously working on making the system a little more resilient.