Service outage May 6, 2011 (resolved)

May 6, 4:43 AM Pacific time: An outage at our primary data center caused a complete service interruption for all customers.

Update 5:08 AM: All services have been restored and are working normally.

Update 9:30 AM: All services continue to run normally. We are contacting the data center for further follow-up on today’s problem.

Update 11:05 AM: The initial report from our data center is that they lost power from PG&E, which damaged some of their own power equipment. Of course, the data center’s equipment shouldn’t get damaged by the loss of utility power (or anything else!), so we’re trying to get further details.

Update 5:24 PM: A detailed report from our data center indicates that they lost power just before 3:00 AM Pacific time due to a problem with the utility company, PG&E. This should not be a problem, because the data center has redundant uninterruptible power supplies (UPS systems) that can power the whole data center for several minutes during a power interruption, and redundant diesel generators that can power the UPS indefinitely.

In this case, the UPS system took over and a generator started up properly. However, an “automatic power transfer switch” that changes the UPS power source from the utility to the generator failed to operate. This meant the generator power did not reach the UPS systems, which failed soon after. As a result, we lost power to our server cabinets until 4:40 AM Pacific time.

The automatic transfer switch is being replaced by the manufacturer, which should prevent a recurrence.

What are we doing about it?

We’re extremely unhappy that our primary data center has suffered two power outages in the last six months. In the earlier incident, a lightning strike damaged the UPS systems on November 20, causing short outages that day and two days later while the UPS was still being repaired. (The UPS systems were completely overhauled then, and today’s problem was caused by a different part of the power system.)

This kind of outage should never happen. The point of using a data center is that the facility has the infrastructure to guarantee power and cooling. Any power interruption is a fundamental failure of what they do.

That said, we don’t necessarily believe this situation would be improved by using a different data center. In fact, last year our secondary data center (which houses our blog, phone system, offsite backups, and some redundant infrastructure) had a similar outage caused by an automatic transfer switch when the utility power failed. Power outages have affected several other prominent Bay Area data centers in the last few years, too.

Regardless of that, we’re painfully aware that our customers received extremely poor service from us today. We want to emphasize that this is completely unacceptable to us, just as it is to you. We intend to deliver 100% reliability at all times, and it pains us when we don’t live up to that. We can only offer a sincere apology for this incident.

(One other thing: Due to a mistake on our part, we didn’t post prompt Twitter and blog status updates when the problem started. We’ve improved our internal procedures to do a better job of that, too.)

2 Comments

  1. Thanks for the information guys, I know crap “happens” but the no-communicado via twitter or the blog is what had me concerned most. I was working on my site when things were kablooey so I was hovering, waiting from the moment the service went down to when it went up again. I was thisclose to pulling out your phone number to call because when hosting goes down, the first question is: do they know things are shot? I’m happy to see you acknowledge the communication could have been handled better.

    Still very happy with Tigertech, you guys know your stuff like no one else and are very, very helpful when support is needed.

  2. By the way, we’ve since updated our systems to make it easier for us to quickly tweet (and blog) status updates, so we can keep our customers informed whenever there’s an issue that we’re working on.