web11 server hardware failure October 23, 2013 (resolved)

Between 11:59 AM and 1:13 PM Pacific time on October 23, 2013, there was an outage on the “web11” server due to a hardware problem. Other servers were not affected.

The hardware has been replaced and the server is running normally again. During the outage, incoming email was queued for delivery. All incoming email has now been delivered to the appropriate mailboxes. No email was lost.
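As a rough illustration of why queued mail is not lost in an outage like this, here is a simplified sketch of two-step mail handling: a mail server first accepts each message, then delivers it to the mailbox only once that mailbox's Web server is reachable. This is a toy model under stated assumptions, not our actual mail system. The `MailQueue` class, its method names, and the `server_up` flag are all hypothetical.

```python
from collections import deque


class MailQueue:
    """Toy store-and-forward queue (hypothetical, for illustration only)."""

    def __init__(self):
        self.pending = deque()  # messages accepted but not yet delivered
        self.mailboxes = {}     # mailbox name -> list of delivered messages

    def accept(self, mailbox, message):
        # Step 1: the mail server accepts the message, so the sender
        # sees no error or bounce even if the Web server is down.
        self.pending.append((mailbox, message))

    def deliver_pending(self, server_up):
        # Step 2: deliver to the mailbox, but only if its Web server is
        # reachable. Returns the number of messages delivered.
        if not server_up:
            return 0  # messages stay queued; nothing is dropped
        delivered = 0
        while self.pending:
            mailbox, message = self.pending.popleft()
            self.mailboxes.setdefault(mailbox, []).append(message)
            delivered += 1
        return delivered


queue = MailQueue()
queue.accept("max", "first message")
queue.accept("max", "second message")
queue.deliver_pending(server_up=False)  # during the outage: mail waits in the queue
queue.deliver_pending(server_up=True)   # after recovery: the queue drains in order
```

In this sketch the only effect of an outage is that messages wait in `pending` longer. They are delivered in their original order once the server is available again.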

We consider this type of downtime to be unacceptable. Our servers have redundant power supplies, hard drive RAID arrays, and networking to avoid problems wherever possible. But in this case, the problem was a failure of the server’s motherboard components, causing it to repeatedly “lock up”. Solving this required physically moving the RAID array containing the server’s data to identical hot spare hardware. (All our Web servers use identical hardware.)

We sincerely apologize to our customers affected by this outage.

8 Comments

  1. Ok, guys it is now 12:31 — an update every half hour or so would help us keep the faith. Would have expected a solution more quickly. Jay

  2. We actually did update at 12:23, but forgot to put a time stamp on the update. We have updated the post with the timestamp. We’ll continue to update it frequently — should have another update very soon.

  3. Are we going to be missing any emails? Will clients be getting a bounce back, or are the messages just delayed?

  4. Unfortunately this is the day my client decided to promote her new website. Hope this doesn’t take long. I brag so much about your dependability. But I know things happen.

  5. Whoooo hoooo! We’re back up. Thanks guys for your speedy work. High fives.

  6. Very strange… this happened on another website I manage (domain through Network Solutions) (*boo!), and their DNS server was down. (“Down” so much so that Network Solutions in their ENTIRETY was down, e.g. networksolutions.com.) Yeah, that big. I’m surprised it’s not in the news more today.

    So anyway don’t feel bad. This is the first time I’ve seen this happen to TigerTech in my several years as a customer, and–if it helps at all–at least you’re not Network Solutions where you’re sweating over 6.6 million websites.

    Regardless of the crisis, everything is manageable! Keep going strong TigerTech!

    -Beau Ch.

  7. Max: The server is back up.

    To answer your question: incoming email is handled in two steps by our system. The first step is that one of our mail servers accepts the incoming message. This continued to work fine, so senders didn’t see any problems or bounces. The second step is that our mail server delivers the message to your mailbox on your particular Web server. This step wasn’t working while your Web server (web11) was down. (Likewise, you can’t connect to your mailbox to read mail if your Web server is down.) So, incoming email will always be accepted, but not delivered to your mailbox until your Web server comes back up.

    Now that web11 is back up, incoming mail is being delivered to mailboxes on that server. You didn’t lose any email; there was just a delay in delivering it to your mailbox.

  8. Fellene and Beau: thanks so much for the kind words! We hate to have anything like this happen, obviously. It’s especially poor timing when a customer has just been raving about us, or someone is expecting a big jump in traffic to their site. We’re glad our recovery plans worked as intended, but still regret the downtime. We know that people depend on these services.

    For the record, we put all of our service problems on this blog, so you can see that this type of problem is thankfully pretty rare. We’ll obviously investigate this in more detail to try to figure out exactly what went wrong with the hardware. (And this happened even with redundant power supplies, hard drives, networking, etc.)

    Things get pretty stressful around here when something like this happens (obviously) — at times like this it means a lot to get nice comments like this from our customers.