Brief scheduled maintenance on “fry” and “bender” servers (completed)

The “fry” and “bender” Web servers will be restarted between 11:00 and 11:15 PM Pacific time tonight (Friday, April 29, 2011). This will cause a five-minute interruption of Web and e-mail service for customers on those servers.

Other servers will not be affected, and incoming mail will only be delayed, not lost.


Problem with “fry” server (resolved)

8:52 PM Pacific time: We’re investigating a problem with the “fry” hosting server that’s requiring us to restart it; further details in a few minutes.

Update 9:42 PM Pacific time: The “fry” server was restarted, but a technician will be doing some maintenance on the server for approximately an hour. This will require a reboot, meaning the server will be unavailable for approximately 5 to 10 minutes. Web service will be unavailable during that time. E-mail service on that server will also be unavailable; delivery of new incoming mail will be suspended during that time and will resume when the server comes back. No e-mail will be lost.

All other servers are unaffected.

Update 10:50 PM Pacific time: The “fry” web server will be rebooted in about 10 minutes, at approximately 11:00 PM Pacific time.

Update 11:10 PM Pacific time: The “fry” web server was successfully rebooted as planned. There may be more maintenance on the server this weekend; watch our blog or follow us on Twitter for updates.

Network issues April 10, 2011

Our primary data center experienced network routing problems between 2:06 PM and 2:49 PM Pacific time today (April 10, 2011).

During this time, connections from some (but not all) places on the Internet were unreliable. The data center technicians have resolved the issue, and all services are now working normally.

We don’t consider this normal or acceptable, and we sincerely apologize for the inconvenience this caused. (We do not yet have a full explanation from the data center about the root cause, but have requested one so that we can be sure it won’t recur.)

Brief MySQL load problems (resolved)

We had a couple of instances of MySQL queries overloading the bender server today. The first one happened at about 3:41 AM (Pacific time) and the second one happened at about 7:48 AM. Each occurrence lasted about 20 minutes. The problem each time was that a database was running extremely inefficient queries. Each time we fixed the problem by creating indexes so that the queries could then run in a fraction of the time previously required.

We apologize for any inconvenience caused by this problem. Visitors to your Web site (on the bender server) might have seen reduced performance (or, in rare cases, 503 errors). E-mail was not affected. We don’t consider this type of problem to be acceptable. These problems should not recur since the indexes have been created.
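For readers curious about why an index makes such a difference, here is a minimal sketch. It is not the customer's database or our actual fix: it uses Python's built-in SQLite module rather than MySQL, and the `orders` table, its columns, and the index name are all hypothetical. It asks the database how it plans to run the same query before and after an index is created.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the plan text SQLite reports for a query (last column of each row)."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

# Hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 1000, i * 0.5) for i in range(50_000)],
)

sql = "SELECT total FROM orders WHERE customer_id = ?"

# Without an index, the database has to read every row to find matches.
plan_before = query_plan(conn, sql, (42,))

# With an index on the filtered column, it can jump straight to the matches.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan reports a full scan of the table and the second reports a search using the new index. MySQL has its own `EXPLAIN` and `CREATE INDEX` statements that serve the same purpose, although the output looks different.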

Brief scheduled maintenance on elzar server (completed)

The “elzar” Web server will be restarted at 10 PM Pacific time tonight (February 25). This will cause a five-minute interruption of Web and e-mail service for customers on that server.

Other servers will not be affected, and incoming mail will only be delayed, not lost.

This restart is necessary to fix a memory problem. We apologize for the inconvenience.

Update 10:03 PM: The maintenance was completed with less than 3 minutes of downtime.

Network issues January 3, 2011 (resolved, updated)

Between 3:29 PM Pacific time and 3:33 PM Pacific time, our monitoring systems detected that most Internet users could not connect to our primary data center. E-mail delivery was properly queued up and delayed during this period.

We will follow up with the data center team, but the problem appears to have been resolved, and all services are operating normally. We’re continuing to monitor it closely, and we sincerely apologize for the inconvenience this caused our customers.

Updated: connectivity was lost for four minutes because the data center was fighting off a severe DoS attack.

AOL e-mail outage December 21 (resolved)

AOL.com had an outage lasting about 3 hours last night (from 11:24 PM Pacific time December 20 to 2:28 AM Pacific time December 21). This problem, a failure of AOL’s DNS servers, affected many people sending e-mail to AOL and wasn’t related to our service (see this report and this one).

However, if you sent mail to an aol.com address during this time, your messages probably “bounced” with an error saying “Host or domain name not found. Name service error for name=aol.com”. If so, you should try sending the message again, and it will work normally. As always, we’ll continue to monitor AOL deliveries closely.

Network issues December 12, 2010 (resolved)

Between 2:35 PM Pacific time and 3:03 PM Pacific time, our monitoring systems detected that connections to our primary data center from some locations on the Internet were slow or failing due to problems at an Internet “backbone”. Connections from other locations were unaffected.

We’re waiting for a full report from the data center team, but the problem appears to have been resolved, and all services are operating normally. We’re continuing to monitor it closely, and we sincerely apologize for the inconvenience this caused our customers.

Service outage Nov. 23, 2010 (resolved, updated)

Our primary data center had another power interruption this morning at 7:28 AM (Pacific time). All of our servers lost power and then had it restored, which rebooted them. All customer Web sites were unavailable during this time. Incoming e-mail was simply delayed during the downtime, not lost. When the servers came back online, e-mail may have seemed sluggish to some customers for a while, but this should also be fixed now.

This incident follows another power incident the previous Saturday night. We are working with the data center to get more details, including an estimate of when they will have replaced any faulty equipment. We will update this post as more information becomes available.

Update Nov. 29: The final data center report is that on the night of November 20, lightning strikes damaged both of the redundant UPS systems, interrupting data center power for a few seconds. The UPS manufacturer scheduled replacements for November 23, but another PG&E utility power interruption lasting a few seconds occurred that morning before the replacement work was finished. The UPS manufacturer has since replaced all damaged parts, restoring full redundancy. In addition, the UPS manufacturer has overhauled each unit, replacing and upgrading other parts to increase robustness. We take this very seriously, since it’s at the core of what we do, and we will continue to work with the data center to ensure that their infrastructure meets our high standards.

Service outage Nov. 20, 2010 (resolved)

A major power failure at our primary data center in Fremont, California, caused a complete outage for nearly all services beginning at 8:32 PM Pacific time Saturday night. It lasted between six and 13 minutes, depending on the server. Only our blog and redundant DNS infrastructure were unaffected.

All services are now fully operational; please don’t hesitate to contact us if you have any questions. We sincerely apologize for the inconvenience this caused our customers.
