Brief maintenance on calculon server (completed)

The “calculon” Web server will be restarted at 9 PM Pacific time tonight (July 5). This will cause a five-minute interruption of Web and e-mail service for customers on that server.

Other servers will not be affected, and incoming mail will only be delayed, not lost.


Brief maintenance on Calculon server (completed)

The “calculon” Web server will be restarted at 11 PM Pacific time tonight (February 19). This will cause a five-minute interruption of Web and e-mail service for customers on that server.

Other servers will not be affected, and incoming mail will only be delayed, not lost.

We apologize for the problem and for the short notice: the restart is necessary to replace a disk in the RAID array.

Update 11:03 PM Pacific time: The restart was completed with less than 3 minutes of “downtime”.

Brief scheduled maintenance Saturday, May 2 (completed)

At approximately 11:00 PM Pacific time this Saturday, May 2, the “bender”, “calculon”, “lrrr” and “hypnotoad” servers will be restarted. As a result, Web site and e-mail service for customers on those servers will be unavailable for approximately five minutes.


Problem affecting two servers (resolved)

We posted earlier about a problem affecting the elzar Web server. While we were investigating the cause of that, the same thing happened on another Web server, “calculon”, causing a separate outage for customers on that server from 2:34 PM to 2:43 PM Pacific time this afternoon.

During this period, Web sites on that server were unavailable and incoming e-mail was delayed. (The Web server was slow for about six minutes after it was restarted, too.)

On both servers, high disk and memory usage caused the load to skyrocket to the point where they effectively stopped responding.

The good news is that we have narrowed down the cause, so it shouldn’t happen again. A bug in one of our maintenance programs that runs on each server was almost certainly responsible. The bug has been fixed.

We sincerely apologize for this issue, and regret the inconvenience it caused for customers hosted on these servers. Other servers were not affected.

Brief scheduled maintenance Saturday, January 31 (completed)

At approximately 11:00 PM Pacific time on Saturday, January 31, all of our Web hosting servers (except the “hypnotoad” and “mom” servers) will be restarted. As a result, Web site and e-mail service for some customers will be unavailable for approximately five minutes.

No e-mail will be lost, of course; incoming mail will just be delayed for a few minutes.

We apologize for any inconvenience this may cause. This maintenance is necessary to install an updated “kernel” on our servers, as described in an earlier maintenance announcement.

Update: the maintenance was successfully completed on all servers with less than 5 minutes of “downtime”.

Brief power interruption for some servers (resolved)

This morning at 12:11 AM (Pacific time), one of the cabinets at our data center tripped a circuit breaker, causing all of the servers in that cabinet to lose power. Power was restored at 12:18 AM.

Customer Web sites and e-mail on the bender, calculon, lrrr, and zapp Web servers were unavailable during this seven-minute period. The ability to send and receive e-mail was also interrupted (no mail was lost, of course).

We are investigating the root cause of this problem to prevent it from happening again.

Brief scheduled maintenance for calculon server (completed)

At approximately 11:00 PM Pacific time tonight (October 18), the “calculon” Web server will be restarted. As a result, Web sites and e-mail service for customers using that server will be unavailable for approximately five minutes.


Calculon server temporarily unavailable (resolved)

The “calculon” Web server was unavailable between approximately 5:00 and 5:08 Pacific time this afternoon. This resulted in an interruption of service for Web sites on that server. (Some e-mail activity was delayed, but no e-mail was lost.)

We sincerely apologize for this problem! We consider this type of failure to be unacceptable, and are looking into the cause of the problem so that we can take the appropriate steps to prevent it from happening again.

Calculon server problem (resolved)

The “calculon” Web server needed to be restarted at 12:40 AM Pacific time this morning due to extremely high load.

However, the server did not come back online immediately: it ran a time-consuming file system check (“fsck”) during startup, which interrupted Web service and delayed mail delivery for customers on that server (other servers were not affected).

The server finished its fsck at 3:45 AM and is now working normally.

This is by far the longest outage we’ve experienced on a server in several years. I want to personally apologize to every affected customer: we don’t consider this kind of problem acceptable at all, and we deeply regret the downtime. We’ll be carefully reviewing this incident to see what we can learn from it.

Calculon server restarted (resolved)

The “calculon” Web server needed to be restarted at 1:36 Pacific time today, resulting in a five-minute interruption of service for Web sites and e-mail on that server.
