Calculon server problem (resolved)

The “calculon” Web server needed to be restarted at 12:40 AM Pacific time this morning due to extremely high load.

However, the server did not come back online immediately, because it ran a time-consuming disk file system check (“fsck”) after the restart. This interrupted Web service and delayed mail delivery for customers on that server (other servers were not affected).

The server finished its fsck at 3:45 AM and is now working normally.

This is by far the longest outage we’ve experienced on a server in several years. I want to personally apologize to every affected customer: we don’t consider this kind of problem acceptable at all, and we deeply regret the downtime. We’ll be carefully reviewing this incident to see what we can learn from it.