Due to a failure of the power distribution unit (essentially a fancy power strip) in one of the cabinets at our data center, the following services became unavailable at 05:52 AM Pacific time:
(Other Web servers are not affected.) A data center technician is replacing the power unit in that cabinet, and all systems should be back online within 15 minutes; we’ll update this post when that happens.
Update: The faulty hardware has been completely replaced. All servers are back online and functioning normally, and all queued e-mail has been delivered and is available for retrieval. The total outage for these servers was from 05:52 AM to 06:15 AM (Pacific time).
In addition, the FTP service on the “zapp” server was not fully working after it was restarted, so FTP publishing on that server was unavailable until shortly after 7:00 AM. This has been corrected, and the underlying problem that could cause the incorrect startup has been fixed.
We sincerely apologize to customers affected by this outage. This kind of issue has happened to us only once before in the last seven years (and that was with a different brand of power unit). Since the replacement power unit is brand new, we don’t expect the problem to recur.
Our Web-based MySQL interface, phpMyAdmin, has been updated to version 2.10.2. This version includes some security and general bug fixes. Customers should not notice any major changes.
The “elzar” Web server stopped responding a few minutes ago under a heavy load on the MySQL database server, and had to be restarted. This resulted in an interruption of service for Web sites on that server.
We apologize for this problem; we’ll be investigating the issue further and monitoring the server closely to make sure it doesn’t recur.
Update 10:00 PM: The NFS network connection between ftp.tigertech.net and elzar wasn’t working properly even after the Web server was restarted, causing additional problems for customers publishing files. This problem has also been corrected.
Tonight at 11 PM Pacific time (2 AM Eastern time May 10) we’ll be performing brief scheduled maintenance on our mail servers. (We’ll be adding more RAM and disk space to make sure that our mail servers continue to keep up with the growth in our service.) This requires restarting the servers, so you will be unable to connect to our mail servers for about five minutes. No mail will be lost, of course; it will be queued and available after the maintenance.
We apologize for the inconvenience this causes. We schedule this kind of maintenance for late Saturday night/early Sunday morning (the least busy time of the week) to minimize the impact.
We’ve installed several security updates recently. We’ve updated PHP 4, PHP 5, the ClamAV antivirus scanner, and some XFree86 libraries. In addition, we’ve updated our own blog to use WordPress 2.2 — if you use WordPress, make sure you’ve done the same.
We’ve updated the default version of Ruby on Rails on our servers to version 1.2.3.
A couple of times in the last week, one of our MySQL database servers has had an unusually high number of connections. That’s a serious issue: if there are too many connections to a MySQL server, customer scripts won’t be able to connect to a database. We’ve spent some time looking at the cause and fixing it.
One of the features of our service is the industrial-strength Mailman mailing list manager. Mailman is a very good program in some ways (it’s built like a tank and reliably handles very large volumes of list mail, and it removes much of the drudgery of managing large lists), but it has a couple of undesirable “features”.
The most obvious is that the interface is terribly ugly (the Mailman developers are working on a big improvement to this, thankfully; just so it’s clear, we didn’t create the program, and we’re as horrified by the circa-1996 appearance as everyone else). Another problem with the program, though, is the option for “monthly password reminders”. This is a design flaw that’s being removed from Mailman, and although most of the lists on our servers don’t use password reminders, customers who do should probably turn them off now in preparation for that change.
The “farnsworth” Web server locked up and needed restarting again last night at about 9:04 PM Pacific time, causing another short outage for some customers. (A similar problem happened Monday night.) To make sure this doesn’t happen again, we’ll be replacing the entire server (switching it with a spare server) at about 11 PM (Pacific) tonight, which will result in about 5 minutes of downtime.
We’re also taking this opportunity to upgrade the hardware on one of our mail servers to allow for future growth; customers (even those with accounts on other Web servers besides the farnsworth server) may see a short (approximately 5 minute) interruption in their ability to retrieve e-mail between 11 PM and midnight.
We apologize for any inconvenience this causes — as always, we’re committed to the highest possible levels of reliability.
Some of the posts on our blog mention specific servers. You’ll occasionally see things like “The web14 server will be rebooted at 11 PM”, “mail sent from the web01 server was delayed”, or “more memory has been added to the web10 server”. Your question, quite naturally, is “How do I know if they’re talking about the server that has my account?”