Followup to January 15, 2010 problem

Posted January 20th, 2010 in System Status, Tech Corner.

On Friday, a problem made our “My Account” control panel system unavailable for about three hours, and caused some other problems as well. We promised we’d follow up with more details.

First of all, the cause: our primary internal PostgreSQL database server (which we use to manage all customer data) failed completely because a program on the server tried to delete every file on the disk.

The way this happened was surprising. A long time ago, our database server had a certain Debian GNU/Linux software “package” installed that — somehow — erroneously created a user with a “home directory” set to the top level of the entire disk (“/”). This caused no harm for almost two years… until we removed the obsolete, unused package while cleaning up odds and ends. The automatic package uninstallation script then began to delete all the files in the “/” directory. The script deleted several important files, including some necessary to “boot” that server, before we stopped it.

When the worst happens, we have a plan to recover from database failures:

If the server hardware somehow fails (despite having redundant hard drives, power supplies, etc.), we can move the data disks to a spare server. This can be done in a few minutes.
If the database files are corrupted or unavailable, we can restore from the correct onsite or offsite hourly backup, re-apply changes made since that backup (either manually or through database “journal logs”), and make the database available again. This takes longer and is more tedious and error-prone.

In this case, we knew that some of the files on the server had been deleted, but it wasn’t clear which ones. That caused a significant delay while we contemplated whether it was safe to use the files of the existing database, or whether we needed to restore the database from backups. We eventually determined that it was safe to use the existing data, and restored things to a working state on a spare server.

Unfortunately, it took longer than we’d like to make sure all our systems that rely on databases were fully functional. Several other things also went wrong that we’d like to have avoided, even with “My Account” temporarily unavailable:

The server problem also made our own www.tigertech.net Web site completely stop working for a short time.
Our blog.tigertech.net pages relied on the same database server, so we couldn’t communicate the status well.
Some people saw error messages when using the Webmail system to read certain messages. This shouldn’t happen; the Webmail system is designed to be independent of our database system.
Our “Support” pages don’t work properly when the database is unavailable. This is actually a known issue; the pages are entirely database-driven from what amounts to a custom wiki on our end. On balance, this drawback is probably worth it; it allows our staff to make changes very easily (the ease of changes keeps it up-to-date and comprehensive), and we also customize pages for each viewer depending on what accounts they have with us.

“Learning is fun!”

That’s a sarcastic comment (from Futurama’s Bender character), of course. This kind of learning is actually not fun at all. On the other hand, one of the things we’re committed to is the idea of continuous improvement (fans of management systems probably know about “Kaizen”, or 改善). Each problem is an opportunity to make things better.

Any sufficiently advanced computer system is a web of interdependent parts that humans have difficulty understanding. Occasional failures are probably inevitable. But when we have a problem, we want to learn from it.

To that end, we’re making the following changes:

We’ve added a check to ensure that no software package ever uses the top level of the disk as its home directory.
We’ve made sure that blog.tigertech.net does not rely on our primary database system at all (it’s self-contained on servers in a completely separate data center), so we can use it to communicate with customers no matter what happens. If you normally reach our blog via a link on our main Web site, you should take a moment now to create a bookmark for blog.tigertech.net that you can use later (in case you somehow can’t reach our main site).
We now have a simple way for any employee here to add a status update banner to every page on our site.
We’re documenting improvements to our procedure for restoring database backups, avoiding several little annoyances that cost us valuable minutes.
We’re fixing the software problem that causes some Webmail error messages during database failures.
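
As an illustration of the first change in the list above, here is a minimal sketch of what a home directory check could look like. It is not the check we actually run. The function name `users_with_root_home` and the `allowed` parameter are made up for this example, and it assumes the standard seven-field, colon-separated `/etc/passwd` format.

```python
import os.path


def users_with_root_home(passwd_lines, allowed=()):
    """Return account names whose home directory resolves to "/".

    passwd_lines: iterable of lines in /etc/passwd format.
    allowed: hypothetical allow-list of account names to skip.
    """
    flagged = []
    for line in passwd_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split(":")
        if len(fields) < 7:
            continue  # malformed entry; a real check might report these too
        name, home = fields[0], fields[5]
        if name in allowed:
            continue
        # normpath collapses forms like "/./" to "/", but POSIX rules let it
        # keep exactly two leading slashes, so "//" is compared separately.
        if os.path.normpath(home) in ("/", "//"):
            flagged.append(name)
    return flagged
```

On a real server this could be fed the contents of `/etc/passwd` (for example, `users_with_root_home(open("/etc/passwd"))`) from a scheduled job that raises an alert whenever the returned list is not empty.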

And long-term, we’re looking into these:

Adding the ability for our “Support” pages to work in a degraded, non-customized mode when the database is not available.
Using the “streaming replication” and/or “hot standby” features that will be available in a forthcoming version of PostgreSQL to provide better database redundancy.
Using a snapshot-able file system like Btrfs to more easily recover from processes that destroy data on a disk.
Better segregation of changing data (databases, etc.) and unchanging data (program files and the root level of disks) on our servers, putting the unchanging data on partitions that are read-only and undeletable. The runaway file deletion process would not have caused problems if the root file system on this server had been mounted in “read-only” mode.
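
To make the last idea concrete, here is a rough sketch of how a monitoring script might confirm that a file system is mounted read-only. It assumes the whitespace-separated format Linux uses in `/proc/mounts` (device, mount point, type, options, and so on). The function name `mount_is_read_only` is invented for this example, and this is not something we have deployed.

```python
def mount_is_read_only(mounts_lines, mount_point="/"):
    """Check whether mount_point is mounted read-only.

    mounts_lines: iterable of lines in /proc/mounts format.
    Returns True or False based on the last matching entry (the most
    recent mount on that path), or None if the mount point is not listed.
    """
    result = None
    for line in mounts_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # not a valid mount entry
        if fields[1] != mount_point:
            continue
        options = fields[3].split(",")
        # Match the exact "ro" option; "errors=remount-ro" does not count.
        result = "ro" in options
    return result
```

A script could call `mount_is_read_only(open("/proc/mounts"))` and warn if the answer is anything other than `True`.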

We’re always trying to improve, and we regret that our best side wasn’t on display last Friday. We apologize for the inconvenience caused to our customers.

3 Comments

on Wednesday, January 20, 2010 at 12:52 pm (Pacific) Drew wrote:

Problems happen — entropy always wins. That said, I really appreciate the thorough overview of the issue, what caused the issue and what steps you are taking to not only resolve the issue, but learn from it. This is great communication. Thank you very much!
on Thursday, January 21, 2010 at 5:46 pm (Pacific) Kiyan wrote:

Yes, thanks for the honesty. It’s really appreciated. And the thorough explanation.
on Wednesday, August 21, 2013 at 9:52 am (Pacific) Doug wrote:

I really appreciate the clarity of the explanation. I’ve always thought it was the best business practice to be completely honest about problems like this, and you guys have really done well. Thanks …