High load on some servers (resolved)

Three of our Web hosting servers (amy, flexo, and leela) experienced high load earlier today that caused some customers to see “503 errors” on their Web sites for a few minutes.

This was caused by an upgrade to the eAccelerator PHP caching system that removed all the cached files at once, which doesn’t normally happen.

The problem has been permanently resolved and will not recur.

The technical explanation is that the new eAccelerator cache files produced a sudden burst of disk writes, which made the Linux kernel decide that disk “buffer” memory was so full that all disk writes needed to happen “synchronously” (that is, each program had to wait for its data to actually reach the disk).

That caused the MySQL database to start writing temporary “filesort” data to the actual RAID array on the server, instead of just storing those files in memory (as Linux usually does for files that exist for less than a few seconds before being deleted). Some of our servers handle hundreds of MySQL queries a second, and the extra disk writing load overwhelmed the “/tmp” filesystem, slowing down MySQL dramatically.
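
For readers who want to watch this kind of pressure on a Linux machine, here is a rough Python sketch. It assumes the threshold involved is the kernel's “vm.dirty_ratio” setting, which the explanation above doesn't name, and it compares unwritten data against total RAM. The kernel's real calculation uses the memory it considers available for buffering, so treat the result as an approximation. The function names are made up for illustration.

```python
import re


def parse_meminfo_kb(text):
    """Return {field: kB} for the 'Name:   123 kB' lines of /proc/meminfo."""
    values = {}
    for line in text.splitlines():
        match = re.match(r"(\w+):\s+(\d+)\s*kB", line)
        if match:
            values[match.group(1)] = int(match.group(2))
    return values


def dirty_pressure(meminfo_text, dirty_ratio_percent):
    """Approximate fraction of the dirty-memory limit currently in use.

    Assumption: the limit is dirty_ratio percent of MemTotal. The kernel
    actually bases it on a smaller "available" figure, so this underestimates.
    """
    if dirty_ratio_percent <= 0:
        raise ValueError("dirty_ratio must be positive (it is 0 when vm.dirty_bytes is used)")
    info = parse_meminfo_kb(meminfo_text)
    # Data waiting to be written plus data being written right now.
    pending_kb = info.get("Dirty", 0) + info.get("Writeback", 0)
    limit_kb = info["MemTotal"] * dirty_ratio_percent / 100
    return pending_kb / limit_kb


def current_dirty_pressure():
    """Read the live values from /proc (Linux only)."""
    with open("/proc/meminfo") as f:
        meminfo_text = f.read()
    with open("/proc/sys/vm/dirty_ratio") as f:
        ratio = int(f.read())
    return dirty_pressure(meminfo_text, ratio)
```

Calling current_dirty_pressure() once a second during a burst of writes gives a value that climbs toward 1.0 as the kernel gets closer to forcing writers to wait.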

We’ve made three changes to prevent this from happening again:

  • We’ve modified our Debian eAccelerator package to not remove all the cached files at once during a future upgrade.
  • We’ve changed where MySQL stores temporary files. It now uses “/dev/shm” shared memory instead of “/tmp”. (Ironically, “/tmp” used to be shared memory on our servers, but we had to change it to a real disk-based filesystem because it would fill up with large amounts of data if the server wasn’t restarted for months. That past experience gives us some assurance that this MySQL change won’t cause problems, though. In fact, we’ve been testing this change on a small number of servers for some time anyway as a general performance improvement.)
  • On our servers that support it, we’re now using “AMD64/Intel 64” kernels that allow much larger disk memory buffers before the kernel switches to synchronous disk writes, avoiding the problem a different way. Some servers are already using the improved kernel (sadly, not these three servers), and all of our 64-bit-capable servers will be using it after the scheduled maintenance this coming Saturday.
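
To illustrate the idea behind the first change, here is a minimal Python sketch of removing cache files a few at a time instead of all at once. This is not the actual package code. The function name, batch size, and pause length are hypothetical, and batching with pauses is only one possible way to spread the work out.

```python
import os
import time


def remove_in_batches(cache_dir, batch_size=100, pause_seconds=1.0):
    """Delete the regular files under cache_dir, pausing after every batch.

    Returns the number of files removed. Directories are left in place.
    """
    removed = 0
    for root, _dirs, files in os.walk(cache_dir):
        for name in files:
            try:
                os.remove(os.path.join(root, name))
            except FileNotFoundError:
                # Something else already deleted this file; skip it.
                continue
            removed += 1
            if removed % batch_size == 0:
                # Let the server regenerate and write out a small number
                # of cache entries before the next batch is removed.
                time.sleep(pause_seconds)
    return removed
```

The pause gives the kernel time to flush the newly written cache files, so the amount of unwritten data stays small rather than arriving in one burst.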

We sincerely apologize for this incident. Don’t hesitate to let us know if you have any questions.