High load on web04 server May 9, 2013 (resolved)

The “web04” server experienced extremely high load for several minutes beginning at 8:00 AM Pacific time on May 9. Sites on this server were slow or unavailable as a result.

This was caused by a single site making “runaway” database queries that left almost no MySQL “cache” memory available for other queries. The problem has been resolved by suspending the site involved, and we are analyzing how to prevent anything similar from happening in the future.

We apologize for this incident; we take reliability seriously and strive to avoid this kind of problem.

Followup: We have made a technical change that will prevent this from recurring. Our MySQL servers are configured to write temporary tables to /dev/shm, which defaults to 12 GB in size (half of RAM) on our 24 GB RAM servers. This effectively allowed runaway queries to use up to 12 GB of RAM, emptying much of the server’s general file cache. We have lowered the maximum size of /dev/shm to 6 GB, so that the file cache can no longer be emptied out in this way and cause load spikes.
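
To make the arithmetic in the followup concrete, here is a minimal Python sketch of the worst case before and after the change. The function names are hypothetical and chosen for illustration only; this is not the tooling we use. It assumes the only variable is the /dev/shm cap, and it ignores memory used by MySQL itself and other processes, so the real amount available to the file cache is lower than these figures.

```python
import os

GIB = 1024 ** 3  # bytes in one gibibyte


def tmpfs_size_bytes(path="/dev/shm"):
    """Return the total capacity of the filesystem mounted at `path` (Unix only)."""
    st = os.statvfs(path)
    return st.f_frsize * st.f_blocks


def ram_left_if_tmpfs_full(total_ram_bytes, tmpfs_cap_bytes):
    """Worst-case RAM not consumed by the tmpfs if it fills completely.

    Illustrative upper bound only: it does not subtract memory used by
    MySQL's own buffers or any other process.
    """
    return max(total_ram_bytes - tmpfs_cap_bytes, 0)


def cap_fraction(total_ram_bytes, tmpfs_cap_bytes):
    """Fraction of physical RAM that a full tmpfs could occupy."""
    return tmpfs_cap_bytes / total_ram_bytes


# Figures taken from the followup: 24 GB of RAM, cap lowered from 12 GB to 6 GB.
before = ram_left_if_tmpfs_full(24 * GIB, 12 * GIB)
after = ram_left_if_tmpfs_full(24 * GIB, 6 * GIB)
```

With the old 12 GB cap, a runaway query could take up to half of RAM, leaving at most 12 GB for everything else. With the 6 GB cap, it can take at most a quarter, leaving up to 18 GB. On a Linux server, calling tmpfs_size_bytes() reports the cap currently in effect for /dev/shm.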