Outage on web12 server April 9, 2013 (resolved)

Between 12:50 PM and 1:23 PM Pacific time, service was intermittently unavailable or slow for sites and e-mail on the web12 server. In addition, customers on other servers may have seen brief delays or high load for about two minutes during this period.

This problem was triggered by a brief period of high network latency to some destinations. That caused a larger-than-usual number of PHP processes to start, which reduced the memory available for file system caching. With less cache, the server responded more slowly than usual, which caused even more PHP processes to start to handle the incoming requests. The result was a “vicious circle” that kept getting worse until we manually limited the number of PHP processes being started.
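The feedback loop described above can be illustrated with a toy model in Python. This is only a sketch under invented assumptions, not a reproduction of the real server: the function name `simulate`, the arrival rate, the `cache_limit` threshold, the slowdown formula, and the length of the latency spike are all hypothetical values chosen to make the effect visible.

```python
def simulate(ticks, max_procs=None, arrivals=30, cache_limit=40,
             spike=range(5, 8), spike_speed=0.2):
    """Return the PHP process count at each tick of a toy feedback model.

    All numbers are invented for illustration; none come from web12.
    Process counts are fractional because this is a continuous toy model.
    """
    pending = 0.0   # requests that have arrived but not finished
    history = []
    for t in range(ticks):
        pending += arrivals
        # One process per pending request, unless a cap is configured.
        procs = pending if max_procs is None else min(pending, max_procs)
        # Past cache_limit, processes crowd out the file system cache,
        # so every process gets slower as more of them start.
        speed = 1.0 if procs <= cache_limit else (cache_limit / procs) ** 2
        if t in spike:
            speed *= spike_speed  # requests stall on the slow network
        pending -= procs * speed
        history.append(procs)
    return history


uncapped = simulate(30)
capped = simulate(30, max_procs=40)
print("uncapped:", [round(p) for p in uncapped[::5]])
print("capped:  ", [round(p) for p in capped[::5]])
```

In this model the uncapped run keeps adding processes long after the three-tick latency spike has ended, because each new process slows down all the others. The capped run builds up a backlog during the spike, works through it at full speed, and returns to its normal process count.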

The web12 server appears to be more vulnerable to this problem than other servers because of its PHP script usage pattern. The number of PHP processes increased on all our servers, but only on web12 was the increase severe enough that the server couldn’t recover gracefully. We haven’t seen this particular issue happen before.

We are making immediate changes to the way PHP processes are started and limited to ensure this problem does not recur.

We sincerely apologize for this. We know you count on us for reliable service, and we are constantly striving to avoid this kind of problem.