Customers on the “flexo” server experienced a four-minute interruption in Web site service between 9:48 and 9:52 AM Pacific time this morning (August 12).
E-mail was not affected, and customers on other servers were not affected.
The problem occurred when the Apache Web server failed to respond to a “graceful reload” command that we issued while installing a “mod_security” update to block certain attacks against the WordPress blog software.
We are looking into the root cause of this incident and will take steps to prevent it from recurring. We don’t consider any kind of service interruption acceptable, and we sincerely apologize for the problem.
As we mentioned in an earlier post, someone attacked our network earlier this morning. Although we blocked the attack, we’ve also been working to identify who attacked our network and why. We now know the answer, and we are almost positive that the problem won’t recur.
Beginning at 2:16 AM Pacific time this morning, we experienced a “distributed denial of service” attack aimed at our “flexo” Web server.
The attack used more than 2 Gbps of network bandwidth from several thousand different IP addresses. This is an extremely high amount of traffic, enough to saturate even our network connections.
The problem caused most of our servers to become unreachable (or very slow) from the Internet.
We restored service to all servers except the flexo Web server at 2:59 AM (by getting our network providers to block all packets for certain IP addresses). We restored service to the flexo server at 3:29 AM (by getting them to identify and block specific characteristics of the attack).
All services are now operating normally, and all delayed incoming mail has been delivered.
We take reliability seriously. Unfortunately, this is by far the largest attack we’ve seen on our network in ten years. We sincerely regret and apologize for the impact this had on our customers.
At approximately 11:00 PM Pacific time on Friday, April 3, the “flexo”, “mom” and “elzar” servers will be restarted. As a result, Web site and e-mail service for some customers will be unavailable for approximately five minutes.
No e-mail will be lost, of course; incoming mail will just be slightly delayed.
We apologize for any inconvenience this may cause. This maintenance is necessary to install an updated “kernel” on our servers, as described in an earlier post.
Update: We’re also going to include the “zapp” server in this maintenance to replace a disk in the RAID array.
Update 2: The maintenance was completed with less than five minutes of “downtime”.
We recently had a server that twice “crashed” and needed to be manually restarted. We’ve identified the cause of that problem (an apparent bug in Linux kernel version 2.6.26) and made some changes to ensure that it doesn’t affect our customers again.
However, we didn’t find any information about this problem when searching the Internet, so we’re describing the details here in the hope that they help someone else.
The “flexo” Web server was unavailable between 9:54 and 10:02 PM Pacific time tonight, March 28. This resulted in an interruption of service for Web sites on that server. (Some e-mail activity was delayed, but no e-mail was lost.)
We sincerely apologize for this problem. We consider this type of failure to be unacceptable, and are looking into the cause of the problem so that we can take the appropriate steps to prevent it from happening again.
Update: The problem happened a second time on March 31 from 6:22 to 6:31 AM. However, the second incident gave our engineers enough details to determine the cause (which we’ve reported in a subsequent blog post), and we have made a technical change that will prevent it from happening again.