Brief scheduled maintenance on flexo server (completed)

At approximately 10:00 PM Pacific time tonight, October 23, the “flexo” Web server will be restarted.

As a result, Web site service and the ability to read incoming e-mail will be unavailable for approximately five minutes for customers on the “flexo” server only. Customers on other servers will not be affected.

High load on some servers (resolved)

Three of our Web hosting servers (amy, flexo, and leela) experienced high load earlier today that caused some customers to see “503 errors” on their Web sites for a few minutes.

This was caused by an upgrade to the eAccelerator PHP caching system that removed all of the cached files at once, something that doesn’t happen during normal operation.
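To illustrate why removing every cached file at once is costly (every script then has to be recompiled at the same moment), here is a minimal Python sketch of the alternative: clearing a cache directory in small batches with a pause between them. This is not how eAccelerator works internally and not necessarily the fix we applied; the function name `purge_in_batches`, the batch size, and the pause length are all hypothetical.

```python
import time
from pathlib import Path
from typing import Callable


def purge_in_batches(
    cache_dir: Path,
    batch_size: int = 100,
    pause_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Delete cached files a batch at a time instead of all at once.

    Hypothetical helper: returns the number of files it attempted to
    remove. Pausing between batches spreads out the work of rebuilding
    the cache rather than forcing it all to happen simultaneously.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # Snapshot the file list first so deletions don't disturb iteration.
    files = sorted(p for p in cache_dir.rglob("*") if p.is_file())
    removed = 0
    for start in range(0, len(files), batch_size):
        for path in files[start:start + batch_size]:
            # missing_ok: another process may already have removed the file.
            path.unlink(missing_ok=True)
            removed += 1
        # Pause only if another batch remains.
        if start + batch_size < len(files):
            sleep(pause_s)
    return removed
```

The `sleep` parameter exists only so the pause can be replaced in tests; in normal use the default `time.sleep` applies.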

The problem has been permanently resolved and will not recur.

Flexo server temporarily unavailable (resolved)

Customers on the “flexo” server experienced a four-minute interruption in Web site service between 9:48 and 9:52 AM Pacific time today (August 12).

E-mail was not affected, and customers on other servers were not affected.

The problem happened while we were installing a “mod_security” update to block certain attacks against the WordPress blog software: the Apache Web server did not respond to the “graceful reload” command that normally applies such an update without interrupting service.
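One general way to guard against a reload that never completes is to check that the server still answers afterward and fall back to a full restart if it doesn't. The Python sketch below shows that pattern under stated assumptions: it is not a description of our actual tooling, and `reload_with_fallback`, its parameters, and the timeout values are hypothetical. The three callables are supplied by the caller, so the sketch makes no claim about how the reload, health check, or restart are implemented.

```python
import time
from typing import Callable


def reload_with_fallback(
    graceful_reload: Callable[[], None],
    is_healthy: Callable[[], bool],
    full_restart: Callable[[], None],
    timeout_s: float = 30.0,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Request a graceful reload, then verify that the server responds.

    Returns "reloaded" if the health check passes before the timeout.
    Otherwise performs a full restart and returns "restarted".
    """
    graceful_reload()
    deadline = clock() + timeout_s
    while True:
        if is_healthy():
            return "reloaded"
        if clock() >= deadline:
            break
        sleep(poll_interval_s)

    # The server never became healthy: stop waiting and restart it.
    full_restart()
    return "restarted"
```

In practice, `graceful_reload` might wrap a command such as `apachectl graceful` and `is_healthy` might request a known page from the server, but those choices are left to the caller.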

We are looking into the root cause of this incident and will take steps to prevent it from recurring. We don’t consider any kind of service interruption acceptable, and we sincerely apologize for the problem.

Denial of service attack update

As we mentioned in an earlier post, someone attacked our network earlier this morning. Although we blocked the attack, we’ve also been working to identify who attacked our network and why. We now know the answer, and we are almost positive that the problem won’t recur.

Denial of service attack (resolved)

At 2:16 AM Pacific time today, we began experiencing a “distributed denial of service” attack aimed at our “flexo” Web server.

The attack used more than 2 Gbps of network bandwidth from several thousand different IP addresses. This is an extremely large amount of traffic, enough to saturate even our network connections.

The problem caused most of our servers to become unreachable (or very slow) from the Internet.

We restored service to all servers except the flexo Web server at 2:59 AM (by getting our network providers to block all packets for certain IP addresses). We restored service to the flexo server at 3:29 AM (by getting them to identify and block specific characteristics of the attack).

All services are now operating normally, and all delayed incoming mail has been delivered.

We take reliability seriously. Unfortunately, this is by far the largest attack we’ve seen on our network in ten years. We sincerely regret and apologize for the impact this had on our customers.

Brief scheduled maintenance Friday, April 3 (completed)

At approximately 11:00 PM Pacific time on Friday, April 3, the “flexo”, “mom”, and “elzar” servers will be restarted. As a result, Web site and e-mail service for some customers will be unavailable for approximately five minutes.

No e-mail will be lost, of course; incoming mail will just be slightly delayed.

We apologize for any inconvenience this may cause. This maintenance is necessary to install an updated “kernel” on our servers, as described in an earlier post.

Update: We’re also going to include the “zapp” server in this maintenance to replace a disk in the RAID array.

Update 2: The maintenance was completed with less than five minutes of “downtime”.

Avoiding a Linux kernel 2.6.26 cgroup bug

We recently had a server that “crashed” twice and needed to be restarted manually. We’ve identified the cause of that problem (an apparent bug in Linux kernel version 2.6.26) and made some changes to ensure that it doesn’t affect our customers again.

However, we didn’t find any information about this problem when searching the Internet, so we’re describing the details here in the hope that it helps someone else.

Flexo server temporarily unavailable (resolved)

The “flexo” Web server was unavailable between 9:54 and 10:02 PM Pacific time tonight, March 28. This resulted in an interruption of service for Web sites on that server. (Some e-mail activity was delayed, but no e-mail was lost.)

We sincerely apologize for this problem. We consider this type of failure to be unacceptable, and are looking into the cause of the problem so that we can take the appropriate steps to prevent it from happening again.

Update: The problem happened a second time on March 31 from 6:22 to 6:31 AM. However, the second incident gave our engineers enough details to determine the cause (which we’ve reported in a subsequent blog post), and we have made a technical change that will prevent it from happening again.