Old data center problems continuing (resolved)

We are continuing to experience intermittent problems at Hurricane Electric, the data center that we’ve been using until now. Fortunately, as previously mentioned, we are in the process of moving to a new data center.

Some customers have already been moved to the new data center, and they have not seen any of these problems. We are moving the remaining sites as quickly as possible.

We are extremely upset about the attacks that have been hitting Hurricane Electric, and consider these outages to be unacceptable. It appears that Hurricane Electric is being specifically targeted. You can watch their Twitter feed for information about the latest attacks, the steps they’re taking to stop them, and what they’re doing to protect themselves against further ones. While it’s undoubtedly of little comfort, it’s worth noting that these attacks are not directed at our customers or at us, but at Hurricane Electric in general. Other large companies hosted at Hurricane Electric, such as Linode, are also being affected.

We apologize for the inconvenience these outages have caused, and hope to have all Web sites moved to the new data center very soon.

Sites on “farnsworth” server moved to “zapp”

All Web sites on the “farnsworth” Web server have been moved to a new server named “zapp”.

This change was made for reliability; our monitoring systems detected potential hardware problems with the “farnsworth” server earlier today, and the sites were moved so it can be replaced before it causes any problems.

This doesn’t cause any downtime, and customers shouldn’t notice any change — but as always, don’t hesitate to contact us if you have any questions.

Zapp server added to brief scheduled maintenance (completed)

As we’ve already posted, some of our Web servers will be restarted tonight at 11 PM Pacific time.

We’re adding the “zapp” Web server to that list so we can replace a RAID array disk that caused a problem on that server earlier today.

Update: The maintenance was completed with less than five minutes of “downtime”.

Zapp server temporarily unavailable (resolved)

The “zapp” Web server was unavailable between 3:43 and 3:53 AM Pacific time this morning, April 4. This resulted in an interruption of service for Web sites on that server. (Some e-mail activity was delayed, but no e-mail was lost.)

We sincerely apologize for this problem. We consider this type of failure to be unacceptable, and are looking into the cause of the problem so that we can take the appropriate steps to prevent it from happening again.

Brief scheduled maintenance Friday, April 3 (completed)

At approximately 11:00 PM Pacific time on Friday, April 3, the “flexo”, “mom” and “elzar” servers will be restarted. As a result, Web site and e-mail service for some customers will be unavailable for approximately five minutes.

No e-mail will be lost, of course; incoming mail will just be slightly delayed.

We apologize for any inconvenience this may cause. This maintenance is necessary to install an updated “kernel” on our servers, as described in an earlier post.

Update: We’re also going to include the “zapp” server in this maintenance to replace a disk in the RAID array.

Update 2: The maintenance was completed with less than five minutes of “downtime”.

Brief power interruption for some servers (resolved)

This afternoon at 3:49 PM (Pacific time), one of the cabinets at our data center tripped a circuit breaker, causing all of the servers in that cabinet to lose power. Power was restored nine minutes later.

Customer Web sites on the calculon, lrrr, and zapp Web servers were unavailable during this time. The ability to send and receive e-mail was also interrupted (no mail was lost, of course). Other servers were not affected.

We pay close attention to the power load in each cabinet to avoid this sort of problem. The previously measured peak load of that cabinet had been 12 amps. Since the circuit allows 15 amps, this issue surprised us (we’ve been using the same setup in the same data center for seven years and this has never happened before). It appears that a combination of several servers experiencing unusually high CPU loads led to power usage beyond what we previously considered possible.
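As a rough illustration of the arithmetic above, the sketch below computes the headroom between the measured peak and the circuit limit, then shows how the combined draw can exceed the limit if several servers hit their individual maximums at the same moment. Only the 12 amp and 15 amp figures come from this post; the per-server numbers and server names are hypothetical and chosen purely for illustration.

```python
# Figures stated in the post.
CIRCUIT_LIMIT_AMPS = 15.0
MEASURED_PEAK_AMPS = 12.0


def headroom(load_amps, limit_amps=CIRCUIT_LIMIT_AMPS):
    """Amps remaining before the breaker limit; negative means over the limit."""
    return limit_amps - load_amps


def worst_case_load(per_server_max_amps):
    """Total draw if every server reaches its own maximum simultaneously."""
    return sum(per_server_max_amps.values())


# Hypothetical per-server maximum draws (amps); not taken from the post.
hypothetical_max_draw = {
    "server_a": 4.5,
    "server_b": 4.0,
    "server_c": 3.5,
    "server_d": 4.0,
}

print(headroom(MEASURED_PEAK_AMPS))             # 3.0 amps spare at the measured peak
worst_case = worst_case_load(hypothetical_max_draw)
print(worst_case, headroom(worst_case))         # 16.0 -1.0: over the limit
```

The point of the sketch is that a cabinet-wide peak measured over time can understate the worst case, because it only reflects load combinations that have actually occurred so far.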

We will take immediate steps to make sure the problem doesn’t happen again, and we sincerely apologize to customers who were affected by this incident.

Update 7:26 PM: We have removed a server from the cabinet in question, lowering the power use.

Update 10:38 PM: We have removed a second server from the cabinet, ensuring that power use is well below any level that could cause further trouble. The problem will not recur.

Zapp server temporarily unavailable (resolved)

The “zapp” Web server was unavailable between 8:20 and 8:40 AM Pacific time this morning. This resulted in an interruption of service for Web sites and e-mail on that server.

The problem was caused by a faulty hard disk in the RAID array (which theoretically shouldn’t cause a server to stop responding, but did). The hard disk has been removed from the array and will be replaced tonight at 10 PM. The server will be restarted at that time, resulting in about four minutes of additional downtime.

We sincerely apologize for this problem. We will be investigating the root cause: it’s normal for hard drives to fail — we expect that occasionally — but it shouldn’t cause such negative effects (normally the RAID array would prevent the failure of any single drive from causing the entire machine to fail).

E-mail, zapp, and lrrr servers temporarily unavailable (resolved)

Due to a failure of the power distribution unit (essentially a fancy power strip) in one of the cabinets at our data center, the following services became unavailable at 05:52 AM Pacific time:

(Other Web servers are not affected.) A data center technician is replacing the power unit in that cabinet and all systems should be back online within 15 minutes; we’ll update this post when that happens.

Update: The faulty hardware has been completely replaced. All servers are back online and functioning normally, and all queued e-mail has been delivered and is available for retrieval. The total outage for these servers was from 05:52 AM to 06:15 AM (Pacific time).

In addition, the FTP service on the “zapp” server was not fully working after it was restarted, so FTP publishing on that server was unavailable until shortly after 7:00 AM. This has been corrected (and the underlying problem that could cause incorrect startup was fixed).

We sincerely apologize to customers affected by this outage. This kind of issue has happened to us only once before in the last seven years (and that was with a different brand of power unit). Since the replacement power unit is brand new, we don’t expect the problem to recur.