Brief interruption to reading mail and using Webmail (resolved)

Around 11:26 AM Pacific time today, one of our mail servers experienced unusually high load, became unresponsive, and needed to be restarted. This affected our users’ ability to read e-mail and to use our Webmail system for several minutes.

Calculon server restarted (resolved)

The “calculon” Web server needed to be restarted at 1:36 Pacific time today, resulting in a five-minute interruption of service for Web sites and e-mail on that server.

Spam filtering problem (resolved)

Due to what appears to be a DNS issue at a third party, a small number of messages that weren’t actually spam may have been incorrectly blocked by our mail filters over the last few hours.

We’ve made changes to our system to ignore these errors, making sure no other messages will be blocked.
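This entry doesn’t describe how the change was implemented, so the following is only a minimal sketch of the general “fail open” idea: a message is blocked only when a blocklist lookup positively confirms a listing, and a lookup error is treated as “not listed”. All of the names here (BlocklistUnavailable, should_block, and the two stand-in lookup functions) are hypothetical and are not taken from our actual mail filters.

```python
from typing import Callable


class BlocklistUnavailable(Exception):
    """Hypothetical error: the blocklist could not be queried reliably."""


def should_block(domain: str, lookup: Callable[[str], bool]) -> bool:
    """Block only when the blocklist positively confirms a listing."""
    try:
        return lookup(domain)
    except BlocklistUnavailable:
        # Fail open: a broken lookup is not evidence of spam.
        return False


def healthy_lookup(domain: str) -> bool:
    # Stand-in for a real blocklist query; lists one made-up domain.
    return domain == "spam.example"


def broken_lookup(domain: str) -> bool:
    # Simulates the blocklist being unreachable or misbehaving.
    raise BlocklistUnavailable(domain)


print(should_block("spam.example", healthy_lookup))    # True
print(should_block("friend.example", healthy_lookup))  # False
print(should_block("friend.example", broken_lookup))   # False
```

The trade-off in a design like this is that a little spam may slip through while the third party is having trouble, in exchange for not rejecting legitimate mail.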

The number of affected messages was small enough that this wasn’t an issue for most customers. However, if someone tells you they sent a message that was initially blocked with an error message about “red.uribl.com”, but which later went through without problems, this problem was the cause.

We sincerely apologize to anyone who had trouble.

Brief power interruption for some servers (resolved)

This afternoon at 3:49 PM (Pacific time), one of the cabinets at our data center tripped a circuit breaker, causing all of the servers in that cabinet to lose power. Power was restored nine minutes later.

Customer Web sites on the calculon, lrrr, and zapp Web servers were unavailable during this time. The ability to send and receive e-mail was also interrupted (no mail was lost, of course). Other servers were not affected.

We pay close attention to the power load in each cabinet to avoid this sort of problem. The previously measured peak load of that cabinet had been 12 amps. Since the circuit allows 15 amps, this issue surprised us (we’ve been using the same setup in the same data center for seven years and this has never happened before). It appears that a combination of several servers experiencing unusually high CPU loads led to power usage beyond what we previously considered possible.
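To make the margin concrete, here is a small sketch that works out the headroom from the two figures given above (a measured peak of 12 amps on a 15 amp circuit). The function name and its return format are just illustrative assumptions, not part of any monitoring tool we use.

```python
def headroom(peak_amps: float, breaker_amps: float) -> tuple[float, float]:
    """Return (spare amps, fraction of the breaker rating in use)."""
    if breaker_amps <= 0:
        raise ValueError("breaker_amps must be positive")
    return breaker_amps - peak_amps, peak_amps / breaker_amps


# Figures from the paragraph above: 12 A measured peak, 15 A circuit.
spare, utilization = headroom(peak_amps=12, breaker_amps=15)
print(f"{spare:.0f} A spare, {utilization:.0%} of the circuit in use")
# prints "3 A spare, 80% of the circuit in use"
```

In other words, the cabinet’s load had to rise by more than 3 amps over its previously measured peak before the breaker would trip.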

We will take immediate steps to make sure the problem doesn’t happen again, and we sincerely apologize to customers who were affected by this incident.

Update 7:26 PM: We have removed a server from the cabinet in question, lowering the power use.

Update 10:38 PM: We have removed a second server from the cabinet, ensuring that power use is well below any level that could cause further trouble. The problem will not recur.

Temporary overload on “elzar” server (resolved)

Starting at 10:14 AM today, our elzar server experienced unexpectedly high load that made some processes on the server effectively unusable for about 10 minutes.

Web sites using scripts or databases on the elzar server may have seemed unresponsive during that time. Also, any customer hosted on elzar who was reading their e-mail during this time may have felt the system was slow or unresponsive (no e-mail was lost, of course).

Customers on other servers were not affected.

Mailman server problem this morning (resolved)

Between 4:58 and 5:39 AM Pacific time today (March 23), the server that runs our Mailman mailing list software encountered an internal problem. During most of this time, all Mailman-related functionality was unavailable.

Since Mailman mostly works via e-mail, no data was lost. Some messages might have been slightly delayed, but not for any longer than might normally be noticed with mail delivery via the Internet.

We apologize for any inconvenience that this might have caused!

Calculon server restarted (resolved)

The “calculon” Web server needed to be restarted at 10:14 AM Pacific time, resulting in a five-minute interruption of service for Web sites and e-mail on that server.

Zapp server temporarily unavailable (resolved)

The “zapp” Web server was unavailable between 8:20 and 8:40 Pacific time this morning. This resulted in an interruption of service for Web sites and e-mail on that server.

The problem was caused by a faulty hard disk in the RAID array (which theoretically shouldn’t cause a server to stop responding, but did). The hard disk has been removed from the array and will be replaced tonight at 10 PM. The server will be restarted at that time, resulting in about four minutes of additional downtime.

We sincerely apologize for this problem. We will be investigating the root cause: it’s normal for hard drives to fail — we expect that occasionally — but it shouldn’t cause such negative effects (normally the RAID array would prevent the failure of any single drive from causing the entire machine to fail).

Mail problem this morning (resolved)

Between 5:58 and 6:26 AM Pacific time today (March 12), a network problem on one of our mail servers prevented some customers from being able to read and send e-mail.

The issue has been resolved and everything is working normally. Although incoming mail was delayed, no mail was lost. Web site service was not affected.

The problem occurred because a debugging tool used by one of our technicians (“tcpdump”) can apparently cause network interface failures when used with certain options. This was not an issue we were previously aware of. We will avoid using the tool in that manner in the future, so the problem should not recur.

We regret the problem and sincerely apologize to our customers who were affected by this issue.

Brief scheduled maintenance on Saturday, March 1 (completed)

At approximately 11:00 PM Pacific time this Saturday night (March 1), all Tiger Technologies servers will be restarted. As a result, customer Web sites and e-mail service will be unavailable for three to five minutes.

No e-mail will be lost, of course; incoming mail will just be delayed for a few minutes.

This brief maintenance is necessary to upgrade the operating system’s “Linux kernel” to a newer version for security reasons. This was also done two weeks ago; unfortunately, our operating system vendor has released an even newer kernel since then. It doesn’t usually happen this often.

We apologize for the inconvenience this causes.

(This maintenance was also successfully completed with less than four minutes of downtime per server.)