Mail problem this morning (resolved)

Between 5:58 and 6:26 AM Pacific time today (March 12), a network problem on one of our mail servers prevented some customers from reading and sending e-mail.

The issue has been resolved and everything is working normally. Although incoming mail was delayed, no mail was lost. Web site service was not affected.

The problem occurred because a debugging tool used by one of our technicians (“tcpdump”) can apparently cause network interface failures when used with certain options. This was not an issue we were previously aware of. We will avoid using the tool in that manner in the future, so the problem should not recur.

We regret the problem and sincerely apologize to our customers who were affected by this issue.

Brief scheduled maintenance on Saturday, March 1 (completed)

At approximately 11:00 PM Pacific time this Saturday night (March 1), all Tiger Technologies servers will be restarted. As a result, customer Web sites and e-mail service will be unavailable for three to five minutes.

No e-mail will be lost, of course; incoming mail will just be delayed for a few minutes.

This brief maintenance is necessary to upgrade the operating system’s Linux kernel to a newer version for security reasons. This was also done two weeks ago; unfortunately, our operating system vendor has released an even newer kernel since then. It doesn’t usually happen this often.

We apologize for the inconvenience this causes.

(This maintenance was also successfully completed with less than four minutes of downtime per server.)

Brief scheduled maintenance on Saturday, February 16 (completed)

At approximately 11:00 PM Pacific time this Saturday night (February 16), all Tiger Technologies servers will be restarted. As a result, customer Web sites and e-mail service will be unavailable for three to five minutes.

No e-mail will be lost, of course; incoming mail will just be delayed for a few minutes.

This brief maintenance is necessary to upgrade the operating system’s Linux kernel to a newer version for security reasons. We apologize for the inconvenience this causes.

(This maintenance was successfully completed with less than four minutes of downtime per server.)

Network outage followup

This is a followup to last night’s post about a network outage.

The root cause of the problem was the failure of an Ethernet switch at our data center. The switch was the one that our network cables actually plug into to connect to the Internet. Unfortunately, it’s one of the few pieces of the network infrastructure that’s not automatically redundant: although the “other side” of the switch is connected to multiple fully redundant upstream paths to the Internet, the side of it that goes to our server cabinets effectively has a single connection for each group of servers.

When the switch failed, the data center staff replaced it with a new spare one. Because the faulty hardware was completely replaced, the problem is properly solved, and we don’t expect it to recur.

Unscheduled network outage (resolved)

Between 9:52 and 11:06 PM Pacific time on January 10, a complete network failure at our primary data center caused an unscheduled outage that resulted in all services (all Web sites and e-mail) being unreachable from the Internet.

This problem has been resolved and all services are now available. We are waiting for a full report from the data center personnel so that we can determine the cause and ensure that it won’t recur.

We sincerely apologize to our customers who were affected by this. This kind of outage is not normal (it’s the longest outage we’ve experienced in more than four years), and we know it’s not acceptable to our customers who rely on our services. We’ll post a followup message with more details when they become available.

Update Friday 10 AM: As a clarification, we should also have originally mentioned that no e-mail is lost during this kind of outage: it’s delivered after the issue is resolved. While some messages were certainly delayed, they were all properly delivered afterward.

Incoming mail problem (resolved)

From 12:51 to 12:54 PM Pacific time today, one of our inbound mail servers (mx2.tigertech.net) incorrectly rejected some incoming mail that wouldn’t normally have been rejected, due to a configuration problem. This resulted in a small handful of messages being returned to the sender instead of properly delivered.

FTP server restarted

A problem with the NFS automounter software on our main FTP server (which also serves our own www.tigertech.net Web site) caused us to restart that server at about 2:30 PM Pacific time. This caused a brief interruption in service for customers uploading files via FTP or using our control panel. (The problem didn’t directly affect any customer Web sites or e-mail service.)

Everything is now working normally. Again, we apologize to anyone who was inconvenienced by this issue — this just hasn’t been our day.

PHP problem on “elzar” server (resolved)

As a result of an error on our part, a small handful of PHP 4 scripts on the “elzar” Web server may have displayed an error message or a blank page for up to 14 minutes today (from 12:48 to 1:02 PM Pacific time in the worst case). The problem has been resolved for any customers who were affected.

This happened because of a mistake we made in an upgrade to our sitewide PHP4/FastCGI configuration file, which our pre-upgrade testing failed to detect. We have added a new check to our automated testing system to ensure this cannot happen again.

We sincerely apologize to any customers affected by this problem.

Brief scheduled maintenance on Saturday, October 13 (completed)

The “farnsworth” and “lrrr” Web hosting servers will be restarted at approximately 11:00 PM Pacific time on Saturday, October 13, and customer Web sites on those two servers will be unavailable for approximately five minutes. (See “Which server is my account on?” if you aren’t sure.) E-mail service, and customers on all other Web servers, will not be affected at all.

The restart is necessary so we can increase the memory (RAM) on those two servers to 4 GB, as we described here. After this, all our hosting servers will have 4 GB of memory.

By comparison with many other reasonably priced hosting companies, we keep the load on our servers pretty low to start with, so 4 GB is “overkill” probably 99.99% of the time — but we want to cover the other .01%. (Our unofficial motto should probably be something like “Tiger Technologies: We’re paranoid so you don’t have to be.”)

We apologize for any inconvenience this may cause.

Update: Note that the maintenance time has been changed from 10:00 PM to 11:00 PM.

Mail monitoring now checks AT&T and Yahoo

In an earlier post, we talked about how we use a monitoring system that forwards test e-mail to other large ISPs, then checks to make sure the message was promptly delivered.

We already check delivery to AOL, Comcast, GMail and Verizon, and we’ve now added AT&T/SBCGlobal and Yahoo mail. We’ll continue to expand it in the future.

An extensive monitoring and alert system is at the heart of our reliability, really; it’s what allows us to know that things are working properly. We can guarantee that if our customers send mail to those ISPs right now, it’s being delivered. That’s something few other companies even bother to check.
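
For readers curious how a check like this can work, here is a minimal Python sketch of the general idea: send a uniquely tagged test message to each ISP mailbox, then poll until it shows up or a deadline passes. It is an illustration only, not our actual monitoring code. The function names, the mailbox addresses, and the five-minute deadline are all assumptions, and the `send` and `was_delivered` callables stand in for whatever really sends the mail and inspects the remote mailbox.

```python
import time
import uuid
from typing import Callable, Dict


def check_delivery(
    send: Callable[[str, str], None],
    was_delivered: Callable[[str, str], bool],
    mailbox: str,
    deadline_seconds: float = 300.0,  # assumed threshold for "promptly"
    poll_seconds: float = 10.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Send one tagged test message and report whether it arrived in time."""
    # A unique token lets us recognize this exact message at the far end.
    token = uuid.uuid4().hex
    send(mailbox, token)
    start = clock()
    while True:
        if was_delivered(mailbox, token):
            return True
        if clock() - start >= deadline_seconds:
            return False  # not seen before the deadline: raise an alert
        sleep(poll_seconds)


def check_all(
    mailboxes: Dict[str, str],
    send: Callable[[str, str], None],
    was_delivered: Callable[[str, str], bool],
    **options,
) -> Dict[str, bool]:
    """Run the check for every ISP; maps ISP name to True (ok) or False (late)."""
    return {
        isp: check_delivery(send, was_delivered, address, **options)
        for isp, address in mailboxes.items()
    }
```

In practice `send` would hand the message to the outgoing mail server and `was_delivered` would look for the token in the test mailbox at the other ISP. Passing them in as parameters keeps the sketch independent of any particular mail library.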